gah, sorry about that :(

thumb_stats.csv.gz is:
thumbnail size, bytes used, number of thumbs

thumb_stats_month.csv.gz is:
year-month, thumbnail size, bytes used

the year-month date is taken from swift's last_modified field for the
thumbnail itself
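
In case it helps with reading the dumps, here is a minimal Python sketch that attaches names to the columns described above. The column names (thumb_size, bytes_used, thumb_count, year_month) are made up for illustration, and it assumes the thumbnail size is a plain integer pixel width and that the byte counts are plain integers; adjust if the actual files differ. The 1280 threshold is the one used elsewhere in this thread.

```python
import csv
import gzip

# Hypothetical column names, taken from the description above;
# the files themselves have no header row.
STATS_COLUMNS = ["thumb_size", "bytes_used", "thumb_count"]
MONTH_COLUMNS = ["year_month", "thumb_size", "bytes_used"]


def read_rows(lines, columns):
    """Yield one dict per non-empty CSV row, keyed by the given column names."""
    for row in csv.reader(lines):
        if row:
            yield dict(zip(columns, row))


def bytes_by_threshold(rows, threshold=1280):
    """Return (bytes for sizes <= threshold, bytes for sizes > threshold)."""
    small = large = 0
    for row in rows:
        # Assumes integer pixel widths and integer byte counts.
        size = int(row["thumb_size"])
        used = int(row["bytes_used"])
        if size > threshold:
            large += used
        else:
            small += used
    return small, large


def load_gz(path, columns):
    """Read a gzipped, headerless CSV file into a list of dicts."""
    with gzip.open(path, "rt", newline="") as fh:
        return list(read_rows(fh, columns))
```

Something like bytes_by_threshold(load_gz("thumb_stats.csv.gz", STATS_COLUMNS)) would then give the two totals, before accounting for swift replication.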

HTH!
filippo


On Wed, Aug 13, 2014 at 5:25 PM, Neil Kandalgaonkar <[email protected]> wrote:

>  Those CSVs don't have column names -- what do they represent?
>
>
>
> On 8/13/14, 7:24 AM, Gilles Dubuc wrote:
>
>  Just to get a sense of the scale, "non-standard" sizes > 1280 represent
> approximately 2 TB of Swift storage at the moment. And all sizes <= 1280
> (where we can't tell "non-standard/standard" apart) represent approximately
> 16 TB. As for "standard sizes" > 1280, they total around 1.6 TB.
>
> It's hard to estimate how much we're looking to save on sizes < 1280 due
> to the issue I've described earlier. But it's probably something expressed
> in terabytes.
>
> Filippo told me that the space I've just mentioned doesn't take into
> account the swift replication (currently 3 copies). Which means that we're
> currently talking about three times as much physical storage space.
>
> I've looked at the amount of hits for sizes > 1280 and "non-standard"
> thumbnails are viewed 3.3 times less than "standard ones". That means some
> strange sizes are getting a decent amount of traffic, but I haven't looked
> at the distribution yet to see if there are some sizes that clearly stand
> out and might be "standard" sizes which we don't know about lurking in
> there.
>
> I've attached Filippo's CSV dumps, so that everyone can have fun at home
> extracting meaning from that data.
>
>  For reference, this is the list of "standard" sizes we've come up with,
> by hunting for various areas of the code that govern thumbnail sizes served:
>
>
>
>
> On Wed, Aug 13, 2014 at 12:59 PM, Gilles Dubuc <[email protected]>
> wrote:
>
>>   The context is that Filippo from Ops would like to run a regular
>> cleanup job that deletes thumbnails from swift that have
>> non-mediawiki-requested sizes, when they haven't been accessed for X amount
>> of time. Currently we keep all thumbnails forever.
>>
>>  The idea is that 3rd-party tools requesting odd sizes would result in
>> less storage space used, as what they request would be deleted after a
>> while. This would be accompanied by documentation for developers
>> indicating that best performance is obtained when using a predefined set of
>> sizes currently in use by the various tools in production (core,
>> extensions, mobile apps and sites, etc.).
>>
>> This is an interim solution while we still store thumbnails on swift,
>> which in itself is something we want to change in the future.
>>
>>
>>  - we want to use less storage space
>>>
>>
>>  Yes
>>
>>
>> - images we are generating and caching for not-Wikipedia should be the
>>> first to go
>>>
>>
>>  Yes. More accurately, images we are currently generating for unknown
>> 3rd parties requesting unusual sizes.
>>
>>
>>  - we assume weird sizes are from not-Wikipedia. So let's cache them for
>>> less time
>>>
>>
>>  Either they are coming from unknown 3rd parties, or from defunct code.
>> And yes, the idea is to keep them in swift for a period, instead of keeping
>> them in swift forever.
>>
>>
>>  - except, that doesn't work, because of tall images
>>>
>>
>>  We can't differentiate requests coming from core's file page for tall
>> images from odd sizes for anything below 1280px width. Above that, it's a
>> lot easier to tell the difference between code we run and 3rd parties.
>> Which means that we're probably already going to see some significant
>> storage savings. In fact Filippo has given me figures from production, I
>> just have to compile them to know how much storage we're talking about.
>> I'll do that soon and it will be a good opportunity to see how much we're
>> "missing out" due to the <1280 tall images case.
>>
>>
>> - so maybe we should change the image request format?
>>>
>>
>>  If the thumbnail URL format could specify a height in addition to a
>> width, we could keep the existing file page behavior and differentiate
>> "ours vs theirs" thumbnail requests for sizes below 1280px. It would be a
>> lot of work, so we have to see if it's worth it.
>>
>>
>> - If you want to prioritize Wiki[mp]edia thumbnails, why not use the
>>> referrer header instead? Why use the width parameter to detect this?
>>>
>>
>>  Referrer is unreliable in the real world.  Browsers can suppress it, so
>> can proxies, etc. The width parameter doesn't tell us the source. If we
>> receive a request for "469" width, we can't tell if it's coming from a 3rd
>> party or a visitor of the file page for an image which is for example 469px
>> wide and 1024px tall.
>>
>>
>> - Are we sure we'll improve overall performance by evicting certain files
>>> from cache quicker? Why not trust the LRU cache algorithm?
>>>
>>
>>  Performance, no, but storage space yes. The idea is that the
>> performance impact would be limited to clients requesting weird image
>> sizes. I don't think we have an LRU option to speak of; it would be a job
>> written by Ops.
>>
>>  - as maintainers of the wikimedia media file servers, we want to reduce
>>> the number of images cached in order to save storage space and cost?
>>>
>>
>>  Yes, and in particular this would allow us to use the existing capacity
>> for more useful purposes, such as pre-generating all expected thumbnail
>> sizes at upload time. That means "official" clients, and clients sticking
>> to the extensive list of sizes we'll support, would never hit a thumbnail
>> size that needs to be generated on the fly.
>>
>>
>> is it possible to cache based on a last accessed timestamp?
>>>
>>
>>   When we move away from swift, this is exactly what we want to set up.
>> Although it would be interesting to contemplate making exceptions for
>> widely used sizes. What I'm describing is a temporary solution while we
>> still live in the thumbnails-on-swift status quo.
>>
>>
>> - if an image size has not been accessed within x number of days purge it
>>> from the cache
>>>
>>
>>  Basically this is an attempt to do this on swift, while not touching
>> sizes that we know are requested by a lot of clients.
>>
>>
>> On Wed, Aug 13, 2014 at 12:35 PM, dan-nl <[email protected]>
>> wrote:
>>
>>> what is the main use case?
>>>
>>> - as maintainers of the wikimedia media file servers, we want to reduce
>>> the number of images cached in order to save storage space and cost?
>>>
>>> - and/or something else?
>>>
>>>
>>> is it possible to cache based on a last accessed timestamp?
>>>
>>> - if an image size has not been accessed within x number of days purge
>>> it from the cache
>>>
>>>
>>> with kind regards,
>>> dan
>>>
>>>
>>> On Aug 13, 2014, at 11:18 , Neil Kandalgaonkar <[email protected]> wrote:
>>>
>>> > I think I need more context. Is this what you're saying?
>>> >
>>> > - we want to use less storage space
>>> > - images we are generating and caching for not-Wikipedia should be the
>>> first to go
>>> > - we assume weird sizes are from not-Wikipedia. So let's cache them
>>> for less time
>>> > - except, that doesn't work, because of tall images
>>> > - so maybe we should change the image request format?
>>> >
>>> > If this is accurate I have a few questions:
>>> > - If you want to prioritize Wiki[mp]edia thumbnails, why not use the
>>> referrer header instead? Why use the width parameter to detect this?
>>> > - Are we sure we'll improve overall performance by evicting certain
>>> files from cache quicker? Why not trust the LRU cache algorithm?
>>> >
>>> >
>>> >
>>> > On 8/13/14, 1:36 AM, Gilles Dubuc wrote:
>>> >> Currently the file page provides a set of different image sizes for
>>> the user to directly access. These sizes are usually width-based. However,
>>> for tall images they are height-based. The thumbnail urls, which are used
>>> to generate them, pass only a width.
>>> >>
>>> >> What this means is that tall images end up with arbitrary thumbnail
>>> widths that don't follow the set of sizes meant for the file page. The end
>>> result from an ops perspective is that we end up with very diverse widths
>>> for thumbnails. Not a problem in itself, but the exposure of these
>>> random-ish widths on the file page means that we can't set a different
>>> caching policy for non-standard widths without affecting the images linked
>>> from the file page.
>>> >>
>>> >> I see two solutions to this problem, if we want to introduce
>>> different caching tiers for thumbnail sizes that come from mediawiki and
>>> thumbnail sizes that were requested by other things.
>>> >>
>>> >> The first one would be to always keep the size progression on the
>>> file page width-bound, even for soft-rotated images. The first drawback of
>>> this is that for very skinny/very wide images the file size progression
>>> between the sizes could become steep. The second drawback is that we'd
>>> often offer fewer size options, because they'd be based on the smallest
>>> dimension.
>>> >>
>>> >> The second option would be to change the syntax of the thumbnail urls
>>> in order to allow a height constraint. This is a pretty scary change.
>>> >>
>>> >> If we don't do anything, it simply means that we'll have to apply the
>>> same caching policy to every size smaller than 1280. We could already save
>>> quite a bit of storage space by evicting non-standard sizes larger than
>>> that, but sizes lower than 1280 would have to stay the way they are now.
>>> >>
>>> >> Thoughts?
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> Multimedia mailing list
>>> >>
>>> >> [email protected]
>>> >> https://lists.wikimedia.org/mailman/listinfo/multimedia
>>> >
>>> >
>>> > --
>>> > Neil Kandalgaonkar (|
>>> > <[email protected]>
>>>
>>>
>>>
>>
>>
>
>
>
>
>
> --
> Neil Kandalgaonkar (| <[email protected]>
>
>
