Forgot to copy-paste the list, here it is: 120, 150, 180, 200, 220, 240, 250, 256, 300, 320, 360, 480, 512, 600, 640, 800, 1024, 1280, 1920, 2048, 2560, 2880, 4096
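
To make the proposed policy concrete, here is a minimal Python sketch of how a cleanup job might decide whether a thumbnail is a deletion candidate. It is only an illustration under stated assumptions: the function name and the `max_idle` parameter are hypothetical (the thread only says "X amount of time"), and nothing here reflects how Ops would actually write the job against swift.

```python
from datetime import datetime, timedelta

# "Standard" widths, copied from the list above.
STANDARD_WIDTHS = frozenset({
    120, 150, 180, 200, 220, 240, 250, 256, 300, 320, 360, 480,
    512, 600, 640, 800, 1024, 1280, 1920, 2048, 2560, 2880, 4096,
})

# At or below this width, tall images on the file page produce arbitrary
# widths, so "standard" and "non-standard" requests cannot be told apart.
AMBIGUOUS_MAX_WIDTH = 1280


def is_cleanup_candidate(width, last_access, now, max_idle):
    """Hypothetical check: may this thumbnail be deleted by the cleanup job?

    max_idle stands in for the unspecified "X amount of time".
    """
    if width <= AMBIGUOUS_MAX_WIDTH:
        return False  # cannot tell ours from theirs, keep as today
    if width in STANDARD_WIDTHS:
        return False  # known size requested by code we run
    return now - last_access > max_idle
```

For example, with a placeholder `max_idle` of 90 days, a 1337px-wide thumbnail last accessed 100 days ago would be a candidate, while a 1920px or a 469px one would not.
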
On Wed, Aug 13, 2014 at 4:24 PM, Gilles Dubuc <[email protected]> wrote:

> Just to get a sense of the scale, "non-standard" sizes > 1280 represent
> approximately 2 TB of Swift storage at the moment. And all sizes <= 1280
> (where we can't tell "non-standard/standard" apart) represent approximately
> 16 TB. As for "standard sizes" > 1280, they total around 1.6 TB.
>
> It's hard to estimate how much we're looking to save on sizes < 1280 due
> to the issue I've described earlier. But it's probably something expressed
> in terabytes.
>
> Filippo told me that the space I've just mentioned doesn't take into
> account the swift replication (currently 3 copies). Which means that we're
> currently talking about three times as much physical storage space.
>
> I've looked at the amount of hits for sizes > 1280 and "non-standard"
> thumbnails are viewed 3.3 times less than "standard ones". That means some
> strange sizes are getting a decent amount of traffic, but I haven't looked
> at the distribution yet to see if there are some sizes that clearly stand
> out and might be "standard" sizes which we don't know about lurking in
> there.
>
> I've attached Filippo's CSV dumps, so that everyone can have fun at home
> extracting meaning from that data.
>
> For reference, this is the list of "standard" sizes we've come up with, by
> hunting for various areas of the code that govern thumbnail sizes served:
>
> On Wed, Aug 13, 2014 at 12:59 PM, Gilles Dubuc <[email protected]>
> wrote:
>
>> The context is that Filippo from Ops would like to run a regular cleanup
>> job that deletes thumbnails from swift that have non-mediawiki-requested
>> sizes, when they haven't been accessed for X amount of time. Currently we
>> keep all thumbnails forever.
>>
>> The idea is that 3rd-party tools requesting odd sizes would result in less
>> storage space used, as what they request would be deleted after a while.
>> This would be accompanied by documentation for developers indicating
>> that best performance is obtained when using a predefined set of sizes
>> currently in use by the various tools in production (core, extensions,
>> mobile apps and sites, etc.).
>>
>> This is an interim solution while we still store thumbnails on swift,
>> which in itself is something we want to change in the future.
>>
>>> - we want to use less storage space
>>
>> Yes
>>
>>> - images we are generating and caching for not-Wikipedia should be the
>>> first to go
>>
>> Yes. More accurately, images we are currently generating for unknown 3rd
>> parties requesting unusual sizes.
>>
>>> - we assume weird sizes are from not-Wikipedia. So let's cache them for
>>> less time
>>
>> Either they are coming from unknown 3rd parties, or from defunct code.
>> And yes, the idea is to keep them in swift for a period, instead of keeping
>> them in swift forever.
>>
>>> - except, that doesn't work, because of tall images
>>
>> We can't differentiate requests coming from core's file page for tall
>> images from odd sizes for anything below 1280px width. Above that, it's a
>> lot easier to tell the difference between code we run and 3rd parties.
>> Which means that we're probably already going to see some significant
>> storage savings. In fact Filippo has given me figures from production, I
>> just have to compile them to know how much storage we're talking about.
>> I'll do that soon and it will be a good opportunity to see how much we're
>> "missing out" due to the <1280 tall images case.
>>
>>> - so maybe we should change the image request format?
>>
>> If the thumbnail url format could be done by height in addition to width,
>> we could keep the existing file page behavior and differentiate "ours vs
>> theirs" thumbnail requests for sizes below 1280px. It would be a lot of
>> work, we have to see if it's worth it.
>>
>>> - If you want to prioritize Wiki[mp]edia thumbnails, why not use the
>>> referrer header instead? Why use the width parameter to detect this?
>>
>> Referrer is unreliable in the real world. Browsers can suppress it, so
>> can proxies, etc. The width parameter doesn't tell us the source. If we
>> receive a request for "469" width, we can't tell if it's coming from a 3rd
>> party or a visitor of the file page for an image which is for example 469px
>> wide and 1024px tall.
>>
>>> - Are we sure we'll improve overall performance by evicting certain files
>>> from cache quicker? Why not trust the LRU cache algorithm?
>>
>> Performance, no, but storage space yes. The idea is that the performance
>> impact would only be limited to clients requesting weird image sizes. I
>> don't think we have an LRU option to speak of, it would be a job written by
>> Ops.
>>
>>> - as maintainers of the wikimedia media file servers, we want to reduce
>>> the number of images cached in order to save storage space and cost?
>>
>> Yes, and in particular this would allow us to use the existing capacity
>> for more useful purposes, such as pre-generating all expected thumbnail
>> sizes at upload time. Meaning that "official" clients, or clients
>> sticking to the extensive list of sizes we'll support, will never hit a
>> thumbnail size that needs to be generated on the fly.
>>
>>> is it possible to cache based on a last accessed timestamp?
>>
>> When we move away from swift, this is exactly what we want to set up.
>> Although it would be interesting to contemplate making exceptions for
>> widely used sizes. What I'm describing is a temporary solution while we
>> still live in the thumbnails-on-swift status quo.
>>
>>> - if an image size has not been accessed within x number of days purge it
>>> from the cache
>>
>> Basically this is an attempt to do this on swift, while not touching
>> sizes that we know are requested by a lot of clients.
>>
>> On Wed, Aug 13, 2014 at 12:35 PM, dan-nl <[email protected]>
>> wrote:
>>
>>> what is the main use case?
>>>
>>> - as maintainers of the wikimedia media file servers, we want to reduce
>>> the number of images cached in order to save storage space and cost?
>>>
>>> - and/or something else?
>>>
>>> is it possible to cache based on a last accessed timestamp?
>>>
>>> - if an image size has not been accessed within x number of days purge
>>> it from the cache
>>>
>>> with kind regards,
>>> dan
>>>
>>> On Aug 13, 2014, at 11:18 , Neil Kandalgaonkar <[email protected]> wrote:
>>>
>>> > I think I need more context. Is this what you're saying?
>>> >
>>> > - we want to use less storage space
>>> > - images we are generating and caching for not-Wikipedia should be the
>>> >   first to go
>>> > - we assume weird sizes are from not-Wikipedia. So let's cache them
>>> >   for less time
>>> > - except, that doesn't work, because of tall images
>>> > - so maybe we should change the image request format?
>>> >
>>> > If this is accurate I have a few questions:
>>> > - If you want to prioritize Wiki[mp]edia thumbnails, why not use the
>>> >   referrer header instead? Why use the width parameter to detect this?
>>> > - Are we sure we'll improve overall performance by evicting certain
>>> >   files from cache quicker? Why not trust the LRU cache algorithm?
>>> >
>>> > On 8/13/14, 1:36 AM, Gilles Dubuc wrote:
>>> >> Currently the file page provides a set of different image sizes for
>>> >> the user to directly access. These sizes are usually width-based.
>>> >> However, for tall images they are height-based. The thumbnail urls,
>>> >> which are used to generate them, pass only a width.
>>> >>
>>> >> What this means is that tall images end up with arbitrary thumbnail
>>> >> widths that don't follow the set of sizes meant for the file page. The
>>> >> end result from an ops perspective is that we end up with very diverse
>>> >> widths for thumbnails.
>>> >> Not a problem in itself, but the exposure of these random-ish widths on
>>> >> the file page means that we can't set a different caching policy for
>>> >> non-standard widths without affecting the images linked from the file
>>> >> page.
>>> >>
>>> >> I see two solutions to this problem, if we want to introduce different
>>> >> caching tiers for thumbnail sizes that come from mediawiki and thumbnail
>>> >> sizes that were requested by other things.
>>> >>
>>> >> The first one would be to always keep the size progression on the file
>>> >> page width-bound, even for soft-rotated images. The first drawback of
>>> >> this is that for very skinny/very wide images the file size progression
>>> >> between the sizes could become steep. The second drawback is that we'd
>>> >> often offer fewer size options, because they'd be based on the smallest
>>> >> dimension.
>>> >>
>>> >> The second option would be to change the syntax of the thumbnail urls
>>> >> in order to allow a height constraint. This is a pretty scary change.
>>> >>
>>> >> If we don't do anything, it simply means that we'll have to apply the
>>> >> same caching policy to every size smaller than 1280. We could already
>>> >> save quite a bit of storage space by evicting non-standard sizes larger
>>> >> than that, but sizes lower than 1280 would have to stay the way they
>>> >> are now.
>>> >>
>>> >> Thoughts?
>>> >>
>>> >> _______________________________________________
>>> >> Multimedia mailing list
>>> >> [email protected]
>>> >> https://lists.wikimedia.org/mailman/listinfo/multimedia
>>> >
>>> > --
>>> > Neil Kandalgaonkar (|
>>> > <[email protected]>
