Those CSVs don't have column names -- what do they represent?
On 8/13/14, 7:24 AM, Gilles Dubuc wrote:
> Just to get a sense of the scale, "non-standard" sizes > 1280
> represent approximately 2 TB of Swift storage at the moment. And all
> sizes <= 1280 (where we can't tell "non-standard/standard" apart)
> represent approximately 16 TB. As for "standard sizes" > 1280, they
> total around 1.6 TB.
>
> It's hard to estimate how much we're looking to save on sizes < 1280
> due to the issue I've described earlier. But it's probably something
> expressed in terabytes.
>
> Filippo told me that the space I've just mentioned doesn't take into
> account the swift replication (currently 3 copies). Which means that
> we're currently talking about three times as much physical storage space.
>
> I've looked at the amount of hits for sizes > 1280 and "non-standard"
> thumbnails are viewed 3.3 times less than "standard ones". That means
> some strange sizes are getting a decent amount of traffic, but I
> haven't looked at the distribution yet to see if there are some sizes
> that clearly stand out and might be "standard" sizes which we don't
> know about lurking in there.
>
> I've attached Filippo's CSV dumps, so that everyone can have fun at
> home extracting meaning from that data.
>
> For reference, this is the list of "standard" sizes we've come up
> with, by hunting for various areas of the code that govern thumbnail
> sizes served:
>
> On Wed, Aug 13, 2014 at 12:59 PM, Gilles Dubuc <[email protected]> wrote:
>
>     The context is that Filippo from Ops would like to run a regular
>     cleanup job that deletes thumbnails from swift that have
>     non-mediawiki-requested sizes, when they haven't been accessed for
>     X amount of time. Currently we keep all thumbnails forever.
>
>     The idea is that 3rd-party tools requesting odd sizes would result
>     in less storage space used, as what they request would be deleted
>     after a while.
>     This would be accompanied by documentation
>     towards developers indicating that best performance is obtained
>     when using a predefined set of sizes currently in use by the
>     various tools in production (core, extensions, mobile apps and
>     sites, etc.).
>
>     This is an interim solution while we still store thumbnails on
>     swift, which in itself is something we want to change in the future.
>
>         - we want to use less storage space
>
>     Yes
>
>         - images we are generating and caching for not-Wikipedia
>         should be the first to go
>
>     Yes. More accurately, images we are currently generating for
>     unknown 3rd parties requesting unusual sizes.
>
>         - we assume weird sizes are from not-Wikipedia. So let's cache
>         them for less time
>
>     Either they are coming from unknown 3rd parties, or from defunct
>     code. And yes, the idea is to keep them in swift for a period,
>     instead of keeping them in swift forever.
>
>         - except, that doesn't work, because of tall images
>
>     We can't differentiate requests coming from core's file page for
>     tall images from odd sizes for anything below 1280px width. Above
>     that, it's a lot easier to tell the difference between code we run
>     and 3rd parties. Which means that we're probably already going to
>     see some significant storage savings. In fact Filippo has given me
>     figures from production, I just have to compile them to know how
>     much storage we're talking about. I'll do that soon and it will be
>     a good opportunity to see how much we're "missing out" due to the
>     <1280 tall images case.
>
>         - so maybe we should change the image request format?
>
>     If the thumbnail url format could be done by height in addition to
>     width, we could keep the existing file page behavior and
>     differentiate "ours vs theirs" thumbnail requests for sizes below
>     1280px. It would be a lot of work, we have to see if it's worth it.
>
>         - If you want to prioritize Wiki[mp]edia thumbnails, why not
>         use the referrer header instead?
>         Why use the width parameter
>         to detect this?
>
>     Referrer is unreliable in the real world. Browsers can suppress
>     it, so can proxies, etc. The width parameter doesn't tell us the
>     source. If we receive a request for "469" width, we can't tell if
>     it's coming from a 3rd party or a visitor of the file page for an
>     image which is for example 469px wide and 1024px tall.
>
>         - Are we sure we'll improve overall performance by evicting
>         certain files from cache quicker? Why not trust the LRU cache
>         algorithm?
>
>     Performance, no, but storage space yes. The idea is that the
>     performance impact would only be limited to clients requesting
>     weird image sizes. I don't think we have an LRU option to speak of,
>     it would be a job written by Ops.
>
>         - as maintainers of the wikimedia media file servers, we want
>         to reduce the number of images cached in order to save storage
>         space and cost?
>
>     Yes, and in particular this would allow us to use the existing
>     capacity for more useful purposes, such as pre-generating all
>     expected thumbnail sizes at upload time. Meaning that "official"
>     clients, or clients sticking to the extensive list of sizes
>     we'll support, will never hit a thumbnail size that needs
>     to be generated on the fly.
>
>         is it possible to cache based on a last accessed timestamp?
>
>     When we move away from swift, this is exactly what we want to set
>     up. Although it would be interesting to contemplate making
>     exceptions for widely used sizes. What I'm describing is a
>     temporary solution while we still live in the thumbnails-on-swift
>     status quo.
>
>         - if an image size has not been accessed within x number of
>         days purge it from the cache
>
>     Basically this is an attempt to do this on swift, while not
>     touching sizes that we know are requested by a lot of clients.
>
>     On Wed, Aug 13, 2014 at 12:35 PM, dan-nl <[email protected]> wrote:
>
>         what is the main use case?
>
>         - as maintainers of the wikimedia media file servers, we want
>         to reduce the number of images cached in order to save storage
>         space and cost?
>
>         - and/or something else?
>
>         is it possible to cache based on a last accessed timestamp?
>
>         - if an image size has not been accessed within x number of
>         days purge it from the cache
>
>         with kind regards,
>         dan
>
>         On Aug 13, 2014, at 11:18 , Neil Kandalgaonkar <[email protected]> wrote:
>
>         > I think I need more context. Is this what you're saying?
>         >
>         > - we want to use less storage space
>         > - images we are generating and caching for not-Wikipedia
>         >   should be the first to go
>         > - we assume weird sizes are from not-Wikipedia. So let's
>         >   cache them for less time
>         > - except, that doesn't work, because of tall images
>         > - so maybe we should change the image request format?
>         >
>         > If this is accurate I have a few questions:
>         > - If you want to prioritize Wiki[mp]edia thumbnails, why not
>         >   use the referrer header instead? Why use the width parameter
>         >   to detect this?
>         > - Are we sure we'll improve overall performance by evicting
>         >   certain files from cache quicker? Why not trust the LRU cache
>         >   algorithm?
>         >
>         > On 8/13/14, 1:36 AM, Gilles Dubuc wrote:
>         >> Currently the file page provides a set of different image
>         >> sizes for the user to directly access. These sizes are usually
>         >> width-based. However, for tall images they are height-based.
>         >> The thumbnail urls, which are used to generate them, pass only
>         >> a width.
>         >>
>         >> What this means is that tall images end up with arbitrary
>         >> thumbnail widths that don't follow the set of sizes meant for
>         >> the file page. The end result from an ops perspective is that
>         >> we end up with very diverse widths for thumbnails.
>         >> Not a problem in itself, but the exposure of these random-ish
>         >> widths on the file page means that we can't set a different
>         >> caching policy for non-standard widths without affecting the
>         >> images linked from the file page.
>         >>
>         >> I see two solutions to this problem, if we want to
>         >> introduce different caching tiers for thumbnail sizes that
>         >> come from mediawiki and thumbnail sizes that were requested by
>         >> other things.
>         >>
>         >> The first one would be to always keep the size progression
>         >> on the file page width-bound, even for soft-rotated images.
>         >> The first drawback of this is that for very skinny/very wide
>         >> images the file size progression between the sizes could
>         >> become steep. The second drawback is that we'd often offer
>         >> fewer size options, because they'd be based on the smallest
>         >> dimension.
>         >>
>         >> The second option would be to change the syntax of the
>         >> thumbnail urls in order to allow a height constraint. This is a
>         >> pretty scary change.
>         >>
>         >> If we don't do anything, it simply means that we'll have to
>         >> apply the same caching policy to every size smaller than 1280.
>         >> We could already save quite a bit of storage space by evicting
>         >> non-standard sizes larger than that, but sizes lower than 1280
>         >> would have to stay the way they are now.
>         >>
>         >> Thoughts?
>         >>
>         >> _______________________________________________
>         >> Multimedia mailing list
>         >> [email protected]
>         >> https://lists.wikimedia.org/mailman/listinfo/multimedia
>         >
>         > --
>         > Neil Kandalgaonkar (|
>         > <[email protected]>

--
Neil Kandalgaonkar (| <[email protected]>
