Those CSVs don't have column names -- what do they represent?

On 8/13/14, 7:24 AM, Gilles Dubuc wrote:
> Just to get a sense of the scale, "non-standard" sizes > 1280
> represent approximately 2 TB of Swift storage at the moment. And all
> sizes <= 1280 (where we can't tell "non-standard/standard" apart)
> represent approximately 16 TB. As for "standard sizes" > 1280, they
> total around 1.6 TB.
>
> It's hard to estimate how much we're looking to save on sizes < 1280
> due to the issue I've described earlier. But it's probably something
> expressed in terabytes.
>
> Filippo told me that the space I've just mentioned doesn't take into
> account the swift replication (currently 3 copies). Which means that
> we're currently talking about three times as much physical storage space.
>
> I've looked at the amount of hits for sizes > 1280 and "non-standard"
> thumbnails are viewed 3.3 times less often than "standard" ones. That means
> some strange sizes are getting a decent amount of traffic, but I
> haven't looked at the distribution yet to see if there are some sizes
> that clearly stand out and might be "standard" sizes which we don't
> know about lurking in there.
>
> I've attached Filippo's CSV dumps, so that everyone can have fun at
> home extracting meaning from that data.
>
> For reference, this is the list of "standard" sizes we've come up
> with, by hunting for various areas of the code that govern thumbnail
> sizes served:
>
>
>
>
> On Wed, Aug 13, 2014 at 12:59 PM, Gilles Dubuc <[email protected]
> <mailto:[email protected]>> wrote:
>
>     The context is that Filippo from Ops would like to run a regular
>     cleanup job that deletes thumbnails from swift that have
>     non-mediawiki-requested sizes, when they haven't been accessed for
>     X amount of time. Currently we keep all thumbnails forever.
>
>     The idea is that 3rd-party tools requesting odd sizes would take up
>     less storage space, as what they request would be deleted
>     after a while. This would be accompanied by documentation
>     for developers indicating that best performance is obtained
>     when using a predefined set of sizes currently in use by the
>     various tools in production (core, extensions, mobile apps and
>     sites, etc.).
>
>     This is an interim solution while we still store thumbnails on
>     swift, which in itself is something we want to change in the future.
>
>
>         - we want to use less storage space
>
>
>     Yes
>
>
>         - images we are generating and caching for not-Wikipedia
>         should be the first to go
>
>
>     Yes. More accurately, images we are currently generating for
>     unknown 3rd parties requesting unusual sizes.
>
>
>         - we assume weird sizes are from not-Wikipedia. So let's cache
>         them for less time
>
>
>     Either they are coming from unknown 3rd parties, or from defunct
>     code. And yes, the idea is to keep them in swift for a period,
>     instead of keeping them in swift forever.
>
>
>         - except, that doesn't work, because of tall images
>
>
>     We can't differentiate requests coming from core's file page for
>     tall images from odd sizes for anything below 1280px width. Above
>     that, it's a lot easier to tell the difference between code we run
>     and 3rd parties. Which means that we're probably already going to
>     see some significant storage savings. In fact, Filippo has given me
>     figures from production; I just have to compile them to know how
>     much storage we're talking about. I'll do that soon and it will be
>     a good opportunity to see how much we're "missing out" due to the
>     <1280 tall images case.
>
>
>         - so maybe we should change the image request format?
>
>
>     If the thumbnail URL format allowed a height constraint in addition
>     to width, we could keep the existing file page behavior and
>     differentiate "ours vs theirs" thumbnail requests for sizes below
>     1280px. It would be a lot of work; we have to see if it's worth it.
>
>
>         - If you want to prioritize Wiki[mp]edia thumbnails, why not
>         use the referrer header instead? Why use the width parameter
>         to detect this?
>
>
>     Referrer is unreliable in the real world.  Browsers can suppress
>     it, so can proxies, etc. The width parameter doesn't tell us the
>     source. If we receive a request for "469" width, we can't tell if
>     it's coming from a 3rd party or a visitor of the file page for an
>     image which is for example 469px wide and 1024px tall.
>
>
>         - Are we sure we'll improve overall performance by evicting
>         certain files from cache quicker? Why not trust the LRU cache
>         algorithm?
>
>
>     Performance, no, but storage space, yes. The idea is that the
>     performance impact would be limited to clients requesting
>     weird image sizes. I don't think we have an LRU option to speak of;
>     it would be a job written by Ops.
>
>         - as maintainers of the wikimedia media file servers, we want
>         to reduce the number of images cached in order to save storage
>         space and cost?
>
>
>     Yes, and in particular this would allow us to use the existing
>     capacity for more useful purposes, such as pre-generating all
>     expected thumbnail sizes at upload time. That means "official"
>     clients, or clients sticking to the extensive list of sizes
>     we'll support, will never hit a thumbnail size that needs
>     to be generated on the fly.
>
>
>         is it possible to cache based on a last accessed timestamp?
>
>
>     When we move away from swift, this is exactly what we want to set
>     up. Although it would be interesting to contemplate making
>     exceptions for widely used sizes. What I'm describing is a
>     temporary solution while we still live in the thumbnails-on-swift
>     status quo.
>
>
>         - if an image size has not been accessed within x number of
>         days purge it from the cache
>
>
>     Basically this is an attempt to do this on swift, while not
>     touching sizes that we know are requested by a lot of clients.
>
>
>     On Wed, Aug 13, 2014 at 12:35 PM, dan-nl
>     <[email protected]
>     <mailto:[email protected]>> wrote:
>
>         what is the main use case?
>
>         - as maintainers of the wikimedia media file servers, we want
>         to reduce the number of images cached in order to save storage
>         space and cost?
>
>         - and/or something else?
>
>
>         is it possible to cache based on a last accessed timestamp?
>
>         - if an image size has not been accessed within x number of
>         days purge it from the cache
>
>
>         with kind regards,
>         dan
>
>
>         On Aug 13, 2014, at 11:18 , Neil Kandalgaonkar
>         <[email protected] <mailto:[email protected]>> wrote:
>
>         > I think I need more context. Is this what you're saying?
>         >
>         > - we want to use less storage space
>         > - images we are generating and caching for not-Wikipedia
>         should be the first to go
>         > - we assume weird sizes are from not-Wikipedia. So let's
>         cache them for less time
>         > - except, that doesn't work, because of tall images
>         > - so maybe we should change the image request format?
>         >
>         > If this is accurate I have a few questions:
>         > - If you want to prioritize Wiki[mp]edia thumbnails, why not
>         use the referrer header instead? Why use the width parameter
>         to detect this?
>         > - Are we sure we'll improve overall performance by evicting
>         certain files from cache quicker? Why not trust the LRU cache
>         algorithm?
>         >
>         >
>         >
>         > On 8/13/14, 1:36 AM, Gilles Dubuc wrote:
>         >> Currently the file page provides a set of different image
>         sizes for the user to directly access. These sizes are usually
>         width-based. However, for tall images they are height-based.
>         The thumbnail URLs that are used to generate them pass only
>         a width.
>         >>
>         >> What this means is that tall images end up with arbitrary
>         thumbnail widths that don't follow the set of sizes meant for
>         the file page. The end result from an ops perspective is that
>         we end up with very diverse widths for thumbnails. Not a
>         problem in itself, but the exposure of these random-ish widths
>         on the file page means that we can't set a different caching
>         policy for non-standard widths without affecting the images
>         linked from the file page.
>         >>
>         >> I see two solutions to this problem, if we want to
>         introduce different caching tiers for thumbnail sizes that
>         come from mediawiki and thumbnail sizes that were requested by
>         other things.
>         >>
>         >> The first one would be to always keep the size progression
>         on the file page width-bound, even for soft-rotated images.
>         The first drawback of this is that for very skinny/very wide
>         images the file size progression between the sizes could
>         become steep. The second drawback is that we'd often offer
>         fewer size options, because they'd be based on the smallest
>         dimension.
>         >>
>         >> The second option would be to change the syntax of the
>         thumbnail URLs in order to allow a height constraint. This is a
>         pretty scary change.
>         >>
>         >> If we don't do anything, it simply means that we'll have to
>         apply the same caching policy to every size smaller than 1280.
>         We could already save quite a bit of storage space by evicting
>         non-standard sizes larger than that, but sizes lower than 1280
>         would have to stay the way they are now.
>         >>
>         >> Thoughts?
>         >>
>         >>
>         >> _______________________________________________
>         >> Multimedia mailing list
>         >>
>         >> [email protected]
>         <mailto:[email protected]>
>         >> https://lists.wikimedia.org/mailman/listinfo/multimedia
>         >
>         >
>         > --
>         > Neil Kandalgaonkar (|
>         > <[email protected] <mailto:[email protected]>>
>
>
>
>
>
>
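To make the tall-image problem from the quoted thread concrete, here is a minimal Python sketch. It assumes the width of a height-bound thumbnail is derived by preserving the aspect ratio and rounding to the nearest pixel; the rounding MediaWiki actually uses may differ. The function name, the 938x2048 source image and the placeholder width set are all hypothetical, since the real list of "standard" sizes is not reproduced in this thread.

```python
def width_for_height(orig_width: int, orig_height: int, target_height: int) -> int:
    """Return the width that yields target_height with the aspect ratio preserved.

    Assumption: nearest-pixel rounding, never smaller than 1px.
    """
    return max(1, round(orig_width * target_height / orig_height))


# Hypothetical placeholder values, NOT the real list of standard sizes.
PLACEHOLDER_STANDARD_WIDTHS = frozenset({320, 640, 800, 1024, 1280})

# Hypothetical tall image, 938px wide and 2048px tall, offered at 1024px height.
requested_width = width_for_height(938, 2048, 1024)

# The thumbnail URL only carries this width, so the request is
# indistinguishable from a 3rd party asking for an arbitrary width.
print(requested_width, requested_width in PLACEHOLDER_STANDARD_WIDTHS)
```

The resulting width matches the 469px example in the thread: a legitimate file page request that looks like an odd size once only the width is visible.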
>
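The cleanup policy described in the quoted thread could be sketched roughly as follows. This is only an illustration under stated assumptions, not the job Ops would write: the function name, the placeholder set of "standard" widths above 1280 and the 30-day idle period (the "X amount of time" is not specified in the thread) are all hypothetical.

```python
from datetime import datetime, timedelta

# Below or at this width, file page requests for tall images cannot be
# told apart from 3rd-party requests, so nothing is evicted.
SIZE_THRESHOLD_PX = 1280

# Hypothetical placeholders, NOT values taken from the thread.
PLACEHOLDER_STANDARD_WIDTHS = frozenset({1920, 2560, 2880})
PLACEHOLDER_MAX_IDLE = timedelta(days=30)


def should_evict(
    width: int,
    last_access: datetime,
    now: datetime,
    standard_widths: frozenset = PLACEHOLDER_STANDARD_WIDTHS,
    max_idle: timedelta = PLACEHOLDER_MAX_IDLE,
) -> bool:
    """Decide whether a stored thumbnail is a candidate for deletion."""
    if width <= SIZE_THRESHOLD_PX:
        return False  # ambiguous range: keep the current keep-forever policy
    if width in standard_widths:
        return False  # size requested by known production code
    return now - last_access > max_idle  # non-standard and idle for too long


now = datetime(2014, 8, 13)
stale = now - timedelta(days=90)
print(should_evict(1500, stale, now))  # non-standard, large, idle
print(should_evict(469, stale, now))   # below the threshold
```

Keeping the threshold check first mirrors the point made in the thread: without a height-based URL format, every size at or below 1280 has to keep the same caching policy.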


-- 
Neil Kandalgaonkar (|  <[email protected]>
