On Sun, May 11, 2014 at 11:33 AM, Gergo Tisza <[email protected]> wrote:
>
> I think the short-term outcome was to throttle GWToolset until there is a
> better fix. There is a patch pending to do that:
> https://gerrit.wikimedia.org/r/#/c/132111/
> https://gerrit.wikimedia.org/r/#/c/132112/
>
> I described the thinking behind the limits in this mail and the followups:
> http://thread.gmane.org/gmane.org.wikimedia.glamtools/24/focus=104
> tl;dr it tries to limit the GWToolset-uploaded thumbnails appearing in
> Special:* at one time to 10% (5 with default settings), based on the total
> upload rate in the slowest hour of an average day. That's about one image
> per two minutes.
>

The core patch is merged now, so we could backport and merge the config
patch and restart GWToolset uploads in a few days, if we think the
throttling is enough to prevent further outages.
That is a big if, though: it is not clear that throttling is a good way
to avoid overloading the scalers.

My understanding is that there were three ways in which the NYPL map
uploads were causing problems:

1. The scalers did not have enough processing power to handle all the
thumbnail requests that were coming in simultaneously. This was presumably
because Special:NewFiles and Special:ListFiles were filled with the NYPL
maps, and users looking at those pages sent dozens of thumbnailing requests
in parallel.
2. Swift traffic was saturated by GWToolset-uploaded files, making the
serving of everything else very slow. I assume this was because of the
scalers fetching the original files? Or could this be directly caused by
the uploading somehow?
3. GWToolset jobs piling up in the job queue (Faidon said he cleared out 7396
jobs).

== Scaler overload ==

For the first problem, we can make an educated guess of the level of
throttling required: if we want to keep the number of simultaneous
GWToolset-related scaling requests below X, that means Special:NewFiles and
Special:ListFiles should not have more than X/2 GWToolset files on them at
any given time. Those pages show the last 50 files, so GWToolset should not
upload more than X files in the time it takes normal users to upload 100
of them. I counted the number of uploads per hour on Commons on a weekday;
there were 240 uploads in the slowest hour, which works out to about 25
minutes per 100 files. So GWToolset should be limited to X files per 25
minutes, for some value of X that ops are happy with.
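The arithmetic can be sanity-checked in a few lines. All the constants below
come from the text above; the X = 10 example corresponds to the "5 files per
page" default mentioned in the earlier thread, and none of this is actual
GWToolset code:

```python
# Back-of-envelope check of the throttle numbers above.
uploads_in_slowest_hour = 240   # observed normal uploads on Commons (weekday)
page_size = 50                  # files shown on Special:NewFiles / Special:ListFiles

# Time in which normal users upload 100 files (two page-loads' worth):
minutes_per_100_files = 100 / (uploads_in_slowest_hour / 60)
print(minutes_per_100_files)    # 25.0

# If ops allow X simultaneous GWToolset scaling requests (X/2 per special
# page), GWToolset gets X uploads per 25-minute window:
X = 10                          # example: at most 5 GWToolset files per page
minutes_between_uploads = minutes_per_100_files / X
print(minutes_between_uploads)  # 2.5 -- roughly "one image per two minutes"
```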

This is the best we can do with the current throttling options of the job
queue, I think, but it has a lot of holes. The rate of normal uploads could
drop extremely low for a short time for some reason. New file patrollers
could be looking at the special pages with non-default settings (500 images
instead of 50). Someone could look at the associated category (200
thumbnails at a time). This is not a problem if people are continuously
keeping watch on Special:NewFiles, because that would mean that the
thumbnails get rendered soon after the uploads; but that's an untested
assumption.

So I am not confident that throttling would be enough to avoid further
meltdowns. I think Dan is working on a patch to make the upload jobs
pre-render the thumbnails; we might have to wait for that before allowing
GWToolset uploads again.

== Swift bandwidth overuse ==

This seems simple: just limit the bandwidth available to a single
transfer; if the throttling plus pre-rendering is in place, that ensures
there are no more than a set number of scaling requests running in parallel.
If the bandwidth use is still an issue after that, just come up with a
per-transfer bandwidth limit such that even if the number of scaling
requests maxes out, there is still enough bandwidth remaining to serve
normal requests. (In the future, the bandwidth usage could be avoided
completely by using the same server for uploading and thumbnail rendering,
but that sounds like a more complex change.)
Gilles already started a thread about this ("Limiting bandwidth when
reading from swift").

== Job queue filling up ==

I am not sure if this is a real problem or just a symptom; does this cause
any issues directly?
At any rate, this looks like a bug in the code of the GWToolset jobs, which
have some logic to bail out if there are more than 1000 pending jobs, but
apparently that does not work.
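For reference, a guard of that shape is roughly the following (the queue API
here is entirely hypothetical, not the real GWToolset code). The comment
notes one plausible failure mode: a naive check-then-push is racy under
concurrent producers, and useless if the pending count comes from a lagging
or cached statistic:

```python
MAX_PENDING_JOBS = 1000  # the limit mentioned above

class FakeQueue:
    """Stand-in for the real job queue (hypothetical API)."""
    def __init__(self):
        self.jobs = []
    def pending_count(self):
        return len(self.jobs)
    def push(self, job):
        self.jobs.append(job)

def try_enqueue(queue, job):
    # Naive guard: the count and the push are not atomic, so concurrent
    # producers can all pass the check before any of their jobs is counted;
    # it also breaks if pending_count() reads a stale cached value.
    if queue.pending_count() >= MAX_PENDING_JOBS:
        return False
    queue.push(job)
    return True
```

In the single-producer case the guard holds; under the job runner's real
concurrency that invariant is not guaranteed, which would match the observed
pile-up.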

Any thoughts on this?
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
