Re: [Multimedia] [Ops] Brief image scalers outage, Mon Apr 21 03:12 UTC

Gilles Dubuc Mon, 21 Apr 2014 03:23:18 -0700

Thanks for the detailed report, Faidon.

Can you clarify something: for a given set of heavyweight thumbnails that
need to be rendered, assuming the uploads have ceased, would multiple
visits to Special:NewFiles in a short timeframe multiply the saturation by
the number of HTTP requests to the same thumbnail URLs? I.e., if you request
the URL of a thumbnail that is currently being generated because someone
else requested it, does that make the issue worse?
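
To make the question concrete, here is a minimal sketch of the behaviour
that would keep duplicate requests from adding load: concurrent requests
for the same thumbnail URL wait for the one render already in flight.
This is plain Python, not MediaWiki or PoolCounter code; SingleFlight,
demo and the thumbnail key are hypothetical names used only for
illustration.

```python
import threading
import time


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, shared result holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            # First caller for this key does the actual work.
            try:
                holder["value"] = fn()
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            # Everyone else just waits for the leader's result.
            done.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]


def demo(n_requests=5):
    """Fire n_requests concurrent requests for one thumbnail URL."""
    flight = SingleFlight()
    renders = []  # one entry per actual render
    results = []

    def render():
        renders.append(1)
        time.sleep(0.3)  # stand-in for an expensive TIFF render
        return "thumb-bytes"

    def request():
        results.append(flight.do("Big.tiff/page1-120px", render))

    threads = [threading.Thread(target=request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(renders), results
```

In this sketch, demo(5) performs a single render and hands the same
bytes to all five callers. Note that the waiting callers still each hold
a worker while they wait, and a request arriving after the render has
finished would trigger a new one unless the result is cached.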


Second question: why doesn't piling on jobs simply make the jobs that
arrived last complete much later? The same kind of DoS situation could happen
with someone bombarding us with HEAD requests on previously unrequested
thumbnail sizes for small images, so I think that the issue isn't specific
to large jobs. It's more a matter of properly queueing things up so that
the imagescalers don't overload, regardless of the mix of job weight.
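
As a rough illustration of what "properly queueing" could mean, the
sketch below caps the number of concurrent renders so that excess jobs
wait instead of piling onto the workers, and additionally caps heavy jobs
so they can never occupy every slot. It is a sketch under assumptions,
not how the imagescalers work today: ScalerGate, the slot counts and the
heavy/light split are all hypothetical.

```python
import threading
import time


class ScalerGate:
    """Admit at most total_slots jobs at once, of which at most
    heavy_slots may be heavy."""

    def __init__(self, total_slots, heavy_slots):
        self._total = threading.BoundedSemaphore(total_slots)
        self._heavy = threading.BoundedSemaphore(heavy_slots)

    def run(self, job, heavy=False):
        # Heavy jobs take a heavy slot first, so while they wait they do
        # not hold any of the general slots.
        if heavy:
            self._heavy.acquire()
        try:
            with self._total:
                return job()
        finally:
            if heavy:
                self._heavy.release()


def demo(n_heavy=6, n_light=6):
    """Run a mix of jobs and record peak concurrency."""
    gate = ScalerGate(total_slots=4, heavy_slots=2)
    lock = threading.Lock()
    state = {"all": 0, "heavy": 0, "peak_all": 0, "peak_heavy": 0, "done": 0}

    def make_job(heavy):
        def job():
            with lock:
                state["all"] += 1
                state["heavy"] += int(heavy)
                state["peak_all"] = max(state["peak_all"], state["all"])
                state["peak_heavy"] = max(state["peak_heavy"], state["heavy"])
            time.sleep(0.05 if heavy else 0.01)  # stand-in for the render
            with lock:
                state["all"] -= 1
                state["heavy"] -= int(heavy)
                state["done"] += 1
        return job

    kinds = [True] * n_heavy + [False] * n_light
    threads = [
        threading.Thread(target=gate.run, args=(make_job(h), h)) for h in kinds
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

With these numbers, no more than four jobs run at once and no more than
two of them are heavy, so light thumbnails keep making progress while the
heavy ones queue up behind each other.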


On Mon, Apr 21, 2014 at 12:05 PM, Faidon Liambotis <[email protected]> wrote:

> On Mon, Apr 21, 2014 at 10:56:40AM +0200, Giuseppe Lavagetto wrote:
> > The problem resolved before I could get to strace the apache processes, so
> > I don't have more details - Faidon was investigating as well and may have
> > more info.
>
> Indeed, I do: this had nothing to do with TMH. The trigger was Commons
> User:Fæ uploading hundreds of 100-200MB multipage TIFFs via GWToolset
> over the course of 4-5 hours (multiple files per minute), and then
> random users/bots viewing Special:NewFiles, which attempts to display a
> thumbnail for all of those new files in parallel in realtime, and thus
> saturating imagescalers' MaxClients setting and basically inadvertently
> DoSing them.
>
> The issue was temporary because of
> https://bugzilla.wikimedia.org/show_bug.cgi?id=49118 but since the user
> kept uploading new files, it was recurrent, with different files every
> time. Essentially, we would keep having short outages every now and then
> for as long as the upload activity continued.
>
> I left a comment over at https://commons.wikimedia.org/wiki/User_talk:Fæ
> and contacted Commons admins over at #wikimedia-commons, as a courtesy
> to both before I used my root to elevate my privileges and ban a
> long-time prominent Wikimedia user as an emergency countermeasure :)
>
> It was effective, as Fæ immediately responded and ceased the activity
> until further discussion; the Commons community was also helpful in the
> short discussion that followed.
>
> Andre also pointed out that Fæ had previously begun the "Images so big
> they break Commons" thread at the Commons Village Pump:
>
> https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F
>
> As for the more permanent solution: there's not much we, as ops, can do
> about this but say "no, don't upload all these files", which is
> obviously not a great solution :) The root cause is an architecture
> issue with how imagescalers behave with regards to resource-intensive
> jobs coming in a short period of time. Perhaps a combination of
> poolcounter per file and more capacity (servers) would alleviate the
> effect, but ideally we should be able to have some grouping &
> prioritization of imagescaling jobs so that large jobs can't completely
> saturate and DoS the cluster.
>
> Aaron/Multimedia team, what do you think?
>
> Regards,
> Faidon
>
> _______________________________________________
> Multimedia mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/multimedia
>

