I changed the failure limiting to detect fatals in
https://gerrit.wikimedia.org/r/#/c/127642/ and also tweaked the "large
source file download" pool counter config timeouts in
https://gerrit.wikimedia.org/r/#/c/127654/. Hopefully that should help for
now.
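
For anyone who wants a picture of the per-file limiting idea, here is a
rough sketch. It is illustration only, not the real PoolCounter code or
config: the class name, the worker count, and the timeout value are all
made up. It caps how many threads may render the same file at once and
makes the rest give up after a timeout instead of piling up.

```python
import threading
from contextlib import contextmanager


class PerKeyLimiter:
    """Hypothetical stand-in for a pool-counter-style lock (not the real API)."""

    def __init__(self, workers, timeout):
        self._workers = workers      # max concurrent holders per key
        self._timeout = timeout      # seconds to wait for a free slot
        self._guard = threading.Lock()
        self._slots = {}             # key -> semaphore

    def _slot(self, key):
        # Create the semaphore for a key lazily, under a lock.
        with self._guard:
            if key not in self._slots:
                self._slots[key] = threading.BoundedSemaphore(self._workers)
            return self._slots[key]

    @contextmanager
    def acquire(self, key):
        slot = self._slot(key)
        # Wait up to the timeout for a slot, then fail fast.
        if not slot.acquire(timeout=self._timeout):
            raise TimeoutError(f"too many concurrent renders of {key}")
        try:
            yield
        finally:
            slot.release()
```

A caller would wrap the expensive step in `with limiter.acquire(file_name):`
and serve an error or placeholder on `TimeoutError`, so one huge file cannot
occupy every worker.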


On Mon, Apr 21, 2014 at 3:05 AM, Faidon Liambotis <[email protected]> wrote:

> On Mon, Apr 21, 2014 at 10:56:40AM +0200, Giuseppe Lavagetto wrote:
> > The problem resolved before I could get to strace the apache processes, so
> > I don't have more details - Faidon was investigating as well and may have
> > more info.
>
> Indeed, I do: this had nothing to do with TMH. The trigger was Commons
> User:Fæ uploading hundreds of 100-200MB multipage TIFFs via GWToolset
> over the course of 4-5 hours (multiple files per minute), and then
> random users/bots viewing Special:NewFiles, which attempts to display a
> thumbnail for all of those new files in parallel in realtime, and thus
> saturating imagescalers' MaxClients setting and basically inadvertently
> DoSing them.
>
> The issue was temporary because of
> https://bugzilla.wikimedia.org/show_bug.cgi?id=49118 but since the user
> kept uploading new files, it was recurrent, with different files every
> time. Essentially, we would keep having short outages every now and then
> for as long as the upload activity continued.
>
> I left a comment over at https://commons.wikimedia.org/wiki/User_talk:Fæ
> and contacted Commons admins over at #wikimedia-commons, as a courtesy
> to both before I used my root to elevate my privileges and ban a
> long-time prominent Wikimedia user as an emergency countermeasure :)
>
> It was effective, as Fæ immediately responded and ceased the activity
> until further discussion; the Commons community was also helpful in the
> short discussion that followed.
>
> Andre also pointed out that Fæ had previously begun the "Images so big
> they break Commons" thread at the Commons Village Pump:
>
> https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F
>
> As for the more permanent solution: there's not much we, as ops, can do
> about this but say "no, don't upload all these files", which is
> obviously not a great solution :) The root cause is an architecture
> issue with how imagescalers behave with regard to resource-intensive
> jobs coming in a short period of time. Perhaps a combination of
> poolcounter per file and more capacity (servers) would alleviate the
> effect, but ideally we should be able to have some grouping &
> prioritization of imagescaling jobs so that large jobs can't completely
> saturate and DoS the cluster.
>
> Aaron/Multimedia team, what do you think?
>
> Regards,
> Faidon
>
> _______________________________________________
> Ops mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/ops
>



-- 
-Aaron S
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia