I changed the failure limiting to detect fatals in https://gerrit.wikimedia.org/r/#/c/127642/ and also tweaked the "large source file download" pool counter config timeouts in https://gerrit.wikimedia.org/r/#/c/127654/. Hopefully that helps for now.
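For readers unfamiliar with the idea, here is a rough sketch of what a per-file pool counter with a timeout does. It is an illustration only, not the actual PoolCounter code or the config touched in the changes above; `KeyedPool`, `PoolBusy`, `workers_per_key` and `timeout` are made-up names, and the values are arbitrary.

```python
import threading
from contextlib import contextmanager


class PoolBusy(Exception):
    """Raised when no slot for a key frees up within the timeout."""


class KeyedPool:
    """Hypothetical per-key concurrency limiter (not the real PoolCounter API)."""

    def __init__(self, workers_per_key, timeout):
        self.workers_per_key = workers_per_key  # max concurrent jobs per key
        self.timeout = timeout                  # seconds to wait for a slot
        self._lock = threading.Lock()
        self._slots = {}

    def _semaphore(self, key):
        # Create the semaphore for a key lazily, under a lock.
        with self._lock:
            if key not in self._slots:
                self._slots[key] = threading.BoundedSemaphore(self.workers_per_key)
            return self._slots[key]

    @contextmanager
    def slot(self, key):
        sem = self._semaphore(key)
        # Give up instead of tying up a worker indefinitely.
        if not sem.acquire(timeout=self.timeout):
            raise PoolBusy(key)
        try:
            yield
        finally:
            sem.release()


pool = KeyedPool(workers_per_key=1, timeout=0.05)


def render_thumbnail(file_name):
    """Stand-in for an expensive scaling job on one source file."""
    with pool.slot(file_name):
        return "rendered " + file_name
```

With something like this, a second request for the same large file fails fast with `PoolBusy` rather than occupying another worker, while requests for other files are unaffected.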

On Mon, Apr 21, 2014 at 3:05 AM, Faidon Liambotis <[email protected]> wrote:

> On Mon, Apr 21, 2014 at 10:56:40AM +0200, Giuseppe Lavagetto wrote:
> > The problem resolved before I could get to strace the apache processes, so
> > I don't have more details - Faidon was investigating as well and may have
> > more info.
>
> Indeed, I do: this had nothing to do with TMH. The trigger was Commons
> User:Fæ uploading hundreds of 100-200MB multipage TIFFs via GWToolset
> over the course of 4-5 hours (multiple files per minute), and then
> random users/bots viewing Special:NewFiles, which attempts to display a
> thumbnail for all of those new files in parallel in realtime, and thus
> saturating imagescalers' MaxClients setting and basically inadvertently
> DoSing them.
>
> The issue was temporary because of
> https://bugzilla.wikimedia.org/show_bug.cgi?id=49118 but since the user
> kept uploading new files, it was recurrent, with different files every
> time. Essentially, we would keep having short outages every now and then
> for as long as the upload activity continued.
>
> I left a comment over at https://commons.wikimedia.org/wiki/User_talk:Fæ
> and contacted Commons admins over at #wikimedia-commons, as a courtesy
> to both before I used my root to elevate my privileges and ban a
> long-time prominent Wikimedia user as an emergency countermeasure :)
>
> It was effective, as Fæ immediately responded and ceased the activity
> until further discussion; the Commons community was also helpful in the
> short discussion that followed.
>
> Andre also pointed out that Fæ had previously begun the "Images so big
> they break Commons" thread at the Commons Village Pump:
>
> https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_break_Commons.3F
>
> As for the more permanent solution: there's not much we, as ops, can do
> about this but say "no, don't upload all these files", which is
> obviously not a great solution :) The root cause is an architecture
> issue with how imagescalers behave with regard to resource-intensive
> jobs coming in a short period of time. Perhaps a combination of
> poolcounter per file and more capacity (servers) would alleviate the
> effect, but ideally we should be able to have some grouping &
> prioritization of imagescaling jobs so that large jobs can't completely
> saturate and DoS the cluster.
>
> Aaron/Multimedia team, what do you think?
>
> Regards,
> Faidon
>
> _______________________________________________
> Ops mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/ops
>

--
-Aaron S
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
