Adding the Analytics list and Neil P Quinn.

FYI, Nuria filed this ticket to track this issue:
https://phabricator.wikimedia.org/T123634

Moving the discussion there.
On Thu, Jan 14, 2016 at 1:10 AM, Faidon Liambotis <[email protected]> wrote:
> Well, this was a snapshot of the situation then. It doesn't preclude
> other issues (possibly caused by other, similarly-sized queries) in the
> previous hours/days. That said, tendril definitely shows a correlation
> between all kinds of metrics (page/disk I/O, write traffic etc.) and the
> aforementioned timeline of the past 1.5 days.
>
> Plus, replag for s1 was at the time ~133,900 seconds and rising, which
> matches the timeline of that large query too. Since I killed it, it has
> been steadily dropping, albeit slowly (currently at 132,445). It will
> probably take a couple of days to recover. Since the server is both
> backlogged and I/O-saturated, it will depend a lot on how much load the
> server gets from other queries (currently it's getting hammered by two
> other large queries that have been running for over 27,000 and 4,000
> seconds respectively, for example).
>
> Faidon
>
> On Wed, Jan 13, 2016 at 09:44:50PM -0800, Oliver Keyes wrote:
> > Indeed, but 1.5 days is <half the time the problem has been
> > occurring for.
> >
> > On 13 January 2016 at 21:01, Faidon Liambotis <[email protected]> wrote:
> > > "SELECT * FROM information_schema.processlist ORDER BY time DESC"
> > > informs us of this:
> > >
> > > | 5599890 | research | 10.64.36.103:53669 | enwiki | Query | 133527 | Queried about 890000 rows | CREATE TEMPORARY TA
> > >     SELECT
> > >         page_id,
> > >         views
> > >     FROM (
> > >         SELECT
> > >             page_namespace,
> > >             page_title,
> > >             SUM(views) AS views
> > >         FROM staging.page_name_views_dupes
> > >         WHERE page_namespace = 0
> > >         GROUP BY 1,2
> > >     ) AS group_page_name_views
> > >     INNER JOIN enwiki.page USING (page_namespace, page_title)
> > >
> > > Column 6 is "time", i.e. this query had been running for 133527
> > > seconds at the time (i.e. ~1.5 days!), which is obviously Not
> > > Good™. I just ran "KILL QUERY 5599890;", hopefully this will help.
> > >
> > > The next-longest-running query has been going for over 6 hours now
> > > and is way too long to paste (87 lines, starts with "INSERT INTO
> > > editor_month_global", inline comments, all kinds of subselects in
> > > inner joins etc., queried "about 2470000 rows"). I left it be for
> > > now; we'll see how that goes and I may eventually kill it too, as
> > > I/O is still pegged at 100%.
> > >
> > > I believe long-running queries targeted at the research slaves
> > > aren't particularly new, but they are often the source of such
> > > problems, so they're a good place to start when investigating such
> > > issues. There is only so much a poor database server (and software)
> > > can do :)
> > >
> > > Regards,
> > > Faidon
> > >
> > > On Wed, Jan 13, 2016 at 06:55:26PM -0500, Andrew Otto wrote:
> > > > Hi all,
> > > >
> > > > Replication to dbstore1002 is having a lot of trouble. From
> > > > https://tendril.wikimedia.org/host/view/dbstore1002.eqiad.wmnet/3306,
> > > > we see that normal replication is about 9 hours behind at the
> > > > moment. However, the EventLogging `log` database is not replicated
> > > > with usual MySQL replication. Instead, a custom bash script[1]
> > > > periodically uses mysqldump to copy data from m4-master
> > > > (dbproxy1004) into dbstore1002. (I just recently found out that
> > > > this wasn’t regular replication, and I’m not familiar with the
> > > > reasoning behind using a custom script.)
> > > >
> > > > The EventLogging `log` custom replication has been lagging for
> > > > days now.
> > > > Also, at about 18:00 UTC today (Jan 13), we can see a huge
> > > > increase in write traffic on dbstore1002. I looked at each of the
> > > > normal replication masters, and don’t see this write traffic
> > > > there. EventLogging traffic also seems to be about the same over
> > > > the last week or so (although there was an increase in events
> > > > being produced by the MobileWebSectionUsage schema starting
> > > > Dec 18, but I don’t think this is the problem).
> > > >
> > > > Disk util is around 100%, but this has been the case for a while
> > > > now. Today I filed https://phabricator.wikimedia.org/T123546 for a
> > > > bad mem chip or slot on dbproxy1004, but this also seems to have
> > > > been the status quo for quite a while, and doesn’t correlate with
> > > > this lag.
> > > >
> > > > I’m not sure where else to look at the moment, and I need to run
> > > > for the day. I’ll try to look at this more tomorrow in my morning.
> > > >
> > > > -AO
> > > >
> > > > [1]
> > > > https://github.com/wikimedia/operations-puppet/blob/f0df1ec45b3f70a5c041cef217751014824ca6ec/files/mariadb/eventlogging_sync.sh
> > > >
> > > > _______________________________________________
> > > > Ops mailing list
> > > > [email protected]
> > > > https://lists.wikimedia.org/mailman/listinfo/ops
> > >
> > > _______________________________________________
> > > Ops mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/ops
> >
> > --
> > Oliver Keyes
> > Count Logula
> > Wikimedia Foundation
>
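For anyone picking this up from the Phabricator task: the inspect-and-kill
workflow Faidon describes above is standard MySQL/MariaDB. A minimal sketch
(the id 5599890 is the one from his mail; substitute whatever the
processlist actually shows on the affected replica):

    -- List client queries on the replica (e.g. dbstore1002),
    -- longest-running first, as in Faidon's processlist check.
    SELECT id, user, host, db, command, time, state,
           LEFT(info, 80) AS query_prefix
    FROM information_schema.processlist
    WHERE command = 'Query'
    ORDER BY time DESC;

    -- Terminate a runaway query by its processlist id. 5599890 is the
    -- id from the thread above; use the id you actually see.
    KILL QUERY 5599890;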
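The replag figures quoted above are read off tendril; the same number is
also available on the replica itself. A minimal sketch, assuming dbstore1002
uses MariaDB multi-source replication with per-shard connection names like
's1' (that naming is the thread's terminology, not something I have verified
on the host):

    -- Seconds_Behind_Master per replication connection; this is the
    -- figure Faidon quotes as "replag for s1" (~133,900s at the time).
    SHOW ALL SLAVES STATUS\G

    -- Or, on a host with a single replication connection:
    SHOW SLAVE STATUS\G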
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
