Adding the Analytics list and Neil P Quinn.

FYI, Nuria filed this ticket to track this issue:
https://phabricator.wikimedia.org/T123634

Moving the discussion there.
On Thu, Jan 14, 2016 at 1:10 AM, Faidon Liambotis <[email protected]> wrote:
> Well, this was a snapshot of the situation then. It doesn't preclude
> other issues (possibly caused by other, similarly-sized queries) in the
> previous hours/days. That said, tendril definitely shows a correlation
> between all kinds of metrics (page/disk I/O, write traffic etc.) and the
> aforementioned timeline of the past 1.5 days.
>
> Plus, replag for s1 was at the time ~133,900 seconds and rising, which
> matches the timeline of that large query too. Since I killed it, it has
> been steadily dropping, albeit slowly (currently at 132,445). It will
> probably take a couple of days to recover. Since the server is both
> backlogged and I/O-saturated, it will depend a lot on how much load the
> server gets from other queries (currently it's getting hammered by two
> other large queries that have been running for over 27,000 and 4,000
> seconds respectively, for example).
>
> Faidon
>
> On Wed, Jan 13, 2016 at 09:44:50PM -0800, Oliver Keyes wrote:
> > Indeed, but 1.5 days is <half the time the problem has been
> > occurring for.
> >
> > On 13 January 2016 at 21:01, Faidon Liambotis <[email protected]> wrote:
> > > "SELECT * FROM information_schema.processlist ORDER BY time DESC"
> > > informs us of this:
> > >
> > > | 5599890 | research | 10.64.36.103:53669 | enwiki | Query | 133527 | Queried about 890000 rows | CREATE TEMPORARY TA
> > >     SELECT
> > >         page_id,
> > >         views
> > >     FROM (
> > >         SELECT
> > >             page_namespace,
> > >             page_title,
> > >             SUM(views) AS views
> > >         FROM staging.page_name_views_dupes
> > >         WHERE page_namespace = 0
> > >         GROUP BY 1,2
> > >     ) AS group_page_name_views
> > >     INNER JOIN enwiki.page USING (page_namespace, page_title)
> > >
> > > Column 6 is "time", i.e. this query had been running for 133527
> > > seconds at the time (i.e. ~1.5 days!), which is obviously Not
> > > Good™. I just ran "KILL QUERY 5599890;", hopefully this will help.
> > >
> > > The next-longest-running query has been going for over 6 hours now
> > > and is way too long to paste (87 lines, starts with "INSERT INTO
> > > editor_month_global", inline comments, all kinds of subselects in
> > > inner joins etc., queried "about 2470000 rows"). I left it be for
> > > now; we'll see how that goes and I may eventually kill it too, as
> > > I/O is still pegged at 100%.
> > >
> > > I believe long-running queries targeted at the research slaves
> > > aren't particularly new, but they are often the source of such
> > > problems, so they're a good place to start when investigating such
> > > issues. There is only so much a poor database server (and software)
> > > can do :)
> > >
> > > Regards,
> > > Faidon
> > >
> > > On Wed, Jan 13, 2016 at 06:55:26PM -0500, Andrew Otto wrote:
> > > > Hi all,
> > > >
> > > > Replication to dbstore1002 is having a lot of trouble. From
> > > > https://tendril.wikimedia.org/host/view/dbstore1002.eqiad.wmnet/3306,
> > > > we see that normal replication is about 9 hours behind at the
> > > > moment. However, the EventLogging `log` database is not replicated
> > > > with usual MySQL replication. Instead, a custom bash script[1]
> > > > periodically uses mysqldump to copy data from m4-master
> > > > (dbproxy1004) into dbstore1002. (I just recently found out that
> > > > this wasn’t regular replication, and I’m not familiar with the
> > > > reasoning behind using a custom script.)
> > > >
> > > > The EventLogging `log` custom replication has been lagging for
> > > > days now.
> > > > Also, at about 18:00 UTC today (Jan 13), we can see a huge
> > > > increase in write traffic on dbstore1002. I looked at each of the
> > > > normal replication masters, and don’t see this write traffic
> > > > there. EventLogging traffic also seems to be about the same over
> > > > the last week or so (although there was an increase in events
> > > > being produced by the MobileWebSectionUsage schema starting
> > > > Dec 18, but I don’t think this is the problem).
> > > >
> > > > Disk util is around 100%, but this has been the case for a while
> > > > now. Today I filed https://phabricator.wikimedia.org/T123546 for a
> > > > bad mem chip or slot on dbproxy1004, but this also seems to have
> > > > been the status quo for quite a while, and doesn’t correlate with
> > > > this lag.
> > > >
> > > > I’m not sure where else to look at the moment, and I need to run
> > > > for the day. I’ll try to look at this more tomorrow in my morning.
> > > >
> > > > -AO
> > > >
> > > > [1]
> > > > https://github.com/wikimedia/operations-puppet/blob/f0df1ec45b3f70a5c041cef217751014824ca6ec/files/mariadb/eventlogging_sync.sh
> > > >
> > > > _______________________________________________
> > > > Ops mailing list
> > > > [email protected]
> > > > https://lists.wikimedia.org/mailman/listinfo/ops
> > >
> > > _______________________________________________
> > > Ops mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/ops
> >
> > --
> > Oliver Keyes
> > Count Logula
> > Wikimedia Foundation
>
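For anyone picking this up from the Phabricator task: the inspect-and-kill
workflow Faidon describes above is standard MySQL/MariaDB. A minimal sketch
(the id 5599890 is the one from his mail; substitute whatever the
processlist actually shows on the affected replica):

    -- List client queries on the replica (e.g. dbstore1002),
    -- longest-running first, as in Faidon's processlist check.
    SELECT id, user, host, db, command, time, state,
           LEFT(info, 80) AS query_prefix
    FROM information_schema.processlist
    WHERE command = 'Query'
    ORDER BY time DESC;

    -- Terminate a runaway query by its processlist id. 5599890 is the
    -- id from the thread above; use the id you actually see.
    KILL QUERY 5599890;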
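The replag figures quoted above are read off tendril; the same number is
also available on the replica itself. A minimal sketch, assuming dbstore1002
uses MariaDB multi-source replication with per-shard connection names like
's1' (that naming is the thread's terminology, not something I have verified
on the host):

    -- Seconds_Behind_Master per replication connection; this is the
    -- figure Faidon quotes as "replag for s1" (~133,900s at the time).
    SHOW ALL SLAVES STATUS\G

    -- Or, on a host with a single replication connection:
    SHOW SLAVE STATUS\G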
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
