I already confirmed this on IRC; I'm repeating it here for the archives. I'm also convinced that the client IP hashing bug we just found explains this problem. It was good that we looked into the other issues, but the main one seems to be the IP hashing. We'll brainstorm more tomorrow on how to fix that.
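For the archives, here is a minimal sketch of how one might list the tokens that show up under more than one clientIp value. It is an illustration only, not a query anyone ran on the real data: it uses Python's sqlite3 with a made-up in-memory table named events and invented rows. Only the column names event_pageViewToken and clientIp come from the thread.

```python
import sqlite3

# Hypothetical rows (token, hashed client IP); invented for illustration,
# not taken from the real CompletionSuggestions data.
rows = [
    ("token-a", "hash-1"),
    ("token-a", "hash-2"),
    ("token-a", "hash-2"),
    ("token-b", "hash-3"),
]

# In-memory stand-in for the logging table; "events" is an assumed name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_pageViewToken TEXT, clientIp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)


def tokens_with_multiple_ips(conn):
    """Return (token, distinct clientIp count) for tokens seen under >1 clientIp."""
    query = """
        SELECT event_pageViewToken, COUNT(DISTINCT clientIp) AS ips
        FROM events
        GROUP BY event_pageViewToken
        HAVING COUNT(DISTINCT clientIp) > 1
        ORDER BY ips DESC
    """
    return conn.execute(query).fetchall()


print(tokens_with_multiple_ips(conn))
```

If the hashing is what makes one IP appear as several clientIp values, tokens listed by a query of this shape would be the ones affected, and the list should shrink once the hashing is fixed.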
On Tue, Sep 15, 2015 at 6:23 PM, Oliver Keyes <[email protected]> wrote:
> Update; I read Dan's thread about hashing, read this thread, and a
> penny dropped ;).
>
> This is totally explainable by the fact that we /expect/ to see
> multiple pageIDs per IP. And we are! The hashing problem just means
> those aren't /appearing/ to be the same IP.
>
> On 15 September 2015 at 18:05, Erik Bernhardson
> <[email protected]> wrote:
> > We've deployed the change to bucketing, but we are still seeing the same
> > issue in the collected data.
> >
> > Again we are generating a unique 64 bit random number when the user gets to
> > the page. We are seeing this same 64 bit unique number being reported by
> > multiple ip addresses.
> >
> > Since deploying the new schema number with the updated bucket selection we
> > have seen 13 distinct tokens coming from 42 distinct ip addresses. This
> > shouldn't be possible.
> >
> > mysql:[email protected] [log]> select count(distinct clientIp) from CompletionSuggestions_13630018;
> > +--------------------------+
> > | count(distinct clientIp) |
> > +--------------------------+
> > |                       42 |
> > +--------------------------+
> > 1 row in set (0.00 sec)
> >
> > mysql:[email protected] [log]> select count(distinct event_pageViewToken) from CompletionSuggestions_13630018;
> > +-------------------------------------+
> > | count(distinct event_pageViewToken) |
> > +-------------------------------------+
> > |                                  13 |
> > +-------------------------------------+
> > 1 row in set (0.00 sec)
> >
> > My best guess at this point is that something has changed in the way these
> > clientIp's are collected and is incorrect.
> >
> > On Mon, Sep 14, 2015 at 1:32 PM, Erik Bernhardson
> > <[email protected]> wrote:
> >>
> >> Thanks for taking a look over this. I've incorperated your suggestions
> >> into a patch[1] and if all looks good will send that out in SWAT. We should
> >> be able to look at the data collected overnight and see if things are more
> >> sane tomorrow.
> >>
> >> [1] https://gerrit.wikimedia.org/r/#/c/238306/
> >>
> >> On Mon, Sep 14, 2015 at 11:56 AM, Gergo Tisza <[email protected]>
> >> wrote:
> >>>
> >>> You are queueing a logging callback every time a request is sent (which
> >>> is roughly every time the user types another character in the search box)
> >>> until the tracking module finishes loading and mw.searchSuggest.request is
> >>> restored. On a slow connection the user might type several characters and
> >>> trigger several log events by then. If you filter for queries from the same
> >>> non-unique IP, you will probably see something like "a", "ab", "abc"...
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
