And, an additional point; I don't understand why, if dupes is the problem, the Hive query was not hit as badly by this as the equivalent UDF.
On 22 February 2015 at 18:22, Oliver Keyes <[email protected]> wrote: > As previously mentioned, I've been digging into QA testing the new > pageviews definitions, and noticed a weird spike.[0] This was narrowed > down to 27 January, and thence to 22:00-23:00 UTC on 27 January,[1] > and a breakpoint was then seen at approximately 22:35 UTC.[2] > > TL;DR: either I don't understand how sequence number/hostname > combinations work or there's massive duplication and sometimes > triplication happening in the webrequest table. > > I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint, > coming to 6 million rows in total. My first hypothesis was that we > were looking at some form of external attack (automata, say?), but the > requests were evenly distributed between the desktop and mobile > sites,[3] were not linked to any particular user agent or class of > user agents,[4] and were not linked to any particular IP address.[5] > > With that hypothesis looking tentative I instead investigated internal > snafus. The most obvious was duplicate events. As I understand it (and > I really hope I'm wrong about this), each hostname issues a > unique-to-the-host sequence number with each request, incrementing > each time. Accordingly, in a universe where we have no duplicate > events, the {hostname, sequence_number} tuples in a dataset of > requests should contain zero duplicates. > > I dug into this and looked at how many duplicate tuples we had. > And...bingo. We have many, /many/ duplicate tuples, and the point at > which it reduces lines up with when the number of pageviews > reduces.[6] Moreover, the number of duplicates is not proportionate to > the number of pageviews.[7] So it looks like what we're dealing with > here is a tremendous rise in duplicate events in the webrequests > table. After the duplicate requests were removed, we ended up with a > more natural pattern.[8] IOW, a chaotic pattern matching the chaotic > pattern we see in the number of distinct IPs. > > Thoughts: > 1. I thought we had systems in place to stop this? We should be > calculating a per-host arithmetic series over the sequence numbers > when data is loaded. > 2. Please tell me that my understanding of how unique sequence numbers > are is terribly terribly wrong, because the alternative is...trouble. > 3. I'm not sure what this means for our "actual" pageviews, given that > as [7] shows, we still have a lot of duplicates after the artificial > spike ends. > 4. How many issues do I have to ID before people take me up on my > request to be exclusively referred to as 'Count Logula'? ;) > > > [0] > https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.png > [1] > https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_-_27_January_2015.png > [2] > https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_-_2200.png > [3] > https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_source.png > [4] > https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_agent_type.png > [5] > https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distinct_ips.png > [6] > https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplicate_events.png > [7] > https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplicate_proportion.png > [8] > https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_duplicated.png > > -- > Oliver Keyes > Research Analyst > Wikimedia Foundation -- Oliver Keyes Research Analyst Wikimedia Foundation _______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
