And, an additional point; I don't understand why, if dupes is the
problem, the Hive query was not hit as badly by this as the equivalent
UDF.

On 22 February 2015 at 18:22, Oliver Keyes <[email protected]> wrote:
> As previously mentioned, I've been digging into QA testing the new
> pageviews definitions, and noticed a weird spike.[0] This was narrowed
> down to 27 January, and thence to 22:00-23:00 UTC on 27 January,[1]
> and a breakpoint was then seen at approximately 22:35 UTC.[2]
>
> TL;DR: either I don't understand how sequence number/hostname
> combinations work or there's massive duplication and sometimes
> triplication happening in the webrequest table.
>
> I grabbed a 5-minute slice of pageviews around the 22:35 breakpoint,
> coming to 6 million rows in total. My first hypothesis was that we
> were looking at some form of external attack (automata, say?), but the
> requests were evenly distributed between the desktop and mobile
> sites,[3] were not linked to any particular user agent or class of
> user agents,[4] and were not linked to any particular IP address.[5]
>
> With that hypothesis looking tentative I instead investigated internal
> snafus. The most obvious was duplicate events. As I understand it (and
> I really hope I'm wrong about this), each hostname issues a
> unique-to-the-host sequence number with each request, incrementing
> each time. Accordingly, in a universe where we have no duplicate
> events, the {hostname, sequence_number} tuples in a dataset of
> requests should contain zero duplicates.
>
> I dug into this and looked at how many duplicate tuples we had.
> And...bingo. We have many, /many/ duplicate tuples, and the point at
> which it reduces lines up with when the number of pageviews
> reduces.[6] Moreover, the number of duplicates is not proportionate to
> the number of pageviews.[7] So it looks like what we're dealing with
> here is a tremendous rise in duplicate events in the webrequests
> table. After the duplicate requests were removed, we ended up with a
> more natural pattern.[8] IOW, a chaotic pattern matching the chaotic
> pattern we see in the number of distinct IPs.
>
> Thoughts:
> 1. I thought we had systems in place to stop this? We should be
> calculating a per-host arithmetic series over the sequence numbers
> when data is loaded.
> 2. Please tell me that my understanding of how unique sequence numbers
> are is terribly terribly wrong, because the alternative is...trouble.
> 3. I'm not sure what this means for our "actual" pageviews, given that
> as [7] shows, we still have a lot of duplicates after the artificial
> spike ends.
> 4. How many issues do I have to ID before people take me up on my
> request to be exclusively referred to as 'Count Logula'? ;)
>
>
> [0] 
> https://upload.wikimedia.org/wikipedia/commons/4/40/First_pageview_QA_test.png
> [1] 
> https://upload.wikimedia.org/wikipedia/commons/0/02/First_Pageview_QA_test_-_27_January_2015.png
> [2] 
> https://upload.wikimedia.org/wikipedia/commons/a/a4/First_Pageview_QA_test_-_2200.png
> [3] 
> https://upload.wikimedia.org/wikipedia/commons/e/ec/27_2200_analysis_per_source.png
> [4] 
> https://upload.wikimedia.org/wikipedia/commons/d/dd/27_2200_analysis_per_agent_type.png
> [5] 
> https://upload.wikimedia.org/wikipedia/commons/5/51/27_2200_analysis_distinct_ips.png
> [6] 
> https://upload.wikimedia.org/wikipedia/commons/4/48/27_2200_analysis_duplicate_events.png
> [7] 
> https://upload.wikimedia.org/wikipedia/commons/7/7d/27_2200_analysis_duplicate_proportion.png
> [8] 
> https://upload.wikimedia.org/wikipedia/commons/a/a6/27_2200_analysis_de_duplicated.png
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to