Hey all :). We've been doing some comparative analysis of all the
different Pageviews options we have, as a first step towards putting
the new definitions into production. TL;DR: 'promising' doesn't cover
how happy these results make me - see
http://ironholds.org/deimos/qa_tests.png

Some background: we've been working on a new definition to address
deficiencies in the existing one. At the same time, we've also been
working on writing UDFs - User Defined Functions - in Java, so that
analysts can conveniently apply whatever we come up with to our data
store of requests. This is a step forward from where we are at the
moment, which is relying on complex Hive queries, or on implementations
in various languages that are run over the /sampled/ rather than the
unsampled logs and so can't be used for per-page stats.
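
For anyone who hasn't seen what the core of such a UDF looks like, here
is a minimal sketch in plain Java. Everything in it is an assumption
made for illustration: the class name, the method names, the three
request fields and above all the filtering rules are hypothetical, and
are NOT the legacy or the new pageview definition. In Hive, logic like
this would typically sit inside a UDF class's evaluate method; the Hive
dependency is left out so the sketch stays self-contained.

```java
import java.util.Arrays;
import java.util.List;

public class PageviewSketch {

    // HYPOTHETICAL rules, for illustration only. The real definitions
    // are more involved; these three checks are placeholders.
    static boolean isPageview(String uriPath, String httpStatus, String contentType) {
        if (uriPath == null || httpStatus == null || contentType == null) {
            return false; // incomplete request rows never count
        }
        return httpStatus.equals("200")
            && contentType.startsWith("text/html")
            && uriPath.startsWith("/wiki/");
    }

    // Apply the predicate to every request row and count the matches.
    // Each row is {uriPath, httpStatus, contentType}.
    static long countPageviews(List<String[]> requests) {
        long count = 0;
        for (String[] r : requests) {
            if (isPageview(r[0], r[1], r[2])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up request rows, not real log data.
        List<String[]> requests = Arrays.asList(
            new String[] {"/wiki/Main_Page", "200", "text/html; charset=UTF-8"},
            new String[] {"/static/logo.png", "200", "image/png"},
            new String[] {"/wiki/Missing", "404", "text/html"});
        System.out.println(countPageviews(requests)); // prints 1
    }
}
```

The point of keeping the definition in one function like this is that
the same code can be run over every request row, instead of being
re-expressed in each query.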

As part of the robustness testing of the new definition I compared
four options. These were:

1. The legacy definition, through a Hive query;
2. The legacy definition, through its new UDF;
3. The new definition, through the sampled logs;
4. The new definition, through its new UDF.

The results can be seen at http://ironholds.org/deimos/qa_tests.png -
the reason there only appear to be three lines, most of the time, is
that the results match so closely that it's not possible to visually
distinguish them. This is a pretty good heuristic for "good
implementations" :D.
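
"Can't tell the lines apart" can also be put as a number. The sketch
below is one way to do that, not the check that was actually run: it
takes two series of daily counts and reports the largest relative
difference between them. The class and method names are hypothetical,
and the counts in main are made up rather than taken from the graph.

```java
public class SeriesComparison {

    // Largest relative difference |candidate - reference| / reference
    // across all days. Both arrays must cover the same days in order.
    static double maxRelativeDifference(long[] reference, long[] candidate) {
        if (reference.length != candidate.length) {
            throw new IllegalArgumentException("series must have equal length");
        }
        double max = 0.0;
        for (int i = 0; i < reference.length; i++) {
            if (reference[i] == 0) {
                // A zero reference day has no defined relative difference;
                // treat any non-zero candidate as an unbounded mismatch.
                if (candidate[i] != 0) {
                    return Double.POSITIVE_INFINITY;
                }
                continue;
            }
            double diff = Math.abs((double) (candidate[i] - reference[i])) / reference[i];
            max = Math.max(max, diff);
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up daily counts for two implementations of one definition.
        long[] hiveQuery = {1000, 1200, 900};
        long[] udf = {1001, 1199, 900};
        System.out.println(maxRelativeDifference(hiveQuery, udf));
    }
}
```

A small value here says the two implementations agree day by day; it
says nothing about whether the definition itself is right.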

So, what does this mean? It means we're confident in the UDFs' ability
to replicate the previous implementations. This means we can
conveniently use them to test the validity of the actual definition,
and can rely on them for production-ready analysis when that definition
is signed off on.

What's next? Digging into the spike on 29 January, and if it doesn't
show anything scary, hand-coding the output of the two UDFs to see if
we're confident in the new definition. And if we are...well. We
release that data :D.

Tremendous thanks to Aaron Halfaker, Christian A. and Andrew Otto for
(respectively) their poking, code review and introduction to Java!

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
