Hey all :). We've been doing some comparative analysis of all the different pageviews options we have, as a first step towards putting the new definition into production. TL;DR: 'promising' doesn't cover how happy these results make me - see http://ironholds.org/deimos/qa_tests.png

Some background: we've been working on a new definition to address deficiencies in the existing one. At the same time, we've also been writing UDFs - User Defined Functions - in Java, so that analysts can conveniently apply whatever we come up with to our data store of requests. This is a step forward from where we are at the moment, which is relying on complex Hive queries or implementations in various languages that are run over the /sampled/ logs rather than the unsampled ones, and so can't be used for per-page stats.

As part of the robustness testing of the new definition I compared four options. These were:

1. The legacy definition, through a Hive query;
2. The legacy definition, through its new UDF;
3. The new definition, through the sampled logs;
4. The new definition, through its new UDF.

The results can be seen at http://ironholds.org/deimos/qa_tests.png - the reason there only appear to be three lines, most of the time, is that the results match so closely that it's not possible to visually distinguish them. This is a pretty good heuristic for "good implementations" :D.

So, what does this mean? It means we're confident in the UDFs' ability to replicate the previous implementations. This means we can conveniently use them to test the validity of the actual definition, and can rely on them for production-ready analysis when that definition is signed off on.

What's next? Digging into the spike on 29 January and, if it doesn't show anything scary, hand-coding the output of the two UDFs to see if we're confident in the new definition. And if we are...well. We release that data :D.

Tremendous thanks to Aaron Halfaker, Christian A. and Andrew Otto for (respectively) their poking, code review and introduction to Java!

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Analytics mailing list
https://lists.wikimedia.org/mailman/listinfo/analytics
