pig-user  

Call for queries

Alan Gates
Thu, 02 Oct 2008 12:59:42 -0700

I propose that Pig develop a standard set of benchmark queries that can be run from release to release to measure Pig's (hopefully improving) performance over time. This would be similar in nature to Hadoop's GridMix (see http://svn.apache.org/viewvc/hadoop/core/tags/release-0.17.1/src/test/gridmix/ and http://developer.yahoo.com/blogs/hadoop/). The set should be relatively small (probably under 10 queries), but it should cover the range of operations that Pig users actually perform.

So, if you have queries that you think would be good candidates and that you can share (or obfuscate and then share), please do so. In addition to the query, please give some idea of the type of data it runs over. In particular, we need to know how much data there is, how many fields each record has, and the cardinality and distribution of any fields used as a group, cogroup, or sort key.
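
If it helps, here is a rough sketch of one way to collect those numbers from a sample of your data. It is only an illustration: it assumes the sample is a delimited text file with one record per line, and the function name `profile_key` and its parameters are made up for this example rather than taken from Pig or Hadoop.

```python
import csv
from collections import Counter


def profile_key(path, key_index, delimiter="\t", top_n=5):
    """Summarize a delimited text file for one candidate key field.

    Hypothetical helper: the file layout (one record per line, fields
    separated by `delimiter`) is an assumption, not a Pig requirement.
    """
    rows = 0
    field_counts = Counter()  # number of fields per record -> record count
    key_counts = Counter()    # key value -> number of records with that key
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for record in reader:
            if not record:  # skip blank lines
                continue
            rows += 1
            field_counts[len(record)] += 1
            if key_index < len(record):
                key_counts[record[key_index]] += 1
    return {
        "rows": rows,
        "fields_per_record": dict(field_counts),
        "key_cardinality": len(key_counts),
        "top_keys": key_counts.most_common(top_n),
    }
```

Calling `profile_key("sample.tsv", 0)` on a sample file (the file name is a placeholder) returns the record count, the field counts, the number of distinct key values, and the most frequent keys, which gives a first idea of how skewed the key is.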

Thanks.

