[ https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-200:
---------------------------

    Attachment: perf.patch

The attached patch takes a different approach to providing a set of benchmarks for Pig. It contains a set of 14 queries designed to cover a range of ways users use Pig. It also includes implementations of the same queries as Java map reduce code, so that developers can compare Pig performance against map reduce performance. See http://wiki.apache.org/pig/PigMix for information on how the queries were chosen, how the data is constructed, and data from an initial run of Pig 0.1.0 versus the soon-to-be-released Pig 0.2.0.

This attachment is not yet ready for inclusion in the code. It has several issues:

# The library used to generate the Zipf distributions in the data is under the GNU General Public License, and thus cannot be included. The library can be obtained at http://www.eli.sdsu.edu/java-SDSU/
# The data generation script is single threaded because the Zipf distribution generator is. This means generating 10m rows of data (about 15G) takes ~48 hours. I'd like to be able to generate larger data sets, but first I need to find a parallel Zipf distribution generator that has a compatible license (or write one, which I'd rather not do).
# There are places in the code (particularly the map reduce code) where path names etc. are hard-wired to locations in my test setup. These need to be generalized.

> Pig Performance Benchmarks
> --------------------------
>
>                 Key: PIG-200
>                 URL: https://issues.apache.org/jira/browse/PIG-200
>             Project: Pig
>          Issue Type: Task
>            Reporter: Amir Youssefi
>         Attachments: generate_data.pl, perf.patch
>
>
> To benchmark Pig performance, we need a TPC-H-like large data set plus a script collection. This will be used to compare different Pig releases, and Pig against other systems (e.g. Pig + Hadoop vs. Hadoop only).
> Here is the wiki for small tests: http://wiki.apache.org/pig/PigPerformance
>
> I am currently running long-running Pig scripts over data sets on the order of tens of TBs. The next step is hundreds of TBs.
>
> We need an open large data set (with open source scripts that generate the data set) and detailed scripts for important operations such as ORDER, AGGREGATION, etc.
>
> We can call these the Pig Workouts: Cardio (short processing), Marathon (long-running scripts), and Triathlon (a mix).
>
> I will update this JIRA with more details of current activities soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
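For context on the single-threaded data generation issue noted above: Zipf sampling itself parallelizes easily once the cumulative distribution is precomputed, since each worker can draw independently with its own seed. A minimal inverse-CDF sketch in Java (a hypothetical illustration; this is neither the java-SDSU library nor code from the attached patch):

```java
import java.util.Random;

// Sketch of a zipf(s, N) sampler via inverse CDF over precomputed
// cumulative probabilities. Each thread can own its own instance
// (with its own Random seed), so data generation parallelizes
// trivially, unlike a single shared generator.
public class ZipfSampler {
    private final double[] cdf;  // cumulative probabilities for ranks 1..N
    private final Random rng;

    public ZipfSampler(int n, double exponent, long seed) {
        cdf = new double[n];
        double norm = 0.0;
        for (int k = 1; k <= n; k++) {
            norm += 1.0 / Math.pow(k, exponent);
        }
        double running = 0.0;
        for (int k = 1; k <= n; k++) {
            running += (1.0 / Math.pow(k, exponent)) / norm;
            cdf[k - 1] = running;
        }
        rng = new Random(seed);
    }

    // Returns a rank in [1, N], with P(rank = k) proportional to 1/k^exponent.
    public int next() {
        double u = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {  // binary search for the first cdf entry >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo + 1;
    }

    public static void main(String[] args) {
        // With exponent 1.0 and N = 1000, rank 1 should appear with
        // probability roughly 1/H_1000, i.e. about 13% of draws.
        ZipfSampler z = new ZipfSampler(1000, 1.0, 42L);
        int ones = 0, total = 100000;
        for (int i = 0; i < total; i++) {
            if (z.next() == 1) ones++;
        }
        System.out.println("rank-1 fraction: " + (double) ones / total);
    }
}
```

A parallel generator along these lines would give each mapper or thread a distinct seed and an independent `ZipfSampler`, removing the 48-hour single-threaded bottleneck for the 10m-row data set.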