Alan Gates updated PIG-200:

    Attachment: perf.patch

The attached patch takes a different approach to providing a set of 
benchmarks for Pig.  It contains a set of 14 queries designed to cover a 
range of ways users use Pig.  It also includes implementations of the same 
queries in Java map reduce code, so that developers can compare Pig 
performance against raw map reduce performance.  See 
http://wiki.apache.org/pig/PigMix for information on how the queries were 
chosen, how the data is constructed, and results from an initial run of Pig 
0.1.0 versus the soon-to-be-released 0.2.0.

This attachment is not yet ready for inclusion in the code.  It has several issues:

# The library used to generate the Zipf distributions in the data is under the 
GNU General Public License (GPL), and thus cannot be included.  The library can 
be obtained at http://www.eli.sdsu.edu/java-SDSU/
# The data generation script is single-threaded because the Zipf distribution 
generator is.  This means generating 10M rows of data (about 15G) takes ~48 
hours.  I'd like to be able to generate larger data sets, but first I need to 
find a parallel Zipf distribution generator that has a compatible license (or 
write one, which I don't really want to do).
# There are places in the code (particularly the map reduce code) where path 
names etc. are hard-wired to locations in my test setup.  These need to be 
made configurable before the patch can be committed.
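On the license problem in item 1: a simple self-contained Zipf sampler is not hard to sketch, which may be an alternative to finding a compatible library.  The class below is only an illustrative sketch (the class and method names are mine, not from the patch); it draws ranks with probability proportional to 1/rank^s via a precomputed cumulative distribution and a binary search.

```java
import java.util.Random;

// Illustrative sketch of a Zipf sampler, as a possible replacement for the
// GPL'd java-SDSU generator.  Names here are hypothetical, not from perf.patch.
public class ZipfSampler {
    private final double[] cdf;  // cumulative probabilities over ranks 1..n
    private final Random rng;

    public ZipfSampler(int n, double exponent, long seed) {
        cdf = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            // unnormalized weight of rank i+1 is 1 / (i+1)^exponent
            sum += 1.0 / Math.pow(i + 1, exponent);
            cdf[i] = sum;
        }
        for (int i = 0; i < n; i++) {
            cdf[i] /= sum;  // normalize so cdf[n-1] == 1.0
        }
        rng = new Random(seed);
    }

    // Returns a rank in [1, n]; rank 1 is the most frequent value.
    public int next() {
        double u = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {  // binary search for the first index with cdf >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) lo = mid + 1;
            else hi = mid;
        }
        return lo + 1;
    }
}
```

Since each sampler instance owns its own Random, the single-threaded bottleneck in item 2 could in principle be addressed by running one instance per thread with distinct seeds, though I have not verified the resulting data matches what the current script produces.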

> Pig Performance Benchmarks
> --------------------------
>                 Key: PIG-200
>                 URL: https://issues.apache.org/jira/browse/PIG-200
>             Project: Pig
>          Issue Type: Task
>            Reporter: Amir Youssefi
>         Attachments: generate_data.pl, perf.patch
> To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
> plus Script Collection. This is used in comparison of different Pig releases, 
> Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
> Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data-sets in the order 
> of tens of TBs. Next step is hundreds of TBs.
> We need to have an open large-data set (open source scripts which generate 
> data-set) and detailed scripts for important operations such as ORDER. 
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
> running scripts) and Triathlon (Mix). 
> I will update this JIRA with more details of current activities soon.
