Alan Gates updated PIG-200:
The attached patch takes a different approach to providing a set of
benchmarks for Pig. It contains 14 queries designed to cover a range of ways
users use Pig. It also includes implementations of the same queries as Java
map reduce code, so that developers can compare Pig performance against map
reduce performance. See
http://wiki.apache.org/pig/PigMix for information on how the queries were
chosen, how the data is constructed, and results from an initial run of Pig
0.1.0 versus the soon-to-be Pig 0.2.0.
This attachment is not ready for inclusion in the code. It has several issues.
# The library used to generate the Zipf distributions in the data is under the
GNU public license, and thus cannot be included. The library can be obtained
# The data generation script is single-threaded because the Zipf distribution
generator is. As a result, generating 10m rows of data (about 15G) takes ~48
hours. I'd like to be able to generate larger data sets, but first I need to
find a parallel Zipf distribution generator that has a compatible license (or
write one, which I don't really want to do).
# There are places in the code (particularly the map reduce code) where path
names etc. are hard-wired to locations in my test setup. These need to be
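
The second issue above asks for a Zipf generator that can run in parallel. One possible shape for such a generator is sketched below in Python. This is not the attached generate_data.pl and not the GPL library mentioned above; every name here (zipf_cdf, generate_chunk, generate, base_seed, chunk_size) is hypothetical, and the exponent s and the number of distinct values n are assumed parameters. The idea is to split the rows into chunks and give each chunk its own seeded random generator, so that chunks share no state and can be produced by separate workers.

```python
import bisect
import itertools
import random


def zipf_cdf(n, s=1.0):
    """Cumulative (unnormalised) Zipf weights 1/k**s for ranks 1..n."""
    return list(itertools.accumulate(1.0 / k ** s for k in range(1, n + 1)))


def generate_chunk(task):
    """Draw `count` Zipf-distributed ranks for one chunk.

    `task` is a (base_seed, chunk_index, count, n, s) tuple, so this
    function can be passed directly to a pool's map().
    """
    base_seed, chunk_index, count, n, s = task
    # Each chunk gets a private generator seeded from the chunk index,
    # so workers never share random state and output is reproducible.
    rng = random.Random(f"{base_seed}:{chunk_index}")
    # Recomputed per chunk for simplicity; a real tool would cache this.
    cdf = zipf_cdf(n, s)
    total = cdf[-1]
    # Inverse-CDF sampling: find where a uniform draw falls in the
    # cumulative weights; index i corresponds to rank i + 1.
    return [
        bisect.bisect_left(cdf, rng.random() * total) + 1
        for _ in range(count)
    ]


def generate(rows, n, s=1.0, chunk_size=1000, base_seed=42, map_fn=map):
    """Generate `rows` ranks chunk by chunk.

    map_fn defaults to the serial built-in map. Passing the map method
    of a multiprocessing.Pool runs the chunks in separate processes.
    """
    tasks = [
        (base_seed, i, min(chunk_size, rows - start), n, s)
        for i, start in enumerate(range(0, rows, chunk_size))
    ]
    out = []
    for chunk in map_fn(generate_chunk, tasks):
        out.extend(chunk)
    return out
```

Because the seed depends only on base_seed and the chunk index, the output is the same whether the chunks are produced serially or by a pool of workers, which keeps benchmark data reproducible across runs.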
> Pig Performance Benchmarks
> Key: PIG-200
> URL: https://issues.apache.org/jira/browse/PIG-200
> Project: Pig
> Issue Type: Task
> Reporter: Amir Youssefi
> Attachments: generate_data.pl, perf.patch
> To benchmark Pig performance, we need a TPC-H-like large data set plus a
> script collection. These would be used to compare different Pig releases and
> to compare Pig against other systems (e.g. Pig + Hadoop vs. Hadoop only).
> Here is the wiki page for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data sets on the order
> of tens of TBs. The next step is hundreds of TBs.
> We need to have an open large data set (open source scripts which generate
> the data set) and detailed scripts for important operations such as ORDER,
> AGGREGATION etc.
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long
> running scripts) and Triathlon (Mix).
> I will update this JIRA with more details of current activities soon.