[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-200:
----------------------------------

    Attachment: pigmix_pig0.11.patch

Attaching a patch that works with Pig 0.11 (current trunk).

A few changes:

1) Removed explicit hardcoded parallelism inside the pig scripts to let them 
scale automatically.

2) All scripts respect the PIGMIX_DIR environment variable (so you don't have 
to have /user/pig on your cluster -- set $PIGMIX_DIR and use your own paths)

3) Made the shell scripts respect $HADOOP_CLASSPATH 

4) PigPerformanceLoader is now part of e2e, so I removed it from this patch. 
Fixed the e2e copy to proxy bytesToMap to its caster instead of throwing an 
exception.

5) Use /usr/bin/env to find perl, as not everyone has it in the same place. 
Also enabled strict and warnings in the Perl scripts.
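
As a rough illustration of points 2 and 3, here is a minimal shell sketch of 
the pattern. It is not copied from the patch: the function names and the 
pigmix.jar file name are hypothetical, and only the /user/pig default and the 
two environment variable names come from the notes above.

```shell
#!/bin/sh
# Hypothetical sketch, not taken from pigmix_pig0.11.patch.

# Print the PigMix base directory: $PIGMIX_DIR when it is set and non-empty,
# otherwise the /user/pig location mentioned in the notes above.
pigmix_dir() {
    printf '%s\n' "${PIGMIX_DIR:-/user/pig}"
}

# Print a classpath with $1 appended to whatever HADOOP_CLASSPATH the caller
# already exported, rather than overwriting it. No stray ':' is emitted when
# HADOOP_CLASSPATH is unset or empty.
extended_classpath() {
    printf '%s\n' "${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$1"
}

# Example use; pigmix.jar is an assumed name for illustration only.
echo "data dir:  $(pigmix_dir)"
echo "classpath: $(extended_classpath pigmix.jar)"
```

With this shape, someone without /user/pig on their cluster would only need 
to export PIGMIX_DIR before running the scripts.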
                
> Pig Performance Benchmarks
> --------------------------
>
>                 Key: PIG-200
>                 URL: https://issues.apache.org/jira/browse/PIG-200
>             Project: Pig
>          Issue Type: Task
>            Reporter: Amir Youssefi
>            Assignee: Alan Gates
>             Fix For: 0.2.0
>
>         Attachments: generate_data.pl, perf-0.6.patch, perf.hadoop.patch, 
> perf.patch, pigmix2.patch, pigmix_pig0.11.patch
>
>
> To benchmark Pig performance, we need to have a TPC-H like Large Data Set 
> plus Script Collection. This is used in comparison of different Pig releases, 
> Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop Only).
> Here is Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
> I am currently running long-running Pig scripts over data-sets in the order 
> of tens of TBs. Next step is hundreds of TBs.
> We need to have an open large-data set (open source scripts which generate 
> data-set) and detailed scripts for important operations such as ORDER, 
> AGGREGATION etc.
> We can call those the Pig Workouts: Cardio (short processing), Marathon (long 
> running scripts) and Triathlon (Mix). 
> I will update this JIRA with more details of current activities soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
