[ 
https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Li updated PIG-483:
-----------------------

    Attachment: PIG-483.1.patch

Now that PIG-2779 is fixed, we know the runtime number of reducers of the 
orderby/skewjoin job before submitting the sample job. If that job uses only 
one reducer, we can skip the sample job. When the sample job is skipped, the 
orderby/skewjoin job changes as follows:

* do not add the partition file to the distributed cache, as there is no such 
file.
* do not set the specialized Partitioner, i.e. WeightedRangePartitioner for 
orderby and SkewedPartitioner for skewjoin.
* for skewjoin, do not load the partition file in POPartitionRearrange.
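
To make the list above concrete, here is a rough sketch of the decision, not 
the code in the attached patch. The class name, the configure() method and 
the configuration keys are all made up for illustration; only the partitioner 
names come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SampleSkipSketch {
    // Hypothetical stand-in for the job configuration; the keys below are
    // invented for this sketch and are not Pig or Hadoop property names.
    static Map<String, String> configure(String op, int runtimeReducers) {
        Map<String, String> conf = new LinkedHashMap<>();
        // With a single reducer there is nothing to partition, so no sample is needed.
        boolean skipSample = runtimeReducers == 1;
        conf.put("skip.sample", Boolean.toString(skipSample));
        if (!skipSample) {
            // The partition file only exists when the sample job actually ran.
            conf.put("partition.file.cached", "true");
            conf.put("partitioner", op.equals("orderby")
                    ? "WeightedRangePartitioner"
                    : "SkewedPartitioner");
            if (op.equals("skewjoin")) {
                // Models POPartitionRearrange loading the partition file.
                conf.put("rearrange.loads.partition.file", "true");
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(configure("orderby", 1));
        System.out.println(configure("skewjoin", 4));
    }
}
```

With one reducer the returned map holds only the skip flag; with more than one 
it also carries the partition file and partitioner settings.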

We then return the sample job as a SkipJob, whose status is set to successful 
so JobControl directly puts it in the successful job queue without submitting 
it. Then the SkipJob is processed just the same as regular jobs.
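
The SkipJob idea can be illustrated with a toy model. This is only a sketch 
under assumptions: Job, SkipJob, State and Controller below are simplified, 
hypothetical classes and do not reproduce Hadoop's JobControl API or the 
classes in the patch.

```java
import java.util.ArrayList;
import java.util.List;

class SkipJobSketch {
    enum State { WAITING, SUCCESS }

    // Simplified job: starts in WAITING and records whether it was submitted.
    static class Job {
        final String name;
        State state = State.WAITING;
        boolean submitted = false;
        Job(String name) { this.name = name; }
    }

    // A skipped job is created already in the SUCCESS state.
    static class SkipJob extends Job {
        SkipJob(String name) { super(name); state = State.SUCCESS; }
    }

    // Toy controller: submits only jobs that are not already successful.
    static class Controller {
        final List<Job> successful = new ArrayList<>();

        void run(List<Job> jobs) {
            for (Job j : jobs) {
                if (j.state != State.SUCCESS) {
                    j.submitted = true;      // stands in for cluster submission
                    j.state = State.SUCCESS; // assume the job succeeds
                }
                successful.add(j);           // same queue for both kinds of job
            }
        }
    }

    public static void main(String[] args) {
        Job sample = new SkipJob("sample");
        Job sort = new Job("orderby");
        Controller c = new Controller();
        c.run(List.of(sample, sort));
        System.out.println(sample.name + " submitted: " + sample.submitted);
        System.out.println(sort.name + " submitted: " + sort.submitted);
    }
}
```

Both jobs end up in the same successful list, so downstream handling does not 
need to distinguish the skipped sample job from a regular one.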

Any comments on this approach? I will work on unit tests if it looks good.
                
> PERFORMANCE: different strategies for large and small order bys
> ---------------------------------------------------------------
>
>                 Key: PIG-483
>                 URL: https://issues.apache.org/jira/browse/PIG-483
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>              Labels: gsoc2011, performance
>         Attachments: PIG-483.0.patch, PIG-483.1.patch
>
>
> Currently pig always does a multi-pass order by where it first determines a 
> distribution for the keys and then orders in a second pass.  This avoids the 
> necessity of having a single reducer.  However, in cases where the data is 
> small enough to fit into a single reducer, this is inefficient.  For small 
> data sets it would be good to realize the small size of the set and do the 
> order by in a single pass with a single reducer.
> This is a candidate project for Google summer of code 2011. More information 
> about the program can be found at http://wiki.apache.org/pig/GSoc2011

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
