[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

Alan Gates (JIRA) Tue, 26 May 2009 18:02:10 -0700

     [ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alan Gates updated PIG-460:
---------------------------

    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)

After some testing by Amir Youssefi, we determined that making this change 
actually makes performance worse. Changing RandomSampleLoader into an EvalFunc 
means that all records in the file have to be read and parsed. Since Hadoop 
efficiently supports skipping in the input stream, which lets a loader sample 
without touching most records, reading and parsing every record is very 
expensive by comparison. Instead we will pursue making RandomSampleLoader 
subsume the user's loader to avoid requiring a third MR job (see PIG-820).
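
The cost difference described above can be illustrated with a small sketch. It is not Pig's or Hadoop's code: the in-memory stream, the newline-delimited record format, and the names `parse`, `sample_by_skipping`, and `sample_by_full_scan` are all assumptions made for illustration. It compares how many records get parsed when sampling by seeking to random offsets versus sampling from inside an operator that sees every record.

```python
import io
import random


def parse(line: bytes) -> int:
    """Stand-in for a loader's per-record parsing cost (hypothetical)."""
    return int(line.decode("ascii").strip())


def sample_by_skipping(stream, size, n, rng):
    """Seek to n random offsets and parse only the record that follows each."""
    samples, parsed = [], 0
    for _ in range(n):
        stream.seek(rng.randrange(size))
        stream.readline()        # discard the partial record we landed in
        line = stream.readline()
        if not line:             # landed in the last record; wrap to the start
            stream.seek(0)
            line = stream.readline()
        samples.append(parse(line))
        parsed += 1
    return samples, parsed


def sample_by_full_scan(stream, n, rng):
    """Reservoir sampling: every record is read and parsed."""
    stream.seek(0)
    reservoir, parsed = [], 0
    for i, line in enumerate(stream):
        value = parse(line)
        parsed += 1
        if i < n:
            reservoir.append(value)
        else:
            j = rng.randrange(i + 1)
            if j < n:
                reservoir[j] = value
    return reservoir, parsed


# 10,000 newline-delimited integer records held in memory.
data = b"".join(b"%d\n" % i for i in range(10_000))
rng = random.Random(0)

skip_samples, skip_parsed = sample_by_skipping(io.BytesIO(data), len(data), 100, rng)
scan_samples, scan_parsed = sample_by_full_scan(io.BytesIO(data), 100, rng)
print(skip_parsed, scan_parsed)  # 100 10000
```

Both functions return 100 samples, but the skipping version parses 100 records while the full scan parses all 10,000. Note that the skipping version favors records that follow long records, a bias the toy ignores.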

> PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
> ------------------------------------------------------------
>
>                 Key: PIG-460
>                 URL: https://issues.apache.org/jira/browse/PIG-460
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>         Attachments: sampler.patch, sampler2.patch
>
>
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so 
> needs the data in that format to read it.
> If the logic in RandomSampleLoader were made into an operator instead of being 
> in a loader then jobs 1 and 2 could be merged.  On average job 1 takes about 
> 15% of the time of an order by script.
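
The sample, quantile, and sort steps in the description above can be sketched in a few lines. This is a single-process illustration under stated assumptions, not Pig's implementation: the function names `find_quantiles` and `range_partition_sort`, the sample size of 200, and the four partitions are all hypothetical choices.

```python
import bisect
import random


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 cut points from a sorted sample (stands in for job 2)."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * k // num_partitions]
            for k in range(1, num_partitions)]


def range_partition_sort(records, cuts):
    """Route each record to a partition by cut point, then sort each partition
    independently (stands in for job 3)."""
    partitions = [[] for _ in range(len(cuts) + 1)]
    for r in records:
        partitions[bisect.bisect_right(cuts, r)].append(r)
    return [sorted(p) for p in partitions]


rng = random.Random(0)
records = [rng.randrange(1_000_000) for _ in range(5_000)]

sample = rng.sample(records, 200)          # a small random sample of the input
cuts = find_quantiles(sample, 4)           # 3 cut points for 4 partitions
parts = range_partition_sort(records, cuts)

# Concatenating the partitions in order yields a globally sorted result.
merged = [r for p in parts for r in p]
assert merged == sorted(records)
```

Because every record in partition i is smaller than every record in partition i+1, each partition can be sorted on its own and the outputs simply concatenated. The quality of the sample only affects how evenly the partitions are balanced, not the correctness of the sort.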
