[ 
https://issues.apache.org/jira/browse/PIG-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-733:
-------------------------------

    Attachment: PIG-733-v2.patch

> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem 
> closed" error on large input
> -------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-733
>                 URL: https://issues.apache.org/jira/browse/PIG-733
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>
>         Attachments: PIG-733-v2.patch, PIG-733.patch
>
>
> Order by runs a sampling job which samples the input and creates a sorted 
> list of sample items. Currently the number of items sampled is 100 per map 
> task, so if the input is large and results in many maps (say 50,000), the 
> sample is big (about 5 million items). This sorted sample is stored on dfs. 
> The WeightedRangePartitioner computes quantile boundaries and weighted 
> probabilities for repeating values in each map by reading the samples file 
> from dfs. In queries with many maps (on the order of 50,000) the dfs read of 
> the sample file fails with a "FileSystem closed" error. This seems to point 
> to a dfs issue wherein a big dfs file being read simultaneously by many dfs 
> clients (in this case all maps) causes the clients to be closed. However, on 
> the pig side, loading the sample in each map of the final map reduce job and 
> recomputing the quantile boundaries and weighted probabilities there is 
> inefficient. We should do this computation through a FindQuantiles udf in 
> the same map reduce job which produces the sorted samples. This way less 
> data is written to dfs, and in the final map reduce job the 
> WeightedRangePartitioner only needs to load the computed information.
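
As a rough illustration of the computation described above, the sketch below derives quantile boundaries and per-partition weights for repeated keys from an already sorted sample. It is not the code in the attached patch or Pig's actual FindQuantiles udf; the function name find_quantiles, its parameters, and the evenly spaced boundary selection are assumptions made for the example.

```python
from collections import defaultdict


def find_quantiles(sorted_samples, num_partitions):
    """Hypothetical helper: return (boundaries, weights) for a sorted sample.

    boundaries: num_partitions - 1 keys taken at evenly spaced sample positions.
    weights:    for each key whose samples span several partitions, the
                fraction of that key's samples falling in each partition.
    """
    n = len(sorted_samples)

    # Evenly spaced positions in the sorted sample give the quantile boundaries.
    boundaries = [sorted_samples[(i * n) // num_partitions]
                  for i in range(1, num_partitions)]

    # Count, per key, how many of its samples land in each partition slot.
    counts = defaultdict(lambda: defaultdict(int))
    for idx, key in enumerate(sorted_samples):
        partition = (idx * num_partitions) // n
        counts[key][partition] += 1

    # Only keys that repeat across more than one partition need weights.
    weights = {}
    for key, per_partition in counts.items():
        if len(per_partition) > 1:
            total = sum(per_partition.values())
            weights[key] = {p: c / total for p, c in per_partition.items()}

    return boundaries, weights


# Toy sample in which the key 5 dominates and straddles several partitions.
boundaries, weights = find_quantiles([1, 2, 5, 5, 5, 5, 5, 5, 8, 9], 5)
```

Under these assumptions, only the small (boundaries, weights) result would need to be written out and loaded by the partitioner, rather than the full sorted sample.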

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
