Hadoop QA commented on PIG-733:

-1 overall.  Here are the results of testing the latest attachment 
  against trunk revision 759376.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified 
                        tests. Please justify why no tests are needed for this 
                        patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The applied patch generated 207 javac compiler warnings (more 
than the trunk's current 200 warnings).

    -1 findbugs.  The patch appears to introduce 5 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
Findbugs warnings: 
Console output: 

This message is automatically generated.

> Order by sampling dumps entire sample to hdfs which causes dfs "FileSystem 
> closed" error on large input
> -------------------------------------------------------------------------------------------------------
>                 Key: PIG-733
>                 URL: https://issues.apache.org/jira/browse/PIG-733
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Pradeep Kamath
>            Assignee: Pradeep Kamath
>             Fix For: 0.3.0
>         Attachments: PIG-733.patch
> Order by has a sampling job which samples the input and creates a sorted list 
> of sample items. Currently the number of items sampled is 100 per map task. 
> So if the input is large, resulting in many maps (say 50,000), the sample is 
> big. This sorted sample is stored on dfs. The WeightedRangePartitioner 
> computes quantile boundaries and weighted probabilities for repeating values 
> in each map by reading the samples file from DFS. In queries with many maps 
> (on the order of 50,000) the dfs read of the sample file fails with a 
> "FileSystem closed" error. This seems to point to a dfs issue wherein a big 
> dfs file being read simultaneously by many dfs clients (in this case all 
> maps) causes the clients to be closed. However, on the pig side, loading the 
> sample from each map in the final map reduce job and computing the quantile 
> boundaries and weighted probabilities is inefficient. We should do this 
> computation through a FindQuantiles udf in the same map reduce job which 
> produces the sorted samples. This way less data is written to dfs, and in 
> the final map reduce job the WeightedRangePartitioner needs to just load the 
> computed information.
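To make the proposed change concrete, here is a minimal sketch of the quantile-boundary idea the description refers to: pick partition boundaries from a sorted sample once, up front, so that each partitioner only has to binary-search the precomputed boundaries instead of re-reading and re-processing the whole sample file. This is an illustration only, not Pig's actual FindQuantiles udf or WeightedRangePartitioner; the function names and the toy data are assumptions for the example.

```python
import bisect

def find_quantiles(sorted_samples, num_partitions):
    # Pick num_partitions - 1 boundary values that split the sorted
    # sample list into roughly equal-sized ranges. This is the kind of
    # computation the description proposes doing once, in the sampling
    # job, instead of in every map of the final job.
    n = len(sorted_samples)
    return [sorted_samples[i * n // num_partitions]
            for i in range(1, num_partitions)]

def partition(key, boundaries):
    # Route a key to the reducer whose range contains it; with the
    # boundaries precomputed, each map only does a binary search.
    return bisect.bisect_right(boundaries, key)

samples = sorted([17, 3, 44, 29, 8, 91, 55, 12, 60, 75, 21, 38])
bounds = find_quantiles(samples, 4)   # 3 boundaries for 4 reducers
parts = [partition(k, bounds) for k in (5, 30, 90)]
```

The sketch omits the weighted probabilities for repeated values that the real partitioner also computes; the point is only that the expensive step (deriving boundaries from the full sample) happens once, and the per-map work shrinks to a lookup.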

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
