[ https://issues.apache.org/jira/browse/PIG-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395567#comment-14395567 ]

Daniel Dai commented on PIG-4485:
---------------------------------

bq. 1. If the hadoop admin has good experience with how many reducers should 
be used, why not let the hadoop admin decide the number of reducers for the 
"real" MR job?
The sample job does not decide the number of reducers; it decides the key 
range for each reducer.
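To make this concrete: with R reducers, the sorted sample yields R-1 boundary 
keys, and every record is routed to the reducer whose range contains its key, 
so each reducer gets a contiguous, roughly equal share of the key space. A 
minimal sketch of the idea in Java (illustrative only; Pig's real logic lives 
in WeightedRangePartitioner and also weights boundaries by key frequency):
{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch, not Pig code: derive per-reducer key ranges
// from a sorted sample, then route each key by binary search.
public class RangeCutpoints {

    // Choose (numReducers - 1) boundary keys at evenly spaced
    // quantiles of the sample.
    static List<Integer> cutpoints(List<Integer> sample, int numReducers) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> bounds = new ArrayList<>();
        for (int r = 1; r < numReducers; r++) {
            bounds.add(sorted.get(r * sorted.size() / numReducers));
        }
        return bounds;
    }

    // Route a key to the reducer whose range contains it.
    static int reducerFor(int key, List<Integer> bounds) {
        int i = Collections.binarySearch(bounds, key);
        return i >= 0 ? i : -(i + 1);
    }
}
{code}
For example, with 400 sampled keys and 4 reducers, the boundaries land near 
the 25th, 50th and 75th percentiles, so each reducer receives about a quarter 
of the rows.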
bq. 2. If we "set pig.random.sampler.sample.size 0", the sampler will sample 
0 rows. Why don't we just disable the sampler in this case?
If "pig.random.sampler.sample.size" is 0, yes, sample job is not needed. 
However, "pig.random.sampler.sample.size" should not be 0 by design. We do need 
samples in order to determine the key range for every reduce. 
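To illustrate the degenerate case (a hypothetical guard in Java, not Pig's 
actual code; only the property name and its default of 100 come from this 
thread):
{code}
import java.util.Properties;

// Hypothetical sketch: why a sample size of 0 cannot simply mean
// "skip sampling" -- with no sampled keys there is nothing from
// which to derive the per-reducer key ranges.
public class SampleSizeGuard {
    static int samplesPerTask(Properties conf) {
        int n = Integer.parseInt(
            conf.getProperty("pig.random.sampler.sample.size", "100"));
        if (n <= 0) {
            throw new IllegalArgumentException(
                "pig.random.sampler.sample.size must be > 0 for ORDER BY");
        }
        return n;
    }
}
{code}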
bq. 3. Per our tests in house, the sampler job reads all bytes of all files, 
so the "HDFS reads" stat for the sampler job is the same as for the "real" MR 
job. This could be another issue: why does the sampler job need to read all 
the bytes of all files? My assumption is that it should read 100 records (by 
default) from each file and then stop reading that file, right?
Yes, the reservoir sampling algorithm needs to scan all the inputs. Reading 
only the first 100 records would defeat the purpose of a random sample. In 
practice, the sampling job works very well; it is usually very small compared 
to the sorting job (the next job). I am not sure why it is so slow in the 
Parquet case. Also, are you using HDFS or S3?
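For background, reservoir sampling keeps a uniform random sample of k records 
from a stream of unknown length, and by construction it has to visit every 
record once. A minimal sketch of Algorithm R in Java (not Pig's 
RandomSampleLoader itself):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal reservoir sampling sketch (Algorithm R). Note the loop must
// consume the whole stream: stopping after the first k records would
// bias the sample toward the beginning of the input.
public class Reservoir {
    static <T> List<T> sample(Iterable<T> stream, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        for (T item : stream) {
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);        // fill the reservoir first
            } else {
                // keep this item with probability k / seen
                long j = (long) (rnd.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
{code}
This is also why the sampler's "HDFS reads" match the real job's: the sample 
itself is small, but producing it still requires a full pass over the input.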

> Can Pig disable RandomSampleLoader when doing "Order by"
> --------------------------------------------------------
>
>                 Key: PIG-4485
>                 URL: https://issues.apache.org/jira/browse/PIG-4485
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Hao Zhu
>            Priority: Critical
>
> When reading parquet files with "order by":
> {code}
> a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
> b = order a by col1 ;
> c = limit b 100 ;
> dump c;
> {code}
> Pig always spawns a sampler job at the beginning:
> {code}
> Job Stats (time in seconds):
> JobId                   Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReduceTime  Alias  Feature            Outputs
> job_1426804645147_1270  1     1        8           8           8           8              4              4              4              4                 b      SAMPLER
> job_1426804645147_1271  1     1        10          10          10          10             4              4              4              4                 b      ORDER_BY,COMBINER
> job_1426804645147_1272  1     1        2           2           2           2              4              4              4              4                 b                         hdfs:/tmp/temp-xxx/tmp-xxx,
> {code}
> The issue is that, when reading lots of files, the first sampler job can 
> take a long time to finish.
> The ask is:
> 1. Is the sampler job a must to implement "order by"?
> 2. If no, is there any way to disable RandomSampleLoader manually?
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
