Hao Zhu created PIG-4485:
----------------------------

             Summary: Can Pig disable RandomSampleLoader when doing "Order by"
                 Key: PIG-4485
                 URL: https://issues.apache.org/jira/browse/PIG-4485
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.13.0
            Reporter: Hao Zhu
            Priority: Critical


When reading parquet files with "order by":
{code}
a = load '/xxx/xxx/parquet/xxx.parquet' using ParquetLoader();
b = order a by col1 ;
c = limit b 100 ;
dump c
{code}

Pig spawns a Sampler job always in the begining:
{code}
Job Stats (time in seconds):
JobId   Maps    Reduces MaxMapTime      MinMapTIme      AvgMapTime      
MedianMapTime   MaxReduceTime   MinReduceTime   AvgReduceTime   
MedianReducetime        Alias   Feature Outputs
job_1426804645147_1270  1       1       8       8       8       8       4       
4       4       4       b       SAMPLER
job_1426804645147_1271  1       1       10      10      10      10      4       
4       4       4       b       ORDER_BY,COMBINER
job_1426804645147_1272  1       1       2       2       2       2       4       
4       4       4       b               hdfs:/tmp/temp-xxx/tmp-xxx,
{code}

The issue is when reading lots of files, the first sampler job can take a long 
time to finish.

The ask is:
1. Is the sampler job a must to implement "order by"?
2. If no, is there any way to disable RandomSampleLoader manually?

Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to