[ 
https://issues.apache.org/jira/browse/PIG-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-4148:
-------------------------------
    Attachment: popackage.log
                generate_sample.py
                samples_logs.tar.gz

[~daijy], [~rohini], I am attaching logs from my job.
# samples_logs.tar.gz contains the following logs from 300 sampling tasks:
{code}
if (rand < numSamples) {
    log.info("XXX: rand: " + rand + " rowProcessed: " + rowProcessed
            + " sample: " + res.result);
    samples[rand] = res;
}
{code}
# generate_sample.py is a Python script that mimics POReservoirSample and 
builds the final samples bag. After untarring samples_logs.tar.gz, you can run 
it with the following command:
{code}
python ./generate_sample.py
{code}
This shows that the total size of the samples bag is 30,000.
# Finally, popackage.log contains the following log output from the POPackage 
(POShuffleTezLoad) of the sampling aggregate vertex:
{code}
for (Object val : vals) {
     NullableTuple nTup = (NullableTuple) val;
     int index = nTup.getIndex();
     Tuple tup = pkgr.getValueTuple(keyWritable, nTup, index);
     if (pkgr.getKeyType() == DataType.BYTEARRAY) {
         LOG.info("XXX samples in POPackage: "+tup);
     }
     bag.add(tup);
}
{code}
It shows that there are only 300 sample records.
{code}
$ wc -l popackage.log
300 popackage.log
{code}

Based on this observation, I think samples are not sent by the sampler vertex.
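
For reference, the expected count can be reproduced with a minimal sketch of per-task reservoir sampling. This is not the attached generate_sample.py and is not confirmed to match POReservoirSample line for line; the function names, the rows_per_task value, and the seed are hypothetical and chosen only for illustration. It assumes 300 tasks that each keep 100 samples, as described in this issue.

```python
import random


def reservoir_sample(rows, num_samples, rng):
    """Keep a uniform random sample of up to num_samples rows (Algorithm R)."""
    samples = []
    for row_processed, row in enumerate(rows):
        if row_processed < num_samples:
            # Fill the reservoir with the first num_samples rows.
            samples.append(row)
        else:
            # Pick a slot in [0, row_processed]; replace only if it falls
            # inside the reservoir, mirroring the `rand < numSamples` check.
            rand = rng.randint(0, row_processed)
            if rand < num_samples:
                samples[rand] = row
    return samples


def build_samples_bag(num_tasks=300, num_samples=100, rows_per_task=1000, seed=0):
    """Concatenate the samples of every task into one bag.

    rows_per_task and seed are hypothetical values for illustration only.
    """
    rng = random.Random(seed)
    bag = []
    for task in range(num_tasks):
        rows = ((task, i) for i in range(rows_per_task))
        bag.extend(reservoir_sample(rows, num_samples, rng))
    return bag


if __name__ == "__main__":
    print(len(build_samples_bag()))  # 300 tasks * 100 samples = 30000
```

As long as every task sees at least 100 rows, the bag size is 300 * 100 = 30,000, which is the number of samples that should reach the aggregate vertex.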

> Tez order-by is often skewed because FindQuantiles UDF is called with small 
> number
> ----------------------------------------------------------------------------------
>
>                 Key: PIG-4148
>                 URL: https://issues.apache.org/jira/browse/PIG-4148
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: generate_sample.py, popackage.log, samples_logs.tar.gz
>
>
> In Tez, the FindQuantiles UDF is called with fewer samples than in MR, 
> resulting in skewed range partitions.
> For example, I have a job that runs sampling with a parallelism of 300. Since 
> each task samples 100 records, the total sample should be 30K. But 
> FindQuantiles UDF is called with only 300 samples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
