[ 
https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722682#comment-13722682
 ] 

Cheolsoo Park commented on PIG-3325:
------------------------------------

[~dvryaboy], I think your sampling code is incorrect.
{code}
/**
 * Sample every 10th tuple until we reach a max of SPILL_SAMPLE_SIZE
 * to get an estimate of the tuple sizes.
 */
protected void sampleContents() {
    synchronized (mContents) {
        ...
        for (int i = sampled; iter.hasNext() && sampled < SPILL_SAMPLE_SIZE; i++) {
            if (i % SPILL_SAMPLE_FREQUENCY == 0) {
                aggSampleTupleSize += iter.next().getMemorySize();
                sampled += 1;
            }
        }
    }
}
{code}
The iterator is advanced only when a tuple is sampled, not on every loop 
iteration, so you're sampling consecutive tuples instead of every 10th. Don't 
you need to add an else block that calls iter.next(), so the iterator always 
advances?
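
To illustrate what I mean, here is a minimal standalone sketch, not a patch against the actual class. The class name SamplingSketch, the method sampleEveryNth, and the constant values are hypothetical. It also collects the sampled elements rather than summing getMemorySize(), and starts counting from 0 rather than from sampled, just to make the behavior easy to check. The point is only that iter.next() is called on every pass through the loop:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SamplingSketch {
    // Hypothetical stand-ins for the constants referenced in the patch.
    static final int SPILL_SAMPLE_SIZE = 100;
    static final int SPILL_SAMPLE_FREQUENCY = 10;

    /**
     * Returns every SPILL_SAMPLE_FREQUENCY-th element of contents,
     * stopping once SPILL_SAMPLE_SIZE elements have been collected.
     */
    static <T> List<T> sampleEveryNth(Iterable<T> contents) {
        List<T> samples = new ArrayList<>();
        Iterator<T> iter = contents.iterator();
        for (int i = 0; iter.hasNext() && samples.size() < SPILL_SAMPLE_SIZE; i++) {
            // Advance the iterator on every pass, not only when sampling.
            T item = iter.next();
            if (i % SPILL_SAMPLE_FREQUENCY == 0) {
                samples.add(item);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Integer> contents = new ArrayList<>();
        for (int n = 0; n < 35; n++) {
            contents.add(n);
        }
        System.out.println(sampleEveryNth(contents)); // prints [0, 10, 20, 30]
    }
}
```

With the loop as written in the patch, the same input would yield the first elements of the bag in order (0, 1, 2, ...) rather than every 10th.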
                
> Adding a tuple to a bag is slow
> -------------------------------
>
>                 Key: PIG-3325
>                 URL: https://issues.apache.org/jira/browse/PIG-3325
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.11.1, 0.11.2
>            Reporter: Mark Wagner
>            Assignee: Dmitriy V. Ryaboy
>            Priority: Critical
>         Attachments: PIG-3325.2.patch, PIG-3325.3.patch, PIG-3325.demo.patch, 
> PIG-3325.optimize.1.patch
>
>
> The time it takes to add a tuple to a bag has increased significantly, 
> causing some jobs to take about 50x longer compared to 0.10.1. I've tracked 
> this down to PIG-2923, which has made adding a tuple heavier weight (it now 
> includes some memory estimation).
