[ 
https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679862#comment-13679862
 ] 

Dmitriy V. Ryaboy commented on PIG-3325:
----------------------------------------

[~mwagner] thanks for catching this perf regression.
I only had time for a cursory look today -- why is the existing code O(n)? 
It seems to sample up to 100 elements and no more, so the cost is constant 
(once n >= 100). As far as I can tell, all that materially changed is that you 
added the sampling bit to add(). Unfortunately, a number of Bags override add() 
(see my notes in PIG-2923), which makes doing this in the default add() of the 
abstract class unreliable.
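
To illustrate the concern about overrides, here is a minimal sketch. BaseBag, 
CustomBag and PlainBag are hypothetical stand-ins, not Pig's actual bag 
classes, and the "sampled" counter merely stands in for whatever sampling 
bookkeeping a default add() would do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical base class that does its sampling bookkeeping in add(). */
abstract class BaseBag {
    protected final List<String> contents = new ArrayList<>();
    protected int sampled = 0; // number of adds that went through the sampling path

    public void add(String t) {
        sampled++;             // stand-in for the sampling work
        contents.add(t);
    }
}

/** Hypothetical subclass that overrides add() without calling super.add(). */
class CustomBag extends BaseBag {
    @Override
    public void add(String t) {
        contents.add(t);       // the bookkeeping in BaseBag.add() never runs
    }
}

/** Hypothetical subclass that inherits the default add(). */
class PlainBag extends BaseBag { }
```

A CustomBag accumulates elements while its sample state stays empty, so any 
size estimate derived from that state would be wrong for such bags.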

Seems to me like a better approach would be to tackle the fact that every time 
getMemorySize() is called while there are fewer than 100 elements, we iterate 
over the whole bag (is that what you mean by O(n)?). We can do this by jumping 
directly to the mLastContentsSize'th element in the Bag, if we know the 
structure, or at least by iterating to it without calling getMemorySize() on 
the elements we skip, and then adding the new elements to our running avg 
rather than recomputing it. So: no resetting aggSampleTupleSize in your 
version, or avgTupleSize in mine, to 0 when sampling; just ignore the first 
mLastContentsSize elements of the iterator.
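
A rough sketch of that idea, under stated assumptions: SampledBag and its 
Sized element type are hypothetical, not Pig's real bag or Tuple classes; the 
field names aggSampleTupleSize and mLastContentsSize are borrowed from the 
discussion; and the estimate is simply average sampled size times element 
count, with no fixed overhead terms. The sizeCalls counter exists only to show 
how often the per-element size is computed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a bag; not Pig's DefaultAbstractBag. */
class SampledBag {
    /** Hypothetical element type whose size is costly to compute. */
    interface Sized { long getMemorySize(); }

    private static final int SAMPLE_LIMIT = 100; // sample up to 100 elements and no more

    private final List<Sized> contents = new ArrayList<>();
    private long aggSampleTupleSize = 0; // running sum of sampled sizes, never reset
    private int mLastContentsSize = 0;   // how many elements have been sampled so far
    int sizeCalls = 0;                   // demo instrumentation only

    void add(Sized t) {
        contents.add(t);                 // add() itself does no sampling
    }

    long getMemorySize() {
        int target = Math.min(contents.size(), SAMPLE_LIMIT);
        if (mLastContentsSize < target) {
            Iterator<Sized> it = contents.iterator();
            // Skip the already-sampled prefix without asking for element sizes.
            for (int i = 0; i < mLastContentsSize; i++) {
                it.next();
            }
            // Fold only the new elements into the running sum.
            while (mLastContentsSize < target) {
                aggSampleTupleSize += it.next().getMemorySize();
                sizeCalls++;
                mLastContentsSize++;
            }
        }
        if (mLastContentsSize == 0) {
            return 0;
        }
        long avg = aggSampleTupleSize / mLastContentsSize;
        return avg * contents.size();
    }
}
```

With this shape, each element's size is computed at most once and never after 
the first 100 elements, no matter how often getMemorySize() is called. A bag 
backed by a random-access structure could jump straight to the 
mLastContentsSize'th element instead of iterating to it.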

Thoughts?
                
> Adding a tuple to a bag is slow
> -------------------------------
>
>                 Key: PIG-3325
>                 URL: https://issues.apache.org/jira/browse/PIG-3325
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.11, 0.11.1, 0.11.2
>            Reporter: Mark Wagner
>            Assignee: Mark Wagner
>            Priority: Critical
>         Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch
>
>
> The time it takes to add a tuple to a bag has increased significantly, 
> causing some jobs to take about 50x longer compared to 0.10.1. I've tracked 
> this down to PIG-2923, which has made adding a tuple heavier weight (it now 
> includes some memory estimation).
