[ https://issues.apache.org/jira/browse/PIG-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13679862#comment-13679862 ]
Dmitriy V. Ryaboy commented on PIG-3325:
----------------------------------------
[~mwagner] thanks for catching this perf regression.
I only had time for a cursory look today -- why is the existing code O(n)?
Seems like it sampled up to 100 elements and no more, so it's constant (once
n>=100). Seems to me like all that materially changed was that you added the
sampling bit to add(). Unfortunately, a number of Bags override add() (see my
notes in PIG-2923), which makes doing this in the default add() of the abstract
class unreliable.
Seems to me like a better approach would be to tackle the fact that for every
time that getMemorySize() is called while there are fewer than 100 elements, we
iterate over the whole bag (which is what you mean by O(n)?). We can do this by
jumping directly to the mLastContentsSize'th element in the Bag, if we know the
structure, or at least iterate to it without calling getMemorySize(), and then
add to our running avg, rather than recomputing it. So, no resetting
aggSampleTupleSize in your version, or avgTupleSize in mine, to 0 when
sampling; just skip the first mLastContentsSize elements in the iterator.
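
A rough sketch of the idea, not a patch against the real bag classes. SampledBag,
MAX_SAMPLES, sizeOf() and elementsVisited are hypothetical names made up for
illustration, tuples are stood in for by Strings, and mLastContentsSize is used
here to mean "number of elements already sampled":

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bag; this is not Pig's DefaultAbstractBag.
class SampledBag {
    static final int MAX_SAMPLES = 100;   // sample cap discussed above

    private final List<String> contents = new ArrayList<>();
    private int mLastContentsSize = 0;    // elements already folded into the running sum
    private long aggSampleTupleSize = 0;  // running sum of sampled sizes, never reset to 0
    int elementsVisited = 0;              // instrumentation for this example only

    void add(String tuple) {
        contents.add(tuple);              // add() stays cheap: no sampling here
    }

    // Assumed per-element cost; a real tuple would report its own memory size.
    private static long sizeOf(String tuple) {
        return tuple.length();
    }

    long getMemorySize() {
        int limit = Math.min(contents.size(), MAX_SAMPLES);
        // Start at the first unsampled element; earlier ones are already in the sum.
        for (int i = mLastContentsSize; i < limit; i++) {
            aggSampleTupleSize += sizeOf(contents.get(i));
            elementsVisited++;
        }
        mLastContentsSize = limit;
        if (mLastContentsSize == 0) {
            return 0;
        }
        long avgTupleSize = aggSampleTupleSize / mLastContentsSize;
        return avgTupleSize * contents.size();
    }
}
```

With this shape each element is sized at most once across all getMemorySize()
calls, so repeated calls while the bag has fewer than 100 elements no longer
re-walk the whole bag. A bag without random access would have to advance its
iterator past the first mLastContentsSize elements instead of indexing.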
Thoughts?
> Adding a tuple to a bag is slow
> -------------------------------
>
> Key: PIG-3325
> URL: https://issues.apache.org/jira/browse/PIG-3325
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.11, 0.11.1, 0.11.2
> Reporter: Mark Wagner
> Assignee: Mark Wagner
> Priority: Critical
> Attachments: PIG-3325.demo.patch, PIG-3325.optimize.1.patch
>
>
> The time it takes to add a tuple to a bag has increased significantly,
> causing some jobs to take about 50x longer compared to 0.10.1. I've tracked
> this down to PIG-2923, which has made adding a tuple heavier weight (it now
> includes some memory estimation).