[
https://issues.apache.org/jira/browse/TEZ-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199423#comment-16199423
]
Jacob Tolar commented on TEZ-3849:
----------------------------------
[~kshukla] --
"add hasNext to the original iterator" - yeah, that's the obvious solution but,
as you said, it involves changes several other places. I thought this might be
a little simpler. Some of the implementation changes are super straightforward
but a couple would need a little scrutiny to make sure that hasNext() and
next() play nice with each other (MergeQueue and PartitionFilter in particular).
* DefaultSorter.MRResultIterator
* PipelinedSorter.PartitionedRawKeyValueIterator
* PipelinedSorter.SpanIterator
* PipelinedSorter.PartitionFilter
* PipelinedSorter.SpanMerger
* TezMerger.MergeQueue
* TezMerger.EmptyIterator
If that's what you prefer let me know and I can send another patch.
Unfortunately "just reinitialize kvIter" doesn't work --
{{merger.filter(partition)}} just calls reset() on its inner iterator, which
does nothing but reset the PartitionFilter partition number. The iterator
PartitionFilter wraps will already have been advanced (and it may be marked
dirty already, etc). I don't know this code well enough to say if that's a bug
in reset() or not, I just java debugger'd my way here. :) See:
https://github.com/apache/tez/blob/c82b2eadea9d75db4dd47a455c0b4d6f7dda2cc8/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/PipelinedSorter.java#L1234-L1276
.
> Combiner+PipelinedSorter silently drops records
> -----------------------------------------------
>
> Key: TEZ-3849
> URL: https://issues.apache.org/jira/browse/TEZ-3849
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jacob Tolar
> Assignee: Jacob Tolar
> Attachments: TEZ-3849.1.patch, TEZ-3849.2.patch
>
>
> This bug was introduced in
> https://github.com/apache/tez/commit/a47e8fcbea5eeab5a7cf812271d329524cc02dba?diff=split
>
> when combiner != null, the change in this commit passes kvIter with next()
> having already been called. This ends up (silently) dropping the first record
> in the partition.
> Will submit PR and attach patch. [~jeagles], not sure if this is the way you
> want to fix or not but it does fix my tests.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)