[
https://issues.apache.org/jira/browse/DRILL-5083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15704033#comment-15704033
]
Paul Rogers edited comment on DRILL-5083 at 11/29/16 3:25 AM:
--------------------------------------------------------------
More info. The query in question had, in essence:
{code}
- Merge Join
. - Record iterator
. . - Iterator Validator Batch Iterator
. . . - Merge Join
. . . . - Record iterator
. . . . . - Iterator Validator Batch Iterator
. . . . . . - Sort
. . - Record iterator
. . . - Iterator Validator Batch Iterator
. . . . - Sort
. - Record iterator
. . - Iterator Validator Batch Iterator
. . . - Sort
{code}
Each Merge Join input seems to be wrapped in a RecordIterator and
IteratorValidatorBatchIterator. If the first sort above fails, its
RecordIterator learns about the error. But, the others do not, especially not
the third sort. When closing, we try to clean out batches by calling next().
This sort of works (with the exception explained above) for an the active
iterator above the failed sort. But, the other record iterators never saw the
error, so they are in their prime start-up state. When they get the
batch-cleaning next(), they think it is a real one and proceed to do real work.
So, the clean-up code needs to handle the case where it is cleaning up
operators down a path in the tree that did not fail. But, because of the way
sort and joins work, those operators may never even have started. Thus, they
can't tell the clean-up next from a "time to get started" next.
was (Author: paul-rogers):
More info. The query in question had, in essence:
{code}
- Merge Join
- Merge Join
- Sort
- Sort
- Sort
{code}
Each Merge Join input seems to be wrapped in a RecordIterator and
IteratorValidatorBatchIterator. If the first sort above fails, its
RecordIterator learns about the error. But, the others do not, especially not
the third sort. When closing, we try to clean out batches by calling next().
This sort of works (with the exception explained above) for an the active
iterator above the failed sort. But, the other record iterators never saw the
error, so they are in their prime start-up state. When they get the
batch-cleaning next(), they think it is a real one and proceed to do real work.
So, the clean-up code needs to handle the case where it is cleaning up
operators down a path in the tree that did not fail. But, because of the way
sort and joins work, those operators may never even have started. Thus, they
can't tell the clean-up next from a "time to get started" next.
> IteratorValidator does not handle RecordIterator cleanup call to next( )
> ------------------------------------------------------------------------
>
> Key: DRILL-5083
> URL: https://issues.apache.org/jira/browse/DRILL-5083
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.8.0
> Reporter: Paul Rogers
> Priority: Minor
>
> This one is very confusing...
> In a test with a MergeJoin and external sort, operators are stacked something
> like this:
> {code}
> Screen
> - MergeJoin
> - - External Sort
> ...
> {code}
> Using the injector to force a OOM in spill, the external sort threw a
> UserException up the stack. This was handed by:
> {code}
> IteratorValidatorBatchIterator.next( )
> RecordIterator.clearInflightBatches( )
> RecordIterator.close( )
> MergeJoinBatch.close( )
> {code}
> Which does the following:
> {code}
> // Check whether next() should even have been called in current state.
> if (null != exceptionState) {
> throw new IllegalStateException(
> {code}
> But, the exceptionState is set, so we end up throwing an
> IllegalStateException during cleanup.
> Seems the code should agree: if {{next( )}} will be called during cleanup,
> then {{next( )}} should gracefully handle that case.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)