[ 
https://issues.apache.org/jira/browse/DRILL-5083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093486#comment-16093486
 ] 

Roman commented on DRILL-5083:
------------------------------

[~paul-rogers],
During the investigation I added more logging in Drill and observed the following: 
when the fragment (the one that eventually hangs) changed its state to 
CANCELLATION_REQUESTED, we could see in *MergeJoinBatch.innerNext()* that 
*status.getLeftStatus() = OK_NEW_SCHEMA* and *status.getRightStatus() = STOP*. 
In that case *status.getOutcome()* returns *SCHEMA_CHANGED*, which ultimately 
leads to an infinite loop. 

So my proposed solution is to add the following lines to *JoinStatus.getOutcome()*, 
before the check on *eitherMatches(IterOutcome.OK_NEW_SCHEMA)*:
{code:Java}
    if (eitherMatches(IterOutcome.STOP)) {
      return JoinOutcome.FAILURE;
    }
{code}
In this case the result of *status.getOutcome()* will be *FAILURE*, and Drill 
will clean up the left and right batches. With this change I can no longer 
reproduce the situation where the fragment hangs in the CANCELLATION_REQUESTED 
state, so I think it should fix the infinite-loop problem in MergeJoin.
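
For context, here is a rough sketch of how the added check could sit inside 
*JoinStatus.getOutcome()*; apart from *eitherMatches()* and the outcome values 
quoted above, the surrounding structure is my assumption, not a copy of the 
actual source:
{code:Java}
  // Sketch only: the branch order illustrates the proposal, not the real method body.
  public JoinOutcome getOutcome() {
    // Proposed addition: a STOP on either input is terminal, so report FAILURE
    // before the schema check can return SCHEMA_CHANGED.
    if (eitherMatches(IterOutcome.STOP)) {
      return JoinOutcome.FAILURE;
    }
    // Existing behavior described above: with left = OK_NEW_SCHEMA and
    // right = STOP, this branch used to win and innerNext() kept looping.
    if (eitherMatches(IterOutcome.OK_NEW_SCHEMA)) {
      return JoinOutcome.SCHEMA_CHANGED;
    }
    // ... remaining outcome handling unchanged ...
    return JoinOutcome.BATCH_RETURNED;
  }
{code}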


Also, I want to note that during the investigation I sometimes got an unchecked 
*UserException* in *ExternalSortBatch* (see DRILL-5058). For example:
{noformat}
2017-07-19 15:01:44,771 [26932ef1-9719-764e-c736-9f8120c8e652:frag:10:1] INFO  
o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred: External Sort 
encountered an error while spilling to disk (Unable to allocate buffer of size 
65536 (rounded from 39321) due to memory limit. Current allocation: 19961472)
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External Sort 
encountered an error while spilling to disk

Unable to allocate buffer of size 65536 (rounded from 39321) due to memory 
limit. Current allocation: 19961472

[Error Id: fe111f04-6f94-4dbb-95d1-36a74abc0979 ]
        at 
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
 ~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:618)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.innerNext(ExternalSortBatch.java:423)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:171)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:128)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:114)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:140)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:171)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) 
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:144)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) 
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at java.security.AccessController.doPrivileged(Native Method) 
[na:1.8.0_131]
        at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_131]
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
 [hadoop-common-2.7.0-mapr-1607.jar:na]
        at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) 
[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_131]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: Unable to 
allocate buffer of size 65536 (rounded from 39321) due to memory limit. Current 
allocation: 19961472
        at 
org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:238) 
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:213) 
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.memory.BaseAllocator.read(BaseAllocator.java:832) 
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.cache.VectorAccessibleSerializable.readVectors(VectorAccessibleSerializable.java:131)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.cache.VectorAccessibleSerializable.readFromStream(VectorAccessibleSerializable.java:108)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.xsort.BatchGroup.getBatch(BatchGroup.java:111)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.physical.impl.xsort.BatchGroup.getNextIndex(BatchGroup.java:137)
 ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        at 
org.apache.drill.exec.test.generated.PriorityQueueCopierGen2493.next(PriorityQueueCopierTemplate.java:76)
 ~[na:na]
        at 
org.apache.drill.exec.physical.impl.xsort.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:602)
 [drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
        ... 19 common frames omitted

{noformat}

When I added a catch for this error, it became harder to reproduce the hang with 
my scenario: the query can now fail before execution ever reaches 
*MergeJoinBatch*. So I think we should also consider handling this UserException 
in ExternalSortBatch.
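
The catch I experimented with was along these lines; this is only an 
illustration of the idea, with the exact call shape and the *context.fail()* 
call assumed rather than taken from the actual ExternalSortBatch code:
{code:Java}
    // Sketch only: catch the RESOURCE ERROR raised on the spill path so the
    // fragment fails cleanly instead of the UserException escaping unchecked.
    try {
      mergeAndSpill(batchGroups);
    } catch (UserException e) {
      // Report the failure and stop iterating, so upstream operators such as
      // MergeJoinBatch see a proper STOP rather than a half-torn-down child.
      context.fail(e);
      return IterOutcome.STOP;
    }
{code}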

> RecordIterator can sometimes restart a query on close
> -----------------------------------------------------
>
>                 Key: DRILL-5083
>                 URL: https://issues.apache.org/jira/browse/DRILL-5083
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Roman
>            Priority: Minor
>         Attachments: DrillOperatorErrorHandlingRedesign.pdf
>
>
> This one is very confusing...
> In a test with a MergeJoin and external sort, operators are stacked something 
> like this:
> {code}
> Screen
> - MergeJoin
> - - External Sort
> ...
> {code}
> Using the injector to force an OOM in spill, the external sort threw a 
> UserException up the stack. This was handled by:
> {code}
> IteratorValidatorBatchIterator.next( )
> RecordIterator.clearInflightBatches( )
> RecordIterator.close( )
> MergeJoinBatch.close( )
> {code}
> Which does the following:
> {code}
>       // Check whether next() should even have been called in current state.
>       if (null != exceptionState) {
>         throw new IllegalStateException(
> {code}
> But, the exceptionState is set, so we end up throwing an 
> IllegalStateException during cleanup.
> Seems the code should agree: if {{next( )}} will be called during cleanup, 
> then {{next( )}} should gracefully handle that case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
