[
https://issues.apache.org/jira/browse/DRILL-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063113#comment-16063113
]
Roman commented on DRILL-4595:
------------------------------
I tried the steps from the previous comment on a Drill build that includes the
DRILL-5599 fix, and the problem seems to be solved.
After step 2) I got
{code}
Error: DATA_READ ERROR: Error reading page data
File: /drill/testdata/tpcds_sf100/parquet/web_sales/0_0_7.parquet
Column: ws_ship_hdemo_sk
Row Group Start: 7836969
Fragment 1:1
[Error Id: a8ca60e9-5ef7-42b7-b93a-fd7c69c06aef on node1:31010] (state=,code=0)
{code}
The Drillbit kept running (it did not go down) and the table files were cleaned
up. The UI also shows that the query finished correctly in the FAILED state.
Information from logs:
{code}
2017-06-26 13:35:30,518 [26aef278-acd0-2649-502d-636b78c58f66:frag:1:1] INFO
o.a.d.e.s.p.c.AsyncPageReader - User Error Occurred: Error reading page data
(Failure allocating buffer.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Error
reading page data
File: /drill/testdata/tpcds_sf100/parquet/web_sales/0_0_7.parquet
Column: ws_ship_hdemo_sk
Row Group Start: 7836969
[Error Id: a8ca60e9-5ef7-42b7-b93a-fd7c69c06aef ]
at
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:550)
~[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.handleAndThrowException(AsyncPageReader.java:185)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.nextInternal(AsyncPageReader.java:273)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.PageReader.next(PageReader.java:307)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.NullableColumnReader.processPages(NullableColumnReader.java:69)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.BatchReader.readAllFixedFieldsSerial(BatchReader.java:63)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.BatchReader.readAllFixedFields(BatchReader.java:56)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.BatchReader$FixedWidthReader.readRecords(BatchReader.java:141)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.BatchReader.readBatch(BatchReader.java:42)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.ParquetRecordReader.next(ParquetRecordReader.java:297)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:180)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:133)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:133)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:91)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at java.security.AccessController.doPrivileged(Native Method)
[na:1.8.0_131]
at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_131]
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
[hadoop-common-2.7.0-mapr-1607.jar:na]
at
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
[drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_131]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
Caused by: org.apache.drill.exec.exception.OutOfMemoryException: Failure
allocating buffer.
at
io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:64)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:4.0.27.Final]
at
org.apache.drill.exec.memory.AllocationManager.<init>(AllocationManager.java:80)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.bufferWithoutReservation(BaseAllocator.java:254)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:236)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:206)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.PageReader.allocateTemporaryBuffer(PageReader.java:376)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.decompress(AsyncPageReader.java:195)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.getDecompressedPageData(AsyncPageReader.java:146)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at
org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.nextInternal(AsyncPageReader.java:268)
[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
... 35 common frames omitted
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:694) ~[na:1.8.0_131]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
~[na:1.8.0_131]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
~[na:1.8.0_131]
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:437)
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179)
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:168)
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:98)
~[netty-buffer-4.0.27.Final.jar:4.0.27.Final]
at
io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.newDirectBufferL(PooledByteBufAllocatorL.java:165)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:4.0.27.Final]
at
io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.directBuffer(PooledByteBufAllocatorL.java:195)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:4.0.27.Final]
at
io.netty.buffer.PooledByteBufAllocatorL.allocate(PooledByteBufAllocatorL.java:62)
~[drill-memory-base-1.11.0-SNAPSHOT.jar:4.0.27.Final]
... 43 common frames omitted
{code}
So in this case we ran out of memory and the Drillbit cancelled the query
correctly. It seems DRILL-5599 fixes all similar issues with CTAS.
> FragmentExecutor.fail() should interrupt the fragment thread to avoid
> possible query hangs
> ------------------------------------------------------------------------------------------
>
> Key: DRILL-4595
> URL: https://issues.apache.org/jira/browse/DRILL-4595
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
> Fix For: Future
>
>
> When a fragment fails, it's assumed it will be able to close itself and send
> its FAILED state to the foreman, which will cancel any running fragments.
> FragmentExecutor.cancel() will interrupt the thread, making sure those
> fragments don't stay blocked.
> However, if a fragment is already blocked when its fail method is called, the
> foreman may never be notified about this and the query will hang forever. One
> such scenario is the following:
> - generally it's a CTAS running on a large cluster (lots of writers running
> in parallel)
> - logs show that the user channel was closed and UserServer caused the root
> fragment to move to a FAILED state
> - jstack shows that the root fragment is blocked in its receiver waiting for
> data
> - jstack also shows that ALL other fragments are no longer running, and the
> logs show that all of them succeeded
> - the foreman waits *forever* for the root fragment to finish
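
The hang described in the quoted issue can be reproduced in miniature with plain java.util.concurrent classes. The sketch below is only an illustration under stated assumptions: FragmentInterruptSketch, SketchFragment, State, runAndFail and reportedToForeman are hypothetical names, not Drill's FragmentExecutor or any other Drill class. A worker blocks on an empty queue (standing in for the receiver), and fail() either only flips the state or also interrupts the worker thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these types are Drill's real classes.
final class FragmentInterruptSketch {

    enum State { RUNNING, FAILED, FINISHED }

    static final class SketchFragment implements Runnable {
        // Stands in for the receiver the root fragment is blocked on.
        private final BlockingQueue<String> receiver = new LinkedBlockingQueue<>();
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final CountDownLatch started = new CountDownLatch(1);
        private volatile Thread runner;
        // What the (imaginary) foreman was told; null means "never notified".
        private volatile State reportedToForeman;

        @Override
        public void run() {
            runner = Thread.currentThread();
            started.countDown();
            try {
                receiver.take(); // no sender exists, so this blocks indefinitely
                state.compareAndSet(State.RUNNING, State.FINISHED);
            } catch (InterruptedException e) {
                // Woken by fail(); fall through to the final report.
            } finally {
                reportedToForeman = state.get();
            }
        }

        // Marks the fragment FAILED and optionally interrupts the blocked thread.
        void fail(boolean interruptRunner) {
            state.set(State.FAILED);
            if (interruptRunner) {
                runner.interrupt();
            }
        }
    }

    // Starts a fragment, fails it, waits up to 500 ms, and returns what was reported.
    static State runAndFail(boolean interruptRunner) {
        SketchFragment fragment = new SketchFragment();
        Thread thread = new Thread(fragment, "frag-sketch");
        thread.setDaemon(true); // a still-blocked worker must not keep the JVM alive
        thread.start();
        try {
            fragment.started.await();
            fragment.fail(interruptRunner);
            thread.join(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return fragment.reportedToForeman;
    }
}
```

In this sketch, runAndFail(true) returns FAILED because the interrupt wakes the blocked take() and the final report runs, while runAndFail(false) returns null: the state was changed, but the blocked thread never gets to report it, which mirrors the foreman waiting forever.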