[ https://issues.apache.org/jira/browse/DRILL-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063080#comment-16063080 ]
Roman commented on DRILL-4595:
------------------------------
I tried to reproduce issues with CTAS queries and hit the following.
Steps:
1) Set DRILL_HEAP and DRILL_MAX_DIRECT_MEMORY to small values in drill-env.sh
{code:bash}
Example:
export DRILL_HEAP=${DRILL_HEAP:-"1G"}
export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"1G"}
{code}
2) Run a long CTAS query
{code:sql}
Example:
CREATE TABLE dfs.tmp.table3 AS SELECT * FROM
dfs.tpcds_sf1_parquet_views.web_sales;
{code}
After that the drillbit fails (the process is killed) with this error:
{code}
Error: CONNECTION ERROR: Connection /192.168.121.7:47697 <-->
node1/192.168.121.7:31010 (user client) closed unexpectedly. Drillbit down?
[Error Id: 3de27393-8f21-4869-acd3-c4a14d01ed44 ] (state=,code=0)
{code}
Information from drillbit.log:
{code}
2017-06-26 13:02:53,062 [26aefa29-490b-e807-d093-548607458d28:frag:1:0] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in FragmentExecutor.
java.lang.OutOfMemoryError: Java heap space
at java.util.AbstractList.iterator(AbstractList.java:288) ~[na:1.8.0_131]
at org.apache.parquet.bytes.BytesInput$SequenceBytesIn.writeAllTo(BytesInput.java:263) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:174) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.collect(ConcatenatingByteArrayCollector.java:33) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:118) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:154) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:115) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:187) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addDouble(MessageColumnIO.java:483) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.drill.exec.store.ParquetOutputRecordWriter$NullableFloat8ParquetConverter.writeField(ParquetOutputRecordWriter.java:970) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.store.EventBasedRecordWriter.write(EventBasedRecordWriter.java:65) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:106) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_131]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_131]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) ~[hadoop-common-2.7.0-mapr-1607.jar:na]
at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
2017-06-26 13:02:53,924 [26aefa29-490b-e807-d093-548607458d28:frag:1:1] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in FragmentExecutor.
java.lang.OutOfMemoryError: Java heap space
at java.util.AbstractList.iterator(AbstractList.java:288) ~[na:1.8.0_131]
at org.apache.parquet.bytes.BytesInput$SequenceBytesIn.writeAllTo(BytesInput.java:263) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.bytes.BytesInput.toByteArray(BytesInput.java:174) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:185) ~[parquet-encoding-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.hadoop.DirectCodecFactory$SnappyCompressor.compress(DirectCodecFactory.java:291) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:94) ~[parquet-hadoop-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:154) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.column.impl.ColumnWriterV1.accountForValueWritten(ColumnWriterV1.java:115) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.column.impl.ColumnWriterV1.write(ColumnWriterV1.java:187) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addDouble(MessageColumnIO.java:483) ~[parquet-column-1.8.1-drill-r0.jar:1.8.1-drill-r0]
at org.apache.drill.exec.store.ParquetOutputRecordWriter$NullableFloat8ParquetConverter.writeField(ParquetOutputRecordWriter.java:970) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.store.EventBasedRecordWriter.write(EventBasedRecordWriter.java:65) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext(WriterRecordBatch.java:106) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:105) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.SingleSenderCreator$SingleSenderRootExec.innerNext(SingleSenderCreator.java:92) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:95) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:234) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_131]
at javax.security.auth.Subject.doAs(Subject.java:422) ~[na:1.8.0_131]
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595) ~[hadoop-common-2.7.0-mapr-1607.jar:na]
at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:227) ~[drill-java-exec-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.11.0-SNAPSHOT.jar:1.11.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
2017-06-26 13:02:54,065 [Drillbit-ShutdownHook#0] INFO o.apache.drill.exec.server.Drillbit - Received shutdown request.
2017-06-26 13:03:01,593 [pool-7-thread-2] INFO o.a.drill.exec.rpc.data.DataServer - closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@4f4a0f30 in 1058 ms
2017-06-26 13:03:01,594 [pool-7-thread-2] INFO o.a.drill.exec.service.ServiceEngine - closed dataPool in 1058 ms
2017-06-26 13:03:03,657 [Drillbit-ShutdownHook#0] WARN o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 2 running fragments.
2017-06-26 13:03:03,657 [Drillbit-ShutdownHook#0] INFO o.a.drill.exec.compile.CodeCompiler - Stats: code gen count: 4, cache miss count: 1, hit rate: 75%
{code}
Also, I see that the table directory was not cleaned up, so it looks like we are left with corrupted (partially written) data:
{code}
hadoop fs -ls /tmp/
Found 1 items
drwxrwxr-x - mapr users 2 2017-06-26 13:02 /tmp/table3
{code}
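A quick way to confirm the leftover files are unusable would be a check like the sketch below (not part of the original report; the class name is illustrative, and it assumes the Hadoop client configuration for the cluster is on the classpath). It relies on the fact that a completely written Parquet file ends with the 4-byte magic {{PAR1}}, which a file abandoned by the killed drillbit should be missing.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sanity check for the leftover CTAS output directory.
public class CheckLeftoverTable {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus status : fs.listStatus(new Path("/tmp/table3"))) {
      boolean complete = false;
      if (status.isFile() && status.getLen() >= 8) {
        byte[] tail = new byte[4];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          // a finished Parquet file ends with the magic bytes "PAR1"
          in.readFully(status.getLen() - 4, tail, 0, 4);
        }
        complete = new String(tail, "US-ASCII").equals("PAR1");
      }
      System.out.println(status.getPath() + " -> " + (complete ? "complete" : "incomplete/corrupt"));
    }
  }
}
{code}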
After the drillbit was restarted there was no information about this query in the UI. I used a single-node drillbit cluster and Drill built from commit a7e298760f9c9efa.
> FragmentExecutor.fail() should interrupt the fragment thread to avoid possible query hangs
> ------------------------------------------------------------------------------------------
>
> Key: DRILL-4595
> URL: https://issues.apache.org/jira/browse/DRILL-4595
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.4.0
> Reporter: Deneche A. Hakim
> Assignee: Deneche A. Hakim
> Fix For: Future
>
>
> When a fragment fails, it's assumed it will be able to close itself and send
> its FAILED state to the foreman, which will cancel any running fragments.
> FragmentExecutor.cancel() will interrupt the thread, making sure those
> fragments don't stay blocked.
> However, if a fragment is already blocked when its fail method is called, the
> foreman may never be notified about this and the query will hang forever. One
> such scenario is the following:
> - generally it's a CTAS running on a large cluster (lots of writers running
> in parallel)
> - logs show that the user channel was closed and UserServer caused the root
> fragment to move to a FAILED state
> - jstack shows that the root fragment is blocked in its receiver waiting for
> data
> - jstack also shows that ALL other fragments are no longer running, and the
> logs show that all of them succeeded
> - the foreman waits *forever* for the root fragment to finish
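For reference, a minimal, self-contained sketch of the behavior the issue title asks for. This is not Drill's actual FragmentExecutor and every name below is illustrative; it only shows the pattern of fail() interrupting the thread running the fragment, the same way cancel() does, so a fragment blocked in a receiver wakes up and can report its FAILED state to the foreman instead of hanging.
{code:java}
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only -- not org.apache.drill.exec.work.fragment.FragmentExecutor.
public class FragmentExecutorSketch implements Runnable {
  private final AtomicReference<Thread> myThread = new AtomicReference<>();
  private volatile Throwable failureCause;

  @Override
  public void run() {
    myThread.set(Thread.currentThread());
    try {
      doWork();                     // may block waiting for incoming batches
      reportState("FINISHED");
    } catch (InterruptedException e) {
      // woken up by cancel() or fail(); report the correct terminal state
      reportState(failureCause == null ? "CANCELLED" : "FAILED");
    } finally {
      myThread.set(null);
    }
  }

  /** Called when the fragment must stop because of an error elsewhere (e.g. user channel closed). */
  public void fail(Throwable cause) {
    failureCause = cause;
    interruptRunningThread();       // the behavior DRILL-4595 asks for
  }

  /** Cancellation already interrupts the running thread. */
  public void cancel() {
    interruptRunningThread();
  }

  private void interruptRunningThread() {
    Thread t = myThread.get();
    if (t != null) {
      t.interrupt();
    }
  }

  private void doWork() throws InterruptedException {
    Thread.sleep(Long.MAX_VALUE);   // stand-in for blocking on a data receiver
  }

  private void reportState(String state) {
    System.out.println("fragment state: " + state);
  }
}
{code}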