[ https://issues.apache.org/jira/browse/DRILL-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sreeparna Bhabani updated DRILL-7736:
-------------------------------------
Description:
I am facing an issue while creating a Parquet file in Drill from another Parquet file.

*Summary-*
I am rewriting one Parquet file from another Parquet file using CTAS. The source Parquet file was generated from Python. When I try to rewrite the Parquet file, I get an error. The details of the error are given below.

*Version of Apache Drill-*
1.17

*Memory config-*
DRILL_HEAP=16G
DRILL_MAX_DIRECT_MEMORY=32G

*A few configs are mentioned here for information-*
store.parquet.reader.pagereader.async=true;
store.parquet.reader.pagereader.bufferedread=false;
planner.memory.max_query_memory_per_node=31147483648
drill.exec.memory.operator.output_batch_size=4194304

*Details of volume-*
The number of rows for which I am trying to run CTAS is 25245241; the number of columns is 145.
FYI - I am able to create the Parquet file using CTAS for a smaller number of rows.

*CTAS script-*
CREATE TABLE dfs.root.<Table_name>
PARTITION BY (<Column1>,<Column2>,<Column3>)
AS SELECT *
FROM dfs.root.<source_parquet>;

*Error Log-*
2020-05-07 xx:xx:xx,504 [scan-4] INFO o.a.d.e.s.p.c.AsyncPageReader - User Error Occurred: Exception occurred while reading from disk. (can not read class org.apache.parquet.format.PageHeader: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Exception occurred while reading from disk.
File: <xxx>.parquet
Column: <xxx>
Row Group Start: 25545832
[Error Id: 4157803d-a37e-4693-bc1a-b654807222ed ]
	at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:637)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.handleAndThrowException(AsyncPageReader.java:190)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.access$700(AsyncPageReader.java:84)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:480)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:394)
	at org.apache.drill.exec.util.concurrent.ExecutorServiceUtil$CallableTaskWrapper.call(ExecutorServiceUtil.java:85)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.
	at org.apache.parquet.format.Util.read(Util.java:232)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:81)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:437)
	... 6 common frames omitted
Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.
	at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
	at shaded.parquet.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:634)
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:539)
	at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:973)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
	at org.apache.parquet.format.Util.read(Util.java:229)
	... 8 common frames omitted
Caused by: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.
	at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:910)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:862)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:841)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
	... 16 common frames omitted

was:
I am facing an issue while creating a Parquet file in Drill from another Parquet file. The Parquet file from which I am sourcing was created from Python. When I try to rewrite the Parquet file using CTAS, I get an error. The error is:

INFO o.a.d.e.s.p.c.AsyncPageReader - User Error Occurred: Exception occurred while reading from disk. (can not read class org.apache.parquet.format.PageHeader: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Exception occurred while reading from disk.

I have modified the settings below, but it did not help.
store.parquet.reader.pagereader.async=true;
store.parquet.reader.pagereader.bufferedread=false;
planner.memory.max_query_memory_per_node=31147483648
drill.exec.memory.operator.output_batch_size=4194304

The number of rows for which I am trying to run CTAS is 25245241; the number of columns is 145.
FYI - I am able to create the Parquet file using CTAS for a smaller number of rows.
Version I am using - 1.17
DRILL_HEAP=16G
DRILL_MAX_DIRECT_MEMORY=32G

CTAS script-
CREATE TABLE dfs.root.<Table_name>
PARTITION BY (<Column1>,<Column2>,<Column3>)
AS SELECT *
FROM dfs.root.<source_parquet>


> Error while reading from Parquet : DATA_READ ERROR: Exception occurred while reading from disk
> ----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-7736
>                 URL: https://issues.apache.org/jira/browse/DRILL-7736
>             Project: Apache Drill
>          Issue Type: Test
>          Components: Functions - Drill
>    Affects Versions: 1.17.0
>            Reporter: Sreeparna Bhabani
>            Priority: Blocker
>
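As an aside for anyone reproducing this: the four options listed in the report are Drill session/system options, and a sketch of applying them from sqlline follows. Option names and values are copied verbatim from this report; treating `drill.exec.memory.operator.output_batch_size` as system-scoped (hence ALTER SYSTEM) is my assumption, not something the report states.

```sql
-- Sketch: applying the options quoted in this report (values verbatim from the report).
ALTER SESSION SET `store.parquet.reader.pagereader.async` = true;
ALTER SESSION SET `store.parquet.reader.pagereader.bufferedread` = false;
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 31147483648;
-- Assumed to be system-scoped in Drill 1.17:
ALTER SYSTEM SET `drill.exec.memory.operator.output_batch_size` = 4194304;
```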
--
This message was sent by Atlassian Jira
(v8.3.4#803005)