[ https://issues.apache.org/jira/browse/DRILL-7736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sreeparna Bhabani updated DRILL-7736:
-------------------------------------
Description:
I am facing an issue while creating a Parquet file in Drill from another Parquet file.

*Summary-*
I am rewriting one Parquet file from another Parquet file using CTAS. The source Parquet file was generated from Python. When I try to rewrite the Parquet file, I get an error. The details of the error are given below.

*Version of Apache Drill-*
1.17

*Memory config-*
DRILL_HEAP=16G
DRILL_MAX_DIRECT_MEMORY=32G

*A few configs are mentioned here for information-*
store.parquet.reader.pagereader.async=true;
store.parquet.reader.pagereader.bufferedread=false;
planner.memory.max_query_memory_per_node=31147483648
drill.exec.memory.operator.output_batch_size=4194304

*Details of volume-*
The number of rows for which I am trying to run CTAS is 25245241; the number of columns is 145.
FYI - I am able to create the Parquet file using CTAS for a smaller number of rows.

*CTAS script-*
CREATE TABLE dfs.root.<Table_name>
PARTITION BY (<Column1>,<Column2>,<Column3>)
AS SELECT *
FROM dfs.root.<source_parquet>;

*Error Log-*
2020-05-07 xx:xx:xx,504 [scan-4] INFO o.a.d.e.s.p.c.AsyncPageReader - User Error Occurred: Exception occurred while reading from disk. (can not read class org.apache.parquet.format.PageHeader: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Exception occurred while reading from disk.
File: <xxx>.parquet
Column: <xxx>
Row Group Start: 25545832
[Error Id: 4157803d-a37e-4693-bc1a-b654807222ed ]
	at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:637)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.handleAndThrowException(AsyncPageReader.java:190)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader.access$700(AsyncPageReader.java:84)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:480)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:394)
	at org.apache.drill.exec.util.concurrent.ExecutorServiceUtil$CallableTaskWrapper.call(ExecutorServiceUtil.java:85)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.
	at org.apache.parquet.format.Util.read(Util.java:232)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:81)
	at org.apache.drill.exec.store.parquet.columnreaders.AsyncPageReader$AsyncPageReaderTask.call(AsyncPageReader.java:437)
	... 6 common frames omitted
Caused by: shaded.parquet.org.apache.thrift.transport.TTransportException: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.
	at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
	at shaded.parquet.org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:634)
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:539)
	at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:973)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:966)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:843)
	at org.apache.parquet.format.Util.read(Util.java:229)
	... 8 common frames omitted
Caused by: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.
	at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:910)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:862)
	at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:841)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:567)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:757)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:829)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at shaded.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
	... 16 common frames omitted

was:
I am facing an issue while creating a Parquet file in Drill from another Parquet file. The Parquet file from which I am sourcing was created from Python. When I try to rewrite the Parquet file using CTAS, I get an error. The error is:

INFO o.a.d.e.s.p.c.AsyncPageReader - User Error Occurred: Exception occurred while reading from disk. (can not read class org.apache.parquet.format.PageHeader: java.io.InterruptedIOException: Interrupted while choosing DataNode for read.)
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Exception occurred while reading from disk.

I have modified the settings below, but it did not help.
store.parquet.reader.pagereader.async=true;
store.parquet.reader.pagereader.bufferedread=false;
planner.memory.max_query_memory_per_node=31147483648
drill.exec.memory.operator.output_batch_size=4194304

The number of rows for which I am trying to run CTAS is 25245241; the number of columns is 145.
FYI - I am able to create the Parquet file using CTAS for a smaller number of rows.
Version I am using - 1.17
DRILL_HEAP=16G
DRILL_MAX_DIRECT_MEMORY=32G

CTAS script-
CREATE TABLE dfs.root.<Table_name>
PARTITION BY (<Column1>,<Column2>,<Column3>)
AS SELECT *
FROM dfs.root.<source_parquet>


> Error while reading from Parquet : DATA_READ ERROR: Exception occurred while reading from disk
> ----------------------------------------------------------------------------------------------
>
>                 Key: DRILL-7736
>                 URL: https://issues.apache.org/jira/browse/DRILL-7736
>             Project: Apache Drill
>          Issue Type: Test
>          Components: Functions - Drill
>    Affects Versions: 1.17.0
>            Reporter: Sreeparna Bhabani
>            Priority: Blocker
>
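As an aside for anyone reproducing this: the four options listed in the report are Drill session/system options, and a sketch of applying them from sqlline follows. Option names and values are copied verbatim from this report; treating `drill.exec.memory.operator.output_batch_size` as system-scoped (hence ALTER SYSTEM) is my assumption, not something the report states.

```sql
-- Sketch: applying the options quoted in this report (values verbatim from the report).
ALTER SESSION SET `store.parquet.reader.pagereader.async` = true;
ALTER SESSION SET `store.parquet.reader.pagereader.bufferedread` = false;
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 31147483648;
-- Assumed to be system-scoped in Drill 1.17:
ALTER SYSTEM SET `drill.exec.memory.operator.output_batch_size` = 4194304;
```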
--
This message was sent by Atlassian Jira
(v8.3.4#803005)