[ https://issues.apache.org/jira/browse/PARQUET-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marios Meimaris updated PARQUET-2060:
-------------------------------------
    Description: 
I am attaching a valid and a corrupt Parquet file (DataPageV2) that differ by a 
single byte.

We hit an infinite loop when trying to read the corrupt file in 
[https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java#L698],
specifically in the `page.getData().toInputStream()` call.
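For reference, a minimal way to drive this read path (a sketch, assuming the attached file sits on the local filesystem and using parquet-mr's example `GroupReadSupport`; with the corrupt file, `reader.read()` never returns):

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

public class ReadCorruptFile {
  public static void main(String[] args) throws Exception {
    // Read the attached corrupt file; the one-byte corruption makes this
    // loop hang inside DataInputStream.readFully (see stack trace below).
    try (ParquetReader<Group> reader = ParquetReader
        .builder(new GroupReadSupport(), new Path("datapage_v2_snappy.parquet1383"))
        .build()) {
      Group group;
      while ((group = reader.read()) != null) {
        System.out.println(group);
      }
    }
  }
}
```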

Stack trace of infinite loop:

java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)
org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:287)
org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:698)
org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:57)
org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:628)
org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192)
org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)

Underneath, the call to `readFully` goes through 
`NonBlockedDecompressorStream`, which always hits this path: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressorStream.java#L45].
As a result, `setInput` is never called on the decompressor, and the 
subsequent calls to `decompress` always hit this condition: 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyDecompressor.java#L54].
The read method therefore keeps returning 0, which causes an infinite loop in 
[https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/io/DataInputStream.java#L198]
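The JDK side is easy to reproduce in isolation: `readFully` treats a return value of 0 as "no progress yet" and retries, so any `InputStream` whose bulk `read` is pinned at 0 hangs it forever. A minimal, self-contained illustration (not parquet-mr code, just the `DataInputStream` contract):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyHang {
  public static void main(String[] args) throws IOException {
    // Mimics the exhausted decompressor stream: bulk reads keep returning 0
    // ("no bytes right now") instead of -1 (EOF).
    InputStream zeroReturning = new InputStream() {
      @Override
      public int read() {
        return 0; // unused here; readFully goes through the bulk read below
      }

      @Override
      public int read(byte[] b, int off, int len) {
        return 0; // never makes progress, never signals EOF
      }
    };

    byte[] buf = new byte[6];
    // readFully only exits its loop on count < 0 (EOFException) or once
    // len bytes have arrived; a steady 0 satisfies neither, so this call
    // never returns.
    new DataInputStream(zeroReturning).readFully(buf);
  }
}
```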
 
This originates from the corruption, which makes the data page's input stream 
zero-length, so `getCompressedData` always returns -1. 

I am wondering whether this can be caught earlier so that the read fails fast 
on such corruption. 

Since this happens in `BytesInput.toInputStream`, I don't think it's only 
relevant to DataPageV2. 

 

In 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111],
if we call `bytes.toByteArray` and log its length, it is 0 for the corrupt 
file and 6 for the valid file. 

A potential fix is to check the array size there and fail early, but I am not 
sure whether a zero-length byte array can ever occur in valid files.
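For illustration only, a sketch of what such a guard could look like (the class name, method name, and message are assumptions, not parquet-mr code; the real check would live in `CodecFactory`'s decompress path):

```java
import java.io.IOException;
import org.apache.parquet.bytes.BytesInput;

final class PageSanityChecks {
  private PageSanityChecks() {}

  // Hypothetical guard to run before handing page bytes to the codec.
  static byte[] requireNonEmptyCompressedData(BytesInput bytes, int uncompressedSize)
      throws IOException {
    byte[] compressed = bytes.toByteArray();
    if (uncompressedSize > 0 && compressed.length == 0) {
      // Fail fast rather than let an empty stream spin readFully forever.
      throw new IOException("Corrupt page: expected " + uncompressedSize
          + " uncompressed bytes but compressed input is empty");
    }
    return compressed;
  }
}
```

This would turn the hang into an immediate `IOException`, subject to the caveat above about whether zero-length arrays are ever legitimate.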

 

Attached:

Valid file: `datapage_v2_snappy.parquet`

Corrupt file: `datapage_v2_snappy.parquet1383`


> Parquet corruption can cause infinite loop with Snappy
> ------------------------------------------------------
>
>                 Key: PARQUET-2060
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2060
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Marios Meimaris
>            Priority: Major
>         Attachments: datapage_v2.snappy.parquet, 
> datapage_v2.snappy.parquet1383



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
