[ 
https://issues.apache.org/jira/browse/PARQUET-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368819#comment-17368819
 ] 

Gabor Szadovszky commented on PARQUET-2060:
-------------------------------------------

[~mmeimaris], what do you think about simply returning the zero-length 
BytesInput object (just as we do when the codec is null)? That way we 
would catch the error in the same place when the data stream is empty. (We 
should handle this case for uncompressed data as well.)
Are you willing to open a PR for this? I'm happy to help/review.
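
A minimal sketch of the suggestion above, not the parquet-mr code: it assumes a plain `byte[]` in place of `BytesInput` and a `UnaryOperator<byte[]>` in place of the codec, and the class and method names are hypothetical. It only shows the control flow, where empty input is returned as-is before any decompressor is involved, the same way uncompressed data passes through.

```java
import java.util.function.UnaryOperator;

class EmptyInputGuard {
    /**
     * Hypothetical helper: byte[] stands in for BytesInput and a
     * UnaryOperator for the codec; neither is the parquet-mr API.
     */
    static byte[] decompress(byte[] compressed, UnaryOperator<byte[]> codec) {
        if (codec == null) {
            return compressed; // uncompressed data passes through unchanged
        }
        if (compressed.length == 0) {
            // Nothing to decode: hand back the empty input instead of
            // starting a decompressor that would never receive any bytes.
            return compressed;
        }
        return codec.apply(compressed);
    }

    public static void main(String[] args) {
        UnaryOperator<byte[]> mustNotRun = in -> {
            throw new IllegalStateException("codec invoked on empty input");
        };
        System.out.println(decompress(new byte[0], mustNotRun).length);
        System.out.println(decompress(new byte[] {1, 2}, null).length);
    }
}
```

With this shape, a caller that expects a non-empty page sees the same zero-length result for compressed and uncompressed data, so the error can be reported in one place.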

> Parquet corruption can cause infinite loop with Snappy
> ------------------------------------------------------
>
>                 Key: PARQUET-2060
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2060
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Marios Meimaris
>            Priority: Major
>         Attachments: datapage_v2.snappy.parquet, 
> datapage_v2.snappy.parquet1383
>
>
> I am attaching a valid and a corrupt Parquet file (DataPageV2) that differ in 
> one byte.
> We hit an infinite loop when trying to read the corrupt file in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java#L698]
>  and specifically in the `page.getData().toInputStream()` call.  
> Stack trace of infinite loop:
> java.io.DataInputStream.readFully(DataInputStream.java:195)
> java.io.DataInputStream.readFully(DataInputStream.java:169)
> org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:287)
> org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
> org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
> org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:698)
> org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:57)
> org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:628)
> org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
> org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192)
> org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
> org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
>  
> Under the hood, the call to `readFully` goes through 
> `NonBlockedDecompressorStream`, which will always hit this path: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressorStream.java#L45].
>  As a result, `setInput` is never called on the decompressor, and 
> subsequent calls to `decompress` always hit this condition: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyDecompressor.java#L54].
>  The read method therefore returns 0, which causes 
> an infinite loop in 
> [https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/io/DataInputStream.java#L198].
>  
>  This originates from the corruption, which causes the input stream of the 
> data page to be of size 0, which makes `getCompressedData` always return -1. 
> I am wondering whether this can be caught earlier so that the read fails in 
> case of such corruptions. 
> Since this happens in `BytesInput.toInputStream`, I don't think it's only 
> relevant to DataPageV2. 
>  
> In 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111],
>  if we call `bytes.toByteArray` and log its length, it is 0 in the case of 
> the corrupt file, and 6 in the case of the valid file. 
> A potential fix is to check the array size there and fail early, but I am not 
> sure if a zero-length byte array can ever be expected in the case of valid 
> files.
>  
> Attached:
> Valid file: `datapage_v2.snappy.parquet`
> Corrupt file: `datapage_v2.snappy.parquet1383`
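
The loop described in the report can be reproduced in isolation. The sketch below is illustrative only: `StalledStream` is a hypothetical stand-in for a decompressor stream that never received input, and `readFully` here is a hand-written loop with the same shape as `DataInputStream.readFully` plus an assumed cap on consecutive zero-byte reads. It is not a proposed patch for parquet-mr.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class StalledReadDemo {
    /** Hypothetical stream whose bulk read returns 0: no bytes and no EOF. */
    static class StalledStream extends InputStream {
        int calls = 0; // number of bulk reads attempted

        @Override
        public int read() {
            return -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            calls++;
            return 0; // violates the InputStream contract when len > 0
        }
    }

    /**
     * Same loop shape as DataInputStream.readFully, plus a cap on
     * consecutive zero-byte reads so a stalled stream fails instead of spinning.
     */
    static void readFully(InputStream in, byte[] buf, int maxStalls) throws IOException {
        int n = 0;
        int stalls = 0;
        while (n < buf.length) {
            int count = in.read(buf, n, buf.length - n);
            if (count < 0) {
                throw new EOFException("stream ended after " + n + " bytes");
            }
            if (count == 0) {
                // Without this check, n never advances and the loop never ends.
                if (++stalls >= maxStalls) {
                    throw new IOException("no progress after " + stalls + " reads");
                }
            } else {
                stalls = 0;
                n += count;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = new byte[3];
        readFully(new ByteArrayInputStream(new byte[] {1, 2, 3}), ok, 3);
        try {
            readFully(new StalledStream(), new byte[6], 3);
        } catch (IOException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

The JDK loop only exits on a negative return value or once enough bytes have arrived, so a stream that keeps returning 0 has to be rejected either by the stream itself or before the read starts.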



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
