[
https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689782#comment-17689782
]
ASF GitHub Bot commented on PARQUET-2247:
-----------------------------------------
cxzl25 commented on code in PR #1031:
URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1108530943
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java:
##########
@@ -160,7 +160,7 @@ public void writePage(BytesInput bytes,
Encoding valuesEncoding) throws IOException {
pageOrdinal++;
long uncompressedSize = bytes.size();
- if (uncompressedSize > Integer.MAX_VALUE) {
+ if (uncompressedSize > Integer.MAX_VALUE || uncompressedSize < 0) {
Review Comment:
Using Spark 3.2.0 with Parquet 1.12.1 to write a Parquet file with Gzip
compression, the write succeeds but the read fails with the following exception:
```java
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
    ... 18 more
Caused by: java.lang.NegativeArraySizeException
    at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:285)
    at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
    at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
    at org.apache.parquet.column.impl.ColumnReaderBase.readPageV1(ColumnReaderBase.java:680)
    at org.apache.parquet.column.impl.ColumnReaderBase.access$300(ColumnReaderBase.java:57)
    at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:623)
    at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
    at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
    at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
    at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
    at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:735)
    at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
    at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
    at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
```
It turns out that `total_uncompressed_size` is written as a negative number,
because `CapacityByteArrayOutputStream#size` may return a negative value
(a minimal sketch of the overflow follows the metadata dump below).
```
ColumnChunk(file_offset:4339481930,
meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[RLE, BIT_PACKED, PLAIN],
path_in_schema:[XX], codec:GZIP, num_values:7277,
total_uncompressed_size:-1953770507, total_compressed_size:1105719623,
data_page_offset:4339481930, statistics:Statistics(),
encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN,
count:5)]), offset_index_offset:5445600168, offset_index_length:78),
```
> Fail-fast if CapacityByteArrayOutputStream write overflow
> ---------------------------------------------------------
>
> Key: PARQUET-2247
> URL: https://issues.apache.org/jira/browse/PARQUET-2247
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Reporter: dzcxzl
> Priority: Critical
>
> The bytesUsed counter of CapacityByteArrayOutputStream may overflow when writing
> large amounts of byte data, resulting in Parquet file write corruption.
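For completeness, a minimal sketch of the kind of fail-fast guard this issue asks for; only the overflow condition mirrors the diff in the comment above, while the class name, method, and exception message are illustrative assumptions:
```java
import java.io.IOException;

/** Minimal sketch of a fail-fast page-size check; not the actual ColumnChunkPageWriteStore code. */
public class PageSizeGuardSketch {
  static void checkPageSize(long uncompressedSize) throws IOException {
    // Reject both oversized and already-negative sizes before they can reach the file metadata;
    // the condition mirrors the one added in the diff above.
    if (uncompressedSize > Integer.MAX_VALUE || uncompressedSize < 0) {
      throw new IOException("Cannot write page with invalid uncompressed size: " + uncompressedSize);
    }
  }

  public static void main(String[] args) throws IOException {
    checkPageSize(1024);            // fine
    checkPageSize(-1953770507L);    // the corrupted value from the metadata dump above -> throws
  }
}
```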
--
This message was sent by Atlassian Jira
(v8.20.10#820010)