[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689782#comment-17689782 ]
ASF GitHub Bot commented on PARQUET-2247:
-----------------------------------------

cxzl25 commented on code in PR #1031:
URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1108530943

##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java:
##########

```diff
@@ -160,7 +160,7 @@ public void writePage(BytesInput bytes, Encoding valuesEncoding)
       throws IOException {
     pageOrdinal++;
     long uncompressedSize = bytes.size();
-    if (uncompressedSize > Integer.MAX_VALUE) {
+    if (uncompressedSize > Integer.MAX_VALUE || uncompressedSize < 0) {
```

Review Comment:
Using Spark 3.2.0 with Parquet 1.12.1 to write a Parquet file with GZIP compression, the write succeeds but a subsequent read fails with the following exception:

```java
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:254)
  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
  ... 18 more
Caused by: java.lang.NegativeArraySizeException
  at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:285)
  at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
  at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
  at org.apache.parquet.column.impl.ColumnReaderBase.readPageV1(ColumnReaderBase.java:680)
  at org.apache.parquet.column.impl.ColumnReaderBase.access$300(ColumnReaderBase.java:57)
  at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:623)
  at org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
  at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:120)
  at org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
  at org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
  at org.apache.parquet.column.impl.ColumnReaderBase.consume(ColumnReaderBase.java:735)
  at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:30)
  at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:47)
  at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:136)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:225)
```

It turns out that `total_uncompressed_size` was written as a negative number, because `CapacityByteArrayOutputStream#size` can return a negative value once its internal byte counter overflows.
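To make the failure mode concrete, here is a minimal self-contained sketch (class and variable names are hypothetical, not parquet-mr internals) of how an `int` byte counter wraps past `Integer.MAX_VALUE` into a negative value, and how the guard added in this PR would fail fast on it:

```java
// Minimal sketch (hypothetical names, not parquet-mr code): an int byte
// counter wraps past Integer.MAX_VALUE into a negative value, and the
// guard from this PR fails fast instead of writing a corrupt size into
// the file footer.
public class OverflowSketch {
  public static void main(String[] args) {
    int bytesUsed = 0;                 // accumulated byte count kept as an int
    int chunk = 1_500_000_000;         // two ~1.5 GB writes exceed Integer.MAX_VALUE

    bytesUsed += chunk;                // 1_500_000_000, still fine
    bytesUsed += chunk;                // wraps: 3_000_000_000 - 2^32 = -1_294_967_296
    System.out.println(bytesUsed);     // prints -1294967296

    long uncompressedSize = bytesUsed; // widened back to long, still negative
    // Guard from PR #1031 (exception message here is illustrative):
    if (uncompressedSize > Integer.MAX_VALUE || uncompressedSize < 0) {
      throw new IllegalStateException(
          "Cannot write page: invalid uncompressed size " + uncompressedSize);
    }
  }
}
```

Because the overflowed counter is widened back to `long` before the check, testing only `uncompressedSize > Integer.MAX_VALUE` misses the negative case, which is why the PR adds `|| uncompressedSize < 0`.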
The negative value is visible in the corrupted column chunk metadata read back from the file footer:

```
ColumnChunk(file_offset:4339481930,
  meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[RLE, BIT_PACKED, PLAIN],
    path_in_schema:[XX], codec:GZIP, num_values:7277,
    total_uncompressed_size:-1953770507, total_compressed_size:1105719623,
    data_page_offset:4339481930, statistics:Statistics(),
    encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:5)]),
  offset_index_offset:5445600168, offset_index_length:78)
```

> Fail-fast if CapacityByteArrayOutputStream write overflow
> ----------------------------------------------------------
>
>                 Key: PARQUET-2247
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2247
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: dzcxzl
>            Priority: Critical
>
> The `bytesUsed` counter of `CapacityByteArrayOutputStream` may overflow when
> writing large amounts of byte data, resulting in Parquet file write corruption.
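The stack trace above bottoms out in a byte array allocation. A minimal sketch of that read-side failure mode (hypothetical code, not the actual parquet-mr path in `BytesInput.toByteArray`):

```java
// Sketch of the read-side failure (hypothetical, not parquet-mr code): a
// negative size recorded in the file footer ends up sizing a byte[]
// allocation, which throws the NegativeArraySizeException seen above.
public class ReadFailureSketch {
  public static void main(String[] args) {
    // total_uncompressed_size as recorded in the corrupted footer above
    long totalUncompressedSize = -1953770507L;

    // Narrowing to int keeps the negative value, and the allocation throws
    // java.lang.NegativeArraySizeException: -1953770507
    byte[] buffer = new byte[(int) totalUncompressedSize];
    System.out.println(buffer.length); // never reached
  }
}
```

This is why failing fast at write time matters: once the negative size is in the footer, every reader of the file hits this exception.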