[ 
https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344499#comment-17344499
 ] 

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-841139504


   @advancedxy, thanks for explaining.
   I think, the best option is 2. It is up to the user to provide enough 
resources for handling the large row groups or not writing them.
   Meanwhile, even though I've written I can accept lacking of unit tests in 
some situations my concern in this case is I am not sure that every aspect of a 
large row group is handled properly. So, we clearly need to validate this fix 
with such large row groups. This test can be even implemented in this source 
code but we must not include it in the unit tests or integration tests we run 
regularly. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> ----------------------------------------------------------
>
>                 Key: PARQUET-1633
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1633
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Ivan Sadikov
>            Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following 
> exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
> ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
> at java.util.ArrayList.<init>(ArrayList.java:157)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
>  
> The file metadata is:
>  * block 1 (3 columns)
>  ** rowCount: 110,100
>  ** totalByteSize: 348,492,072
>  ** compressedSize: 165,689,649
>  * block 2 (3 columns)
>  ** rowCount: 90,054
>  ** totalByteSize: 3,243,165,541
>  ** compressedSize: 2,509,579,966
>  * block 3 (3 columns)
>  ** rowCount: 105,119
>  ** totalByteSize: 350,901,693
>  ** compressedSize: 144,952,177
>  * block 4 (3 columns)
>  ** rowCount: 48,741
>  ** totalByteSize: 1,275,995
>  ** compressedSize: 914,205
> I don't have the code to reproduce the issue, unfortunately; however, I 
> looked at the code and it seems that integer {{length}} field in 
> ConsecutiveChunkList overflows, which results in negative capacity for array 
> list in {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
>       
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);{code}
>  
> This is caused by cast to integer in {{readNextRowGroup}} method in 
> ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, 
> (int)mc.getTotalSize()));
> {code}
> which overflows when total size of the column is larger than 
> Integer.MAX_VALUE.
> I would appreciate if you could help addressing the issue. Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to