[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362791#comment-17362791 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-859662157

@gszadovszky Thanks - just created Jira account `edw_vtxa`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> ----------------------------------------------------------
>
> Key: PARQUET-1633
> URL: https://issues.apache.org/jira/browse/PARQUET-1633
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Ivan Sadikov
> Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
>   ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
>   at java.util.ArrayList.<init>(ArrayList.java:157)
>   at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
>
> The file metadata is:
> * block 1 (3 columns)
> ** rowCount: 110,100
> ** totalByteSize: 348,492,072
> ** compressedSize: 165,689,649
> * block 2 (3 columns)
> ** rowCount: 90,054
> ** totalByteSize: 3,243,165,541
> ** compressedSize: 2,509,579,966
> * block 3 (3 columns)
> ** rowCount: 105,119
> ** totalByteSize: 350,901,693
> ** compressedSize: 144,952,177
> * block 4 (3 columns)
> ** rowCount: 48,741
> ** totalByteSize: 1,275,995
> ** compressedSize: 914,205
>
> I don't have the code to reproduce the issue, unfortunately; however, I looked at the code, and it seems that the integer {{length}} field in ConsecutiveChunkList overflows, which results in a negative capacity for the array list in the {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
>
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);
> {code}
>
> This is caused by the cast to integer in the {{readNextRowGroup}} method of ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, (int) mc.getTotalSize()));
> {code}
> which overflows when the total size of the column is larger than Integer.MAX_VALUE.
>
> I would appreciate it if you could help address this issue. Thanks!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
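The arithmetic the reporter describes can be reproduced in isolation. A minimal sketch, assuming parquet-mr's default max allocation size of 8 MiB (an assumption, not stated in the report): feeding block 2's compressedSize through the narrowing cast and the allocation-count computation yields exactly the -212 capacity seen in the stack trace.

```java
// Minimal reproduction of the overflow described above.
public class OverflowDemo {
    public static void main(String[] args) {
        // compressedSize of block 2 from the reported metadata:
        // larger than Integer.MAX_VALUE (2,147,483,647).
        long totalSize = 2_509_579_966L;

        // The narrowing cast in readNextRowGroup keeps only the low
        // 32 bits, producing a negative chunk length.
        int length = (int) totalSize;
        System.out.println(length);          // -1785387330

        // readAll then derives a negative ArrayList capacity from it.
        int maxAllocationSize = 8 * 1024 * 1024; // assumed default (8 MiB)
        int fullAllocations = length / maxAllocationSize;
        int lastAllocationSize = length % maxAllocationSize;
        int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
        System.out.println(numAllocations);  // -212, matching "Illegal Capacity: -212"

        // Keeping the size as a long end-to-end avoids the truncation.
    }
}
```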
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362767#comment-17362767 ]

ASF GitHub Bot commented on PARQUET-1633:

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-859381992
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362758#comment-17362758 ]

ASF GitHub Bot commented on PARQUET-1633:

gszadovszky merged pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361096#comment-17361096 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858802880

Note I don't have write access, so I guess someone needs to merge this at some point? Not sure of your workflow.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360984#comment-17360984 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858684740

Pleasure @gszadovszky - thanks for yours!
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360979#comment-17360979 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858684740

Pleasure @gszadovszky
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360938#comment-17360938 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858656793

@gszadovszky done
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360832#comment-17360832 ]

ASF GitHub Bot commented on PARQUET-1633:

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858568042

@eadwright, sorry, you're right. This is not tightly related to your PR. Please remove the try-catch blocks for OOE and put an `@Ignore` annotation on the test class for now. I'll open a separate Jira to solve the issue of such "not-to-be-tested-by-CI" tests.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360705#comment-17360705 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858483789

Interesting options @gszadovszky, I have no strong opinion. I'd just like this fix merged once everyone is happy :)
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360695#comment-17360695 ] ASF GitHub Bot commented on PARQUET-1633: - gszadovszky commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858478591 @eadwright, I understand your concerns I don't really like it either. Meanwhile, I don't feel good having a test that is not executed automatically. Without regular executions there is no guarantee that this test would be executed ever again and even if someone would execute it it might fail because of the lack of maintenance. What do you think about the following options? @shangxinli, I'm also curious about your ideas. * Execute this test separately with a maven profile. I am not sure if the CI allows allocating such large memory but with Xmx options we might give a try and create a separate check for this test only. * Similar to the previous with the profile but not executing in the CI ever. Instead, we add some comments to the release doc so this test will be executed at least once per release. * Configuring the CI profile to skip this test but have it in the normal scenario meaning the devs will execute it locally. There are a couple of cons though. There is no guarantee that devs executes all the tests including this one. It also can cause issues if the dev doesn't have enough memory and don't know that the test failure is not related to the current change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
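The "skip locally when the machine is too small" idea in the options above can be sketched in plain Java by gating the heavy test on the JVM's heap ceiling. This is an illustration only: the 4 GB threshold and the class/method names are assumptions, not values from the PR (in JUnit the skip would be expressed with an `Assume`-style check rather than an early return).

```java
// Sketch: gate a memory-hungry test on the configured heap so it is
// skipped, not failed, on small machines. Threshold is an assumption.
public class LargeHeapGuard {

    static final long REQUIRED_HEAP_BYTES = 4L * 1024 * 1024 * 1024; // ~4 GB, assumed

    // Runtime.maxMemory() reports the -Xmx ceiling the JVM may grow to.
    static boolean enoughHeap(long maxMemory) {
        return maxMemory >= REQUIRED_HEAP_BYTES;
    }

    public static void main(String[] args) {
        if (!enoughHeap(Runtime.getRuntime().maxMemory())) {
            System.out.println("SKIP: heap below " + REQUIRED_HEAP_BYTES + " bytes");
            return; // in JUnit: an assumption failure, so the run stays green
        }
        // ... run the >2GB row-group read-back test here ...
    }
}
```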
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360686#comment-17360686 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright edited a comment on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858467001 @gszadovszky Sorry for the delay on my part; I have merged your changes in. Even in a testing pipeline I am uncomfortable with catching any kind of `java.lang.Error`, especially an OOM. Should we remove those catch clauses and mark this test as ignored by default, letting it be run manually? Love how the tests need ~3GB, not 10GB.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360679#comment-17360679 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-858467001 @gszadovszky Sorry for the delay on my part; I have merged your changes in. Even in a testing pipeline I am uncomfortable with catching any kind of `java.lang.Error`, especially an OOM. Should we remove those catch clauses and mark this test as ignored by default, letting it be run manually?
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355329#comment-17355329 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-852434740 @gszadovszky I had a look at your changes. I feel uncomfortable relying on any behaviour at all after an OOM error. Are you sure this is the right approach?
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354286#comment-17354286 ] ASF GitHub Bot commented on PARQUET-1633: - gszadovszky commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-851298938 @eadwright, I've made some changes in the unit test (no more TODOs). See the update [here](https://github.com/gszadovszky/parquet-mr/commit/6dc3f418b537fd5cb7954018243399f39784d81b). The idea is not to skip the tests in the "normal" case but to catch the OoM and skip. This way no tests should fail on any environment. Most modern laptops should have enough memory, so the test will be executed on them.
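The "catch the OoM and skip" pattern under discussion can be sketched in plain Java; this is an illustration only, not the code from the linked commit (in JUnit the skipped case would surface as an assumption failure rather than a return value, and catching `OutOfMemoryError` is exactly the point of contention in this thread).

```java
// Sketch: treat exhausted heap as an environment limitation, not a test bug.
public class OomSkip {

    enum Outcome { PASSED, SKIPPED }

    static Outcome runOrSkip(Runnable heavyTest) {
        try {
            heavyTest.run();
            return Outcome.PASSED;
        } catch (OutOfMemoryError e) {
            // The controversial part: relying on the JVM still being usable
            // after an OOM in order to report a skip instead of a failure.
            return Outcome.SKIPPED;
        }
    }

    public static void main(String[] args) {
        System.out.println(runOrSkip(() -> {}));
        System.out.println(runOrSkip(() -> { throw new OutOfMemoryError(); }));
    }
}
```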
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352779#comment-17352779 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-849972541 @gszadovszky I raised a PR to bring your changes into my fork. Not had time yet, alas, to address the TODOs. I can say, though, that I believe reading that example file correctly with the fix requires about 10GB of heap, probably similar with your test. Agreed this test should be disabled by default; it is too heavy for CI.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352329#comment-17352329 ] ASF GitHub Bot commented on PARQUET-1633: - gszadovszky commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-849434359 @eadwright, the CI was [executed](https://github.com/gszadovszky/parquet-mr/actions/runs/879304911) somehow on my private repo and failed due to OoM. So we may either investigate whether we can tweak our configs/CI, or disable this test by default.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351932#comment-17351932 ] ASF GitHub Bot commented on PARQUET-1633: - gszadovszky commented on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-848941568 @eadwright, I've implemented a [unit test](https://github.com/gszadovszky/parquet-mr/commit/fcaf41269470c03c088b7eb5598558d44013f59d) to reproduce the issue and test your solution. Feel free to use it in your PR. I've left some TODOs for you :)
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350959#comment-17350959 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright edited a comment on pull request #902: URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-847752354 @gszadovszky Awesome, appreciated. Also note the file uploaded isn't corrupt as such; it just goes beyond 32-bit limits.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350958#comment-17350958 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-847752354

Awesome, appreciated. Also note the file uploaded isn't corrupt as such, it just goes beyond 32-bit limits.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350914#comment-17350914 ] ASF GitHub Bot commented on PARQUET-1633: - gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-847669598

@eadwright, I'll try to look into this this week and produce Java code to reproduce the issue.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350124#comment-17350124 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-846615701

Update - I've created some Java code which writes a Parquet file nearly big enough to cause the issue, and can successfully read this file. However, two problems:

* If I bump the record count so I have a file big enough to reproduce this bug 1633, another bug causes a 32-bit integer overflow. The code as it stands cannot reproduce the file which was created in Python.
* The code I'm using to read the data (the potential unit test) does not work against the Python-produced file (which has an Avro schema); I get this error, which I suspect is a classpath issue: `java.lang.NoSuchMethodError: org.apache.parquet.format.LogicalType.getSetField()Lshaded/parquet/org/apache/thrift/TFieldIdEnum;`

So... if any of you can reproduce the original error with the Parquet file I posted above, and can validate this 7-line fix addresses it, that'd be great. Open to ideas of course.
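The actual 7-line fix lives in PR #902 and is not reproduced in this digest. Purely as a hedged illustration of the general shape of such a change, the allocation count can be computed in long arithmetic so that a chunk larger than Integer.MAX_VALUE never produces a negative capacity (hypothetical sketch, not the actual patch):

```java
// Hypothetical sketch only - NOT the code from PR #902.
public class LongSafeAlloc {

    // Counts how many buffers of at most maxAllocationSize bytes are needed
    // for a chunk of `length` bytes, without ever truncating length to int.
    static int numAllocations(long length, int maxAllocationSize) {
        long fullAllocations = length / maxAllocationSize;
        long lastAllocationSize = length % maxAllocationSize;
        long n = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
        if (length < 0 || n > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("invalid chunk length: " + length);
        }
        return (int) n;
    }

    public static void main(String[] args) {
        // Block 2's compressedSize from the report, with an 8 MiB allocation cap:
        System.out.println(numAllocations(2_509_579_966L, 8 * 1024 * 1024));
    }
}
```

The key point is that the division happens on the `long` before any narrowing, so the 2GB boundary is no longer special.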
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350061#comment-17350061 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright removed a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-846560922

Work in progress looks like this (not committed) - adjust `PATH` to point to the file I uploaded.

```
package org.apache.parquet.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;
import org.junit.Test;

import java.io.IOException;

import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
import static org.apache.parquet.schema.Type.Repetition.REQUIRED;

public class TestParquetReaderLargeColumn {

  private static final Path PATH = new Path("/Volumes/HDD/random6.parquet");

  @Test
  public void test() throws IOException {
    Configuration configuration = new Configuration();
    ParquetReadOptions options = ParquetReadOptions.builder().build();
    MessageType messageType = buildSchema();
    try (ParquetFileReader reader =
        new ParquetFileReader(HadoopInputFile.fromPath(PATH, configuration), options)) {
      PageReadStore pages;
      while ((pages = reader.readNextRowGroup()) != null) {
        MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(messageType);
        RecordReader<Group> recordReader =
            columnIO.getRecordReader(pages, new GroupRecordConverter(messageType));
        long rowCount = pages.getRowCount();
        for (int i = 0; i < rowCount - 1; i++) {
          Group group = recordReader.read();
          group.getString("string", 0);
        }
      }
    }
  }

  private static MessageType buildSchema() {
    return new MessageType("AvroString", new PrimitiveType(REQUIRED, BINARY, "string"));
  }
}
```

Alas when I run it, I get this Exception, not sure why yet:

```
java.lang.NoSuchMethodError: org.apache.parquet.format.LogicalType.getSetField()Lshaded/parquet/org/apache/thrift/TFieldIdEnum;
	at org.apache.parquet.format.converter.ParquetMetadataConverter.getLogicalTypeAnnotation(ParquetMetadataConverter.java:1066)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.buildChildren(ParquetMetadataConverter.java:1569)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetSchema(ParquetMetadataConverter.java:1524)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1399)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1370)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:583)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:777)
	at org.apache.parquet.hadoop.TestParquetReaderLargeColumn.test(TestParquetReaderLargeColumn.java:64)
```
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350062#comment-17350062 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-846560068

I've uploaded the python-generated file to my Google Drive: https://drive.google.com/file/d/1trWjeJaHpqbHlnipnDUrwlwYM-AKFNWg/view?usp=sharing

Still working on a unit test. I'm used to working with Avro schemas, but I'm trying to cut it back to bare-bones.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350036#comment-17350036 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-846560922
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350035#comment-17350035 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-846560068

I've uploaded the python-generated file to my Google Drive: https://drive.google.com/file/d/1trWjeJaHpqbHlnipnDUrwlwYM-AKFNWg/view?usp=sharing

Still working on a unit test. I'm used to working with Avro schemas, but I'm trying to cut it back to bare-bones.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347767#comment-17347767 ] ASF GitHub Bot commented on PARQUET-1633: - eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844246494

I've tweaked the Python to create a test file which Java can't read. The Python now runs fine on a 16GB machine.

```
import pandas as pd
import numpy as np

rand_array = np.random.rand(38000000, 3)
df = pd.DataFrame(rand_array, columns=["number1", "number2", "number3"])
df["string"] = df["number1"].astype(str) + df["number2"].astype(str) + df["number3"].astype(str)
df.drop(["number1", "number2", "number3"], axis=1, inplace=True)
df.to_parquet("random.parquet", compression=None, engine="pyarrow", **{"row_group_size": 37800000})
```

It creates 38M records, 37.8M of which are in the first row group, and the data for the `string` column in the first row group is about 2.1GB in size, over the threshold to cause the Java bug.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347759#comment-17347759 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844228742

Correction: a column within a row group must be over 2GB in size to cause the issue. It is not the total size of the row group that counts.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> ----------------------------------------------------------
>
> Key: PARQUET-1633
> URL: https://issues.apache.org/jira/browse/PARQUET-1633
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.10.1
> Reporter: Ivan Sadikov
> Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
>   ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
>   at java.util.ArrayList.<init>(ArrayList.java:157)
>   at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169)
> {code}
>
> The file metadata is:
> * block 1 (3 columns): rowCount 110,100; totalByteSize 348,492,072; compressedSize 165,689,649
> * block 2 (3 columns): rowCount 90,054; totalByteSize 3,243,165,541; compressedSize 2,509,579,966
> * block 3 (3 columns): rowCount 105,119; totalByteSize 350,901,693; compressedSize 144,952,177
> * block 4 (3 columns): rowCount 48,741; totalByteSize 1,275,995; compressedSize 914,205
>
> I don't have the code to reproduce the issue, unfortunately; however, I looked at the code and it seems that the integer {{length}} field in ConsecutiveChunkList overflows, which results in a negative capacity for the array list in the {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);
> {code}
> This is caused by the cast to int in the {{readNextRowGroup}} method of ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, (int) mc.getTotalSize()));
> {code}
> which overflows when the total size of the column is larger than Integer.MAX_VALUE.
>
> I would appreciate it if you could help addressing the issue. Thanks!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
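The reported "Illegal Capacity: -212" is consistent with this analysis. Assuming parquet-mr's default max allocation size of 8 MiB (8,388,608 bytes — an assumption; the actual value is configurable), casting block 2's 2,509,579,966-byte compressed size to `int` and running the `readAll` arithmetic reproduces exactly that value. A minimal sketch:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // Compressed size of block 2 from the issue metadata; larger than Integer.MAX_VALUE.
        long totalSize = 2_509_579_966L;

        // The narrowing cast in readNextRowGroup wraps to a negative value.
        int length = (int) totalSize;
        System.out.println(length); // -1785387330

        // Assumed default max allocation size of 8 MiB.
        int maxAllocationSize = 8 * 1024 * 1024;

        // The readAll arithmetic from the issue, applied to the corrupted length.
        // Java integer division truncates toward zero, and the negative remainder
        // fails the "> 0" check, so no extra allocation is added.
        int fullAllocations = length / maxAllocationSize;
        int lastAllocationSize = length % maxAllocationSize;
        int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
        System.out.println(numAllocations); // -212, the "Illegal Capacity" in the stack trace
    }
}
```

That the stack trace's -212 falls out of the metadata in the same report is strong evidence the overflow happens exactly where described.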
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347757#comment-17347757 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844246494

I've tweaked the python to create a test file which Java can't read. The python can now run fine on a 16GB machine.

```
import pandas as pd
import numpy as np

rand_array = np.random.rand(4800, 3)
df = pd.DataFrame(rand_array, columns=["number1", "number2", "number3"])
df["string1"] = df["number1"].astype(str) + df["number2"].astype(str) + df["number3"].astype(str)
df.drop(["number1", "number2", "number3"], axis=1, inplace=True)
df.to_parquet("random.parquet", compression="snappy", engine="pyarrow", **{"row_group_size": 4780})
```

It creates 48M records, 47.8M of which are in the first row group, and the data for the `string1` column in the first row group is about 2.1GB in size, over the threshold to cause the Java bug.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347736#comment-17347736 ]

ASF GitHub Bot commented on PARQUET-1633:

eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844212961

The build failed due to a transient connectivity issue; it builds fine for me on Java 8 and 11.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347723#comment-17347723 ]

ASF GitHub Bot commented on PARQUET-1633:

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844187674

Thanks, @eadwright, for explaining. I get it now.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347707#comment-17347707 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-844147358

> @eadwright, what do you mean by "necessary for the rows to spill over into a second row group."? It shall not be possible. Even the pages keep row boundaries, but for row groups it is required by the specification.

Sorry, I probably didn't phrase that well. I mean that for this bug to occur you need i) a row group taking more than 2GB of space, to trigger the 2^31 signed-int overflow, and ii) a subsequent row group (of any size), so the buggy code adds the corrupted value to a file offset. In the example Python code, if I ask it to produce 50M rows instead of 75M, you get a ~3.3GB row group but no second row group. The file-offset addition code path is then not executed, the file is read correctly, and the bug is not triggered.
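The 2^31 threshold eadwright mentions can be seen directly: the wrap happens on the narrowing cast as soon as a chunk size reaches 2 GiB (a generic illustration, not parquet-mr code):

```java
public class WrapThreshold {
    public static void main(String[] args) {
        long limit = 1L << 31; // 2,147,483,648 bytes, i.e. 2 GiB

        System.out.println((int) (limit - 1)); // 2147483647 -- still representable
        System.out.println((int) limit);       // -2147483648 -- wrapped negative

        // Adding a wrapped (negative) length to a file position moves the
        // reader backwards instead of forwards, which is why a *subsequent*
        // row group is needed to actually surface the failure.
    }
}
```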
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347586#comment-17347586 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-84394

@eadwright, what do you mean by "necessary for the rows to spill over into a second row group."? That should not be possible. Even pages keep row boundaries, and for row groups it is required by the specification.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347411#comment-17347411 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

eadwright edited a comment on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-843868178

Screenshot from `parquet-tools meta` looking at the generated file. Note that the TS (total size) of row group 1 is well over 2GB. For the bug to happen, it is also necessary for the rows to spill over into a second row group.

![image](https://user-images.githubusercontent.com/17048626/118780317-d349fc80-b883-11eb-8337-23dcbe433b5a.png)
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347408#comment-17347408 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

eadwright commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-843868178

Screenshot from `parquet-tools meta` looking at the generated file. Note that the TS (total size) of row group 1 is well over 2GB.

![image](https://user-images.githubusercontent.com/17048626/118780317-d349fc80-b883-11eb-8337-23dcbe433b5a.png)
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347400#comment-17347400 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

eadwright commented on a change in pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#discussion_r635013899

File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

@@ -1464,7 +1464,7 @@ protected PageHeader readPageHeader(BlockCipher.Decryptor blockDecryptor, byte[]
    */
   private void verifyCrc(int referenceCrc, byte[] bytes, String exceptionMsg) {
     crc.reset();
-    crc.update(bytes);
+    crc.update(bytes, 0, bytes.length);

Review comment:
Agreed, committed to revert this change.
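The two `CRC32` calls in the diff above checksum the same bytes; `update(byte[])` is a convenience overload of `update(byte[], int, int)` over the whole array, so reverting the change is behavior-neutral. A quick sketch confirms the equivalence:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CrcEquivalence {
    public static void main(String[] args) {
        byte[] bytes = "parquet".getBytes(StandardCharsets.US_ASCII);

        CRC32 whole = new CRC32();
        whole.update(bytes);                   // whole-array convenience overload

        CRC32 ranged = new CRC32();
        ranged.update(bytes, 0, bytes.length); // explicit-range overload

        // Identical input bytes yield identical checksums.
        System.out.println(whole.getValue() == ranged.getValue()); // true
    }
}
```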
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346373#comment-17346373 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

advancedxy commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-841937603

> @advancedxy, thanks for explaining.
> I think the best option is 2. It is up to the user to provide enough resources for handling the large row groups, or not to write them.
> Meanwhile, even though I've written that I can accept the lack of unit tests in some situations, my concern in this case is that I am not sure every aspect of a large row group is handled properly. So we clearly need to validate this fix with such large row groups. This test could even be implemented in this source code, but we must not include it in the unit tests or integration tests we run regularly.

@gszadovszky well understood.

Hi @eadwright, is it possible for you to add the test case (tagged with @Ignore) in your PR, so others like @gszadovszky or me can verify it offline?
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344499#comment-17344499 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-841139504

@advancedxy, thanks for explaining. I think the best option is 2. It is up to the user to provide enough resources for handling the large row groups, or not to write them.

Meanwhile, even though I've written that I can accept the lack of unit tests in some situations, my concern in this case is that I am not sure every aspect of a large row group is handled properly. So we clearly need to validate this fix with such large row groups. This test could even be implemented in this source code, but we must not include it in the unit tests or integration tests we run regularly.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344485#comment-17344485 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

advancedxy commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-841124415

> @eadwright,
>
> I'll try to summarize the issue; please correct me if I'm wrong. Parquet-mr is not able to write such big row groups (>2GB) because of the `int` array size limitation. Meanwhile, both the format and some other implementations allow such big row groups, so parquet-mr shall be prepared for this in some way.
>
> One option is to "simply" read the large row groups. It would require significant effort to use proper memory-handling objects that support reading the large row groups. (A similar effort would also make parquet-mr able to write row groups larger than 2GB.)
>
> The other option is to handle too-large row groups with a proper error message in parquet-mr, without allowing silent overflows. This second option would be this effort. It is great to handle the potential int overflows, but the main point, I think, would be at the footer conversion (`ParquetMetadataConverter`) where we create our own object structure from the file footer. At that point we can throw proper error messages if the row group is too large to be handled (for now) in parquet-mr.
>
> BTW, it might not be enough to check the potential overflows to validate whether we can read a row group of that size. (See e.g. the source code of [ArrayList](https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229).)
>
> About the lack of unit tests: I can accept some cases where unit tests are not practically feasible to implement. In these cases I usually ask to validate the code offline.

Hi @gszadovszky, parquet-mr is able to produce big row groups. We found some files written by Spark (which uses parquet-mr) that have this problem; see https://issues.apache.org/jira/browse/PARQUET-2045 for details. There are two options to fix this problem:

1. Fail at the writer side when creating such a large row group/column chunk.
2. Support large row groups at the reader side, which is this approach. It would require a lot of resources, but it's feasible.

Either option is fine for me. WDYT?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> ----------------------------------------------------------
>
>                 Key: PARQUET-1633
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1633
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Ivan Sadikov
>            Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
>   ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
>   at java.util.ArrayList.<init>(ArrayList.java:157)
>   at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
>
> The file metadata is:
> * block 1 (3 columns)
> ** rowCount: 110,100
> ** totalByteSize: 348,492,072
> ** compressedSize: 165,689,649
> * block 2 (3 columns)
> ** rowCount: 90,054
> ** totalByteSize: 3,243,165,541
> ** compressedSize: 2,509,579,966
> * block 3 (3 columns)
> ** rowCount: 105,119
> ** totalByteSize: 350,901,693
> ** compressedSize: 144,952,177
> * block 4 (3 columns)
> ** rowCount: 48,741
> ** totalByteSize: 1,275,995
> ** compressedSize: 914,205
>
> I don't have the code to reproduce the issue, unfortunately; however, I looked at the code and it seems that the integer {{length}} field in ConsecutiveChunkList overflows, which results in a negative capacity for the array list in the {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
>
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);{code}
>
> This is caused by the cast to integer in the {{readNextRowGroup}} method in ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, (int)mc.getTotalSize()));
> {code}
> which overflows when the total size of the column is larger than Integer.MAX_VALUE.
> I would appreciate it if you could help address the issue. Thanks!

--
This message was sent by Atlassian Jira (v8.3.4#803005)
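The failure mode described in the report can be reproduced in isolation: narrowing a >2GB `long` to `int` wraps to a negative value, and that negative value is what `ArrayList` rejects as an "Illegal Capacity". A minimal standalone sketch (the 2,509,579,966-byte figure is the compressed size of block 2 from the report; the class and method names here are illustrative, not parquet-mr code):

```java
import java.util.ArrayList;

public class OverflowDemo {
    // Mirrors the narrowing cast in readNextRowGroup: (int) mc.getTotalSize()
    static int narrow(long totalSize) {
        return (int) totalSize; // wraps silently past Integer.MAX_VALUE
    }

    public static void main(String[] args) {
        long totalSize = 2_509_579_966L; // compressed size of block 2 in the report
        int length = narrow(totalSize);
        System.out.println(length); // negative after the wrap-around

        try {
            // A capacity derived from the negative length reaches ArrayList,
            // which throws the same IllegalArgumentException seen in the stack trace.
            new ArrayList<Object>(length);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // "Illegal Capacity: ..."
        }
    }
}
```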
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344482#comment-17344482 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

advancedxy commented on a change in pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#discussion_r632394126

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

@@ -1464,7 +1464,7 @@ protected PageHeader readPageHeader(BlockCipher.Decryptor blockDecryptor, byte[]
    */
   private void verifyCrc(int referenceCrc, byte[] bytes, String exceptionMsg) {
     crc.reset();
-    crc.update(bytes);
+    crc.update(bytes, 0, bytes.length);

Review comment: This is unrelated; I would prefer to update this in another PR.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

@@ -1763,8 +1763,8 @@ public void addChunk(ChunkDescriptor descriptor) {
   public void readAll(SeekableInputStream f, ChunkListBuilder builder) throws IOException {
     f.seek(offset);
-    int fullAllocations = length / options.getMaxAllocationSize();
-    int lastAllocationSize = length % options.getMaxAllocationSize();
+    int fullAllocations = (int)(length / options.getMaxAllocationSize());

Review comment: `(int)(length / options.getMaxAllocationSize())` -> `Math.toIntExact(length / options.getMaxAllocationSize());`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
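The difference the reviewer is pointing at: a plain `(int)` cast truncates silently, while `Math.toIntExact` (Java 8+) throws an `ArithmeticException` when the `long` does not fit. A sketch of the `readAll` arithmetic with a stand-in 8 MiB max allocation size (the constant is an assumption for illustration, not a value taken from parquet-mr's configuration):

```java
public class ToIntExactDemo {
    public static void main(String[] args) {
        long length = 3_000_000_000L;            // > Integer.MAX_VALUE
        int maxAllocationSize = 8 * 1024 * 1024; // assumed 8 MiB allocation cap

        // Both quotient and remainder fit in an int even though length does not;
        // Math.toIntExact would fail loudly if they ever did not.
        int fullAllocations = Math.toIntExact(length / maxAllocationSize);
        int lastAllocationSize = Math.toIntExact(length % maxAllocationSize);
        int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
        System.out.println(numAllocations);

        // Contrast with the silent cast: the checked conversion throws instead.
        try {
            Math.toIntExact(Long.MAX_VALUE);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // "integer overflow"
        }
    }
}
```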
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341756#comment-17341756 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

gszadovszky commented on pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#issuecomment-836325628

@eadwright, I'll try to summarize the issue; please correct me if I'm wrong. Parquet-mr is not able to write such big row groups (>2GB) because of the `int` array size limitation. Meanwhile, both the format and some other implementations allow such big row groups, so parquet-mr shall be prepared for this in some way.

One option is to "simply" read the large row groups. It would require significant effort to use proper memory-handling objects that support reading the large row groups. (A similar effort would also make parquet-mr able to write row groups larger than 2GB.)

The other option is to handle too-large row groups with a proper error message in parquet-mr, without allowing silent overflows. This second option would be this effort. It is great to handle the potential int overflows, but the main point, I think, would be at the footer conversion (`ParquetMetadataConverter`) where we create our own object structure from the file footer. At that point we can throw proper error messages if the row group is too large to be handled (for now) in parquet-mr.

BTW, it might not be enough to check the potential overflows to validate whether we can read a row group of that size. (See e.g. the source code of [ArrayList](https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229).)

About the lack of unit tests: I can accept some cases where unit tests are not practically feasible to implement. In these cases I usually ask to validate the code offline.

--
This is an automated message from the Apache Git Service.
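The footer-time validation suggested above could look roughly like the sketch below. The method name, message text, and exact limit are assumptions for illustration; the margin below `Integer.MAX_VALUE` echoes the linked `ArrayList` source, where the practical array cap sits a few bytes under the `int` maximum:

```java
public class RowGroupSizeCheck {
    // Assumption mirroring ArrayList.MAX_ARRAY_SIZE: some VMs reserve a few
    // header words, so the usable array limit is slightly below Integer.MAX_VALUE.
    static final long MAX_SUPPORTED_BYTES = Integer.MAX_VALUE - 8;

    // Hypothetical guard to run during footer conversion, before any
    // narrowing cast has a chance to overflow silently.
    static void validateRowGroupSize(long totalCompressedSize, String file) {
        if (totalCompressedSize > MAX_SUPPORTED_BYTES) {
            throw new IllegalArgumentException(
                "Row group of " + totalCompressedSize + " bytes in " + file
                + " is larger than parquet-mr can currently read ("
                + MAX_SUPPORTED_BYTES + " bytes)");
        }
    }

    public static void main(String[] args) {
        validateRowGroupSize(165_689_649L, "ok.parquet");        // block 1: accepted
        try {
            validateRowGroupSize(2_509_579_966L, "big.parquet"); // block 2: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing here, with the offending size and file in the message, replaces the opaque "Illegal Capacity: -212" deep inside `readAll`.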
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338998#comment-17338998 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

eadwright commented on a change in pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902#discussion_r625792180

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java

@@ -1464,7 +1464,7 @@ protected PageHeader readPageHeader(BlockCipher.Decryptor blockDecryptor, byte[]
    */
   private void verifyCrc(int referenceCrc, byte[] bytes, String exceptionMsg) {
     crc.reset();
-    crc.update(bytes);
+    crc.update(bytes, 0, bytes.length);

Review comment: Changed to adopt a Java 8 API, to be consistent with the pom.
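Some context on why the three-argument overload is the "Java 8 API" (one plausible reading, stated as an assumption: if the checksum field is declared through the `Checksum` interface, the single-argument `update(byte[])` is only a Java 9+ default method, while `update(byte[], int, int)` has been on the interface from the start). A quick sketch; `"123456789"` is the standard CRC-32 check-value input:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

public class CrcDemo {
    public static void main(String[] args) {
        // Declared through the interface: Checksum.update(byte[]) is a Java 9+
        // default method, so the three-argument call is the Java 8-safe choice.
        Checksum crc = new CRC32();
        byte[] bytes = "123456789".getBytes(StandardCharsets.US_ASCII);
        crc.reset();
        crc.update(bytes, 0, bytes.length);
        // CRC-32 of "123456789" is the well-known check value 0xCBF43926.
        System.out.println(Long.toHexString(crc.getValue()));
    }
}
```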
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338988#comment-17338988 ]

ASF GitHub Bot commented on PARQUET-1633:
-----------------------------------------

eadwright opened a new pull request #902:
URL: https://github.com/apache/parquet-mr/pull/902

This PR addresses this issue: https://issues.apache.org/jira/browse/PARQUET-1633

I have not added unit tests: to check the overflow conditions I would need test data over 2GB in size (on disk, compressed), considerably larger in memory, and thus requiring significant CI resources.

The issue was the use of an `int` for the length field, which for parquet files with a very large `row_group_size` (row groups over 2GB) causes silent integer overflow, manifesting itself as a negative length and an attempt to create an ArrayList with negative capacity.
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903730#comment-16903730 ]

Adrian Ionescu commented on PARQUET-1633:
-----------------------------------------

I agree with [~sadikovi] here: being able to write a file that is then unreadable is a pretty serious bug (i.e. data corruption).
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902096#comment-16902096 ]

Ivan Sadikov commented on PARQUET-1633:
---------------------------------------

I would say it is a bug. There are a few instances of overflow I found in parquet-mr, another example being https://issues.apache.org/jira/browse/PARQUET-1632.

I also have a simple repro with Spark; by the way, if you make records shorter, the problem still persists:
{code:java}
import org.apache.spark.sql.functions._

val large_str = udf(() => "a" * (128 * 1024 * 1024))
val df = spark.range(0, 20, 1, 1).withColumn("large_str", large_str())

spark.conf.set("parquet.enable.dictionary", "false")
spark.conf.set("parquet.page.size.row.check.min", "1") // this is done so I don't hit PARQUET-1632

df.write.option("compression", "uncompressed").mode("overwrite").parquet("/mnt/large.parquet")
spark.read.parquet("/mnt/large.parquet").foreach(_ => Unit) // Fails{code}

Here is the stacktrace:
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 22.0 failed 4 times, most recent failure: Lost task 10.3 in stage 22.0 (TID 121, 10.0.217.97, executor 1): java.lang.IllegalArgumentException: Illegal Capacity: -191
  at java.util.ArrayList.<init>(ArrayList.java:157)
  at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169)
  at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:811){code}

Well, there is no restriction on row group size in the parquet format. Also, it is not a significant effort to patch this issue: making the ChunkDescriptor size field, as well as the ConsecutiveChunkList length, have long type should fix the problem (it already has long type in the metadata).
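The shape of the fix Ivan proposes can be sketched with stand-in classes. The class and field names echo the parquet-mr types under discussion, but this is an illustration of widening the fields to `long`, not the actual patch:

```java
public class LongLengthSketch {
    // Stand-in for parquet-mr's ChunkDescriptor with the size widened to long.
    static final class ChunkDescriptor {
        final long size;            // was: int size
        ChunkDescriptor(long size) { this.size = size; }
    }

    // Stand-in for ConsecutiveChunkList accumulating a long length.
    static final class ConsecutiveChunkList {
        long length;                // was: int length
        void addChunk(ChunkDescriptor d) {
            length += d.size;       // long arithmetic: no narrowing, no silent wrap
        }
    }

    public static void main(String[] args) {
        ConsecutiveChunkList chunks = new ConsecutiveChunkList();
        chunks.addChunk(new ChunkDescriptor(2_000_000_000L));
        chunks.addChunk(new ChunkDescriptor(2_000_000_000L));
        System.out.println(chunks.length); // 4000000000, which an int cannot hold
    }
}
```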
[jira] [Commented] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList
[ https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901963#comment-16901963 ]

Gabor Szadovszky commented on PARQUET-1633:
-------------------------------------------

While parquet-mr does not catch the overflow properly, I would not say it is a bug that it cannot read this file. A row group should be able to fit in memory. From the Java point of view we can only use a {{byte[]}} or a {{ByteBuffer}} of a size a bit below {{2^31^}}, so a row group of ~3GB (uncompressed size) seems to be too large.

Of course, it is possible to handle such large row groups, but it would require significant effort to rewrite the related code paths, and I am not sure we should. I think the problem is not that we cannot read this file but that we should never write such large row groups. What tools did you use to create this file, and with what configuration? It is very strange that the 1st row group is ~350M and the 3rd one as well, while the 2nd one is almost 10 times bigger.
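Gabor's point about the JVM-side limit can be checked directly: array and `ByteBuffer` capacities are `int`-typed, so the ~3GB uncompressed block 2 cannot be expressed as a single buffer size at all, and casting it to `int` just goes negative. A small check (the size is taken from the report's metadata):

```java
public class BufferLimitDemo {
    public static void main(String[] args) {
        long uncompressed = 3_243_165_541L; // totalByteSize of block 2

        // new byte[n] and ByteBuffer.allocate(n) both take an int capacity,
        // so this size is unrepresentable as a single allocation.
        System.out.println(uncompressed > Integer.MAX_VALUE); // true
        System.out.println((int) uncompressed);               // negative: the silent wrap
    }
}
```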