Ryan Sachs created PARQUET-1359:
-----------------------------------

             Summary: Out of Memory when reading large parquet file
                 Key: PARQUET-1359
                 URL: https://issues.apache.org/jira/browse/PARQUET-1359
             Project: Parquet
          Issue Type: Bug
            Reporter: Ryan Sachs
Hi,

We are successfully reading parquet files block by block, but we run into a JVM out-of-memory error in a certain edge case. Consider the following scenario:
* The parquet file has one column and one block (row group) and is 10 GB.
* Our JVM heap is 5 GB.

Is there any way to read such a file? Below are our stack trace and implementation.

{code:java}
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
{code}

{code:java}
try {
    ParquetMetadata readFooter =
            ParquetFileReader.readFooter(hfsConfig, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = readFooter.getFileMetaData().getSchema();
    // Size in bytes of the largest row group in the file
    long largestBlockSize = readFooter.getBlocks().stream()
            .reduce(0L,
                    (left, right) -> left > right.getTotalByteSize() ? left : right.getTotalByteSize(),
                    (leftSize, rightSize) -> leftSize > rightSize ? leftSize : rightSize);
    for (BlockMetaData block : readFooter.getBlocks()) {
        try {
            fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(), path,
                    Collections.singletonList(block), schema.getColumns());
            PageReadStore pages;
            // Exception gets thrown here on blocks larger than the JVM heap
            while (null != (pages = fileReader.readNextRowGroup())) {
                final long rows = pages.getRowCount();
                final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                final RecordReader<Group> recordReader =
                        columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                for (int i = 0; i < rows; i++) {
                    final Group group = recordReader.read();
                    int fieldCount = group.getType().getFieldCount();
                    for (int field = 0; field < fieldCount; field++) {
                        int valueCount = group.getFieldRepetitionCount(field);
                        Type fieldType = group.getType().getType(field);
                        String fieldName = fieldType.getName();
                        for (int index = 0; index < valueCount; index++) {
                            // Process data
                        }
                    }
                }
            }
        } catch (IOException e) {
            ...
        } finally {
            ...
        }
    }
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
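As an aside on the snippet above: the three-argument {{reduce}} there is simply computing the maximum {{getTotalByteSize()}} over all row groups in the footer, which is exactly the number that determines whether {{readNextRowGroup()}} can succeed, since it materializes an entire row group in memory. A minimal, self-contained sketch of that same pattern, with plain longs standing in for {{BlockMetaData}} values; the class name and the fail-fast comparison against {{Runtime.maxMemory()}} are illustrative additions, not part of the original report:

```java
import java.util.Arrays;
import java.util.List;

public class MaxRowGroupSize {

    // Same three-argument reduce as in the report's snippet:
    // identity, accumulator, and combiner all compute a running maximum.
    static long maxRowGroupSize(List<Long> blockSizes) {
        return blockSizes.stream()
                .reduce(0L,
                        (left, right) -> left > right ? left : right,
                        (l, r) -> l > r ? l : r);
    }

    public static void main(String[] args) {
        // Hypothetical row-group sizes: 128 MB, 10 GB, 64 MB.
        List<Long> sizes = Arrays.asList(
                128L * 1024 * 1024,
                10L * 1024 * 1024 * 1024,
                64L * 1024 * 1024);

        long largest = maxRowGroupSize(sizes);

        // A row group larger than the available heap cannot be read via
        // readNextRowGroup(), so comparing against the maximum heap size
        // lets the caller fail fast instead of hitting OutOfMemoryError.
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (largest > maxHeap) {
            System.out.println("Largest row group (" + largest
                    + " bytes) exceeds max heap (" + maxHeap + " bytes)");
        } else {
            System.out.println("Largest row group fits: " + largest + " bytes");
        }
    }
}
```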