Ryan Sachs created PARQUET-1359:
-----------------------------------

             Summary: Out of Memory when reading large parquet file
                 Key: PARQUET-1359
                 URL: https://issues.apache.org/jira/browse/PARQUET-1359
             Project: Parquet
          Issue Type: Bug
            Reporter: Ryan Sachs
Hi,

We are successfully reading parquet files block by block, but we run into a JVM out-of-memory error in a certain edge case. Consider the following scenario:
* The parquet file has one column and one block (row group) and is 10 GB.
* Our JVM heap is 5 GB.

Is there any way to read such a file? Below are our stack trace and implementation.

{code:java}
Caused by: java.lang.OutOfMemoryError: Java heap space
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:778)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:511)
{code}

{code:java}
try {
    ParquetMetadata readFooter =
            ParquetFileReader.readFooter(hfsConfig, path, ParquetMetadataConverter.NO_FILTER);
    MessageType schema = readFooter.getFileMetaData().getSchema();
    // Size in bytes of the largest row group in the file
    long largestBlockSize = readFooter.getBlocks().stream()
            .reduce(0L,
                    (left, right) -> left > right.getTotalByteSize() ? left : right.getTotalByteSize(),
                    (leftSize, rightSize) -> leftSize > rightSize ? leftSize : rightSize);
    for (BlockMetaData block : readFooter.getBlocks()) {
        try {
            fileReader = new ParquetFileReader(hfsConfig, readFooter.getFileMetaData(), path,
                    Collections.singletonList(block), schema.getColumns());
            PageReadStore pages;
            // Exception gets thrown here on blocks larger than the JVM heap
            while (null != (pages = fileReader.readNextRowGroup())) {
                final long rows = pages.getRowCount();
                final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                final RecordReader<Group> recordReader =
                        columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                for (int i = 0; i < rows; i++) {
                    final Group group = recordReader.read();
                    int fieldCount = group.getType().getFieldCount();
                    for (int field = 0; field < fieldCount; field++) {
                        int valueCount = group.getFieldRepetitionCount(field);
                        Type fieldType = group.getType().getType(field);
                        String fieldName = fieldType.getName();
                        for (int index = 0; index < valueCount; index++) {
                            // Process data
                        }
                    }
                }
            }
        } catch (IOException e) {
            ...
        } finally {
            ...
        }
    }
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
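As an aside on the snippet above: the three-argument {{reduce}} there is simply computing the maximum {{getTotalByteSize()}} over all row groups in the footer, which is exactly the number that determines whether {{readNextRowGroup()}} can succeed, since it materializes an entire row group in memory. A minimal, self-contained sketch of that same pattern, with plain longs standing in for {{BlockMetaData}} values; the class name and the fail-fast comparison against {{Runtime.maxMemory()}} are illustrative additions, not part of the original report:

```java
import java.util.Arrays;
import java.util.List;

public class MaxRowGroupSize {

    // Same three-argument reduce as in the report's snippet:
    // identity, accumulator, and combiner all compute a running maximum.
    static long maxRowGroupSize(List<Long> blockSizes) {
        return blockSizes.stream()
                .reduce(0L,
                        (left, right) -> left > right ? left : right,
                        (l, r) -> l > r ? l : r);
    }

    public static void main(String[] args) {
        // Hypothetical row-group sizes: 128 MB, 10 GB, 64 MB.
        List<Long> sizes = Arrays.asList(
                128L * 1024 * 1024,
                10L * 1024 * 1024 * 1024,
                64L * 1024 * 1024);

        long largest = maxRowGroupSize(sizes);

        // A row group larger than the available heap cannot be read via
        // readNextRowGroup(), so comparing against the maximum heap size
        // lets the caller fail fast instead of hitting OutOfMemoryError.
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (largest > maxHeap) {
            System.out.println("Largest row group (" + largest
                    + " bytes) exceeds max heap (" + maxHeap + " bytes)");
        } else {
            System.out.println("Largest row group fits: " + largest + " bytes");
        }
    }
}
```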