hy5446 created PARQUET-282:
------------------------------
Summary: OutOfMemoryError in job commit / ParquetMetadataConverter
Key: PARQUET-282
URL: https://issues.apache.org/jira/browse/PARQUET-282
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.6.0
Environment: CentOS, MapR, Scalding
Reporter: hy5446
Priority: Critical
We're trying to write some 14 billion rows (about 3.6 TB of Parquet data) to
Parquet files. When our ETL job finishes, it throws the exception below, and
the job status is "died in job commit".
2015-05-14 09:24:28,158 FATAL [CommitterEvent Processor #4] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[CommitterEvent Processor #4,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.ByteBuffer.wrap(ByteBuffer.java:373)
at java.nio.ByteBuffer.wrap(ByteBuffer.java:396)
at parquet.format.Statistics.setMin(Statistics.java:237)
at parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:243)
at parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:167)
at parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:79)
at parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:405)
at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:433)
at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:423)
at parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at parquet.hadoop.mapred.MapredParquetOutputCommitter.commitJob(MapredParquetOutputCommitter.java:43)
at org.apache.hadoop.mapred.OutputCommitter.commitJob(OutputCommitter.java:259)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobCommit(CommitterEventHandler.java:253)
at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:216)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This seems to be related to creation of the _metadata summary file, as the
Parquet data files themselves are perfectly fine and usable. I'm also not sure
how to alleviate this (e.g. by adding more heap space), since the crash happens
outside the Map/Reduce tasks themselves, in the job/application controller (the
MR ApplicationMaster).
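In case it is useful, a minimal sketch of two possible mitigations rather than a verified fix. Per the trace, the commit path aggregates every output file's footer (including per-column min/max statistics for each row group, which Statistics.setMin copies via ByteBuffer.wrap) into a single in-memory ParquetMetadata before serializing it, so heap use grows with files x row groups x columns. The property names below are the standard parquet-mr and Hadoop 2 / YARN settings; whether parquet.enable.summary-metadata is honored by 1.6.0, and the memory values themselves, are assumptions to verify.

import org.apache.hadoop.conf.Configuration;

public class ParquetCommitWorkarounds {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // 1) Skip the _metadata summary file entirely, so commitJob never
        //    has to hold every output file's footer in memory at once.
        //    This is ParquetOutputFormat.ENABLE_JOB_SUMMARY in parquet-mr;
        //    whether 1.6.0 honors it is an assumption to verify.
        conf.setBoolean("parquet.enable.summary-metadata", false);

        // 2) Alternatively, give the MR ApplicationMaster (where commitJob
        //    runs) a bigger heap; mapper/reducer memory settings do not
        //    apply to it. The 4 GB container / ~3.2 GB heap values are
        //    placeholders, not tuned recommendations.
        conf.setInt("yarn.app.mapreduce.am.resource.mb", 4096);
        conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx3276m");

        System.out.println("parquet.enable.summary-metadata = "
                + conf.get("parquet.enable.summary-metadata"));
    }
}

Disabling the summary file only removes the _metadata shortcut for readers; the
data files themselves are unaffected, which matches what we see above.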