Tony Yan created PARQUET-284:
--------------------------------
Summary: Should use ConcurrentHashMap instead of HashMap in ParquetMetadataConverter
Key: PARQUET-284
URL: https://issues.apache.org/jira/browse/PARQUET-284
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.6.1
Environment: Spark 1.2.1, CentOS 6.4
Reporter: Tony Yan
When using Parquet in a Spark environment, tasks sometimes hang with the
following thread dump:
"Executor task launch worker-0" daemon prio=10 tid=0x000000004073d000 nid=0xd6c5 runnable [0x00007ff3fda40000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.get(HashMap.java:303)
at parquet.format.converter.ParquetMetadataConverter.fromFormatEncodings(ParquetMetadataConverter.java:218)
at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:161)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
From the source code of ParquetMetadataConverter:
private Map> encodingLists = new HashMap>();
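It uses HashMap instead of ConcurrentHashMap. HashMap is not thread-safe:
concurrent modification can corrupt its internal bucket chains and leave a
reader looping forever inside HashMap.get, which would be consistent with the
RUNNABLE thread in the dump above. So it needs to be changed to
ConcurrentHashMap.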
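
A minimal sketch of the proposed change, not the actual parquet-mr code. The
class name EncodingCache and the use of String in place of the real encoding
types are assumptions made for illustration; only the map swap and the
fromFormatEncodings name come from this report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the encoding cache in ParquetMetadataConverter.
// Strings replace the real encoding types to keep the sketch self-contained.
final class EncodingCache {
    // Shared by all reader threads, so a concurrent map is used instead of HashMap.
    private final ConcurrentMap<List<String>, Set<String>> encodingLists =
        new ConcurrentHashMap<List<String>, Set<String>>();

    Set<String> fromFormatEncodings(List<String> encodings) {
        Set<String> cached = encodingLists.get(encodings);
        if (cached == null) {
            // Copy the key so later changes to the caller's list cannot affect the cache.
            List<String> key =
                Collections.unmodifiableList(new ArrayList<String>(encodings));
            Set<String> fresh =
                Collections.unmodifiableSet(new HashSet<String>(encodings));
            // putIfAbsent returns the existing value if another thread inserted first,
            // or null if this thread's value was stored.
            Set<String> previous = encodingLists.putIfAbsent(key, fresh);
            cached = (previous != null) ? previous : fresh;
        }
        return cached;
    }
}
```

With this shape, two threads that look up the same encoding list at the same
time both get the same immutable set back, and neither can observe the map in
a half-resized state.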
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)