[
https://issues.apache.org/jira/browse/PARQUET-328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634083#comment-14634083
]
Alex Levenson commented on PARQUET-328:
---------------------------------------
Have you seen: PARQUET-284 ?
Concurrent access to a HashMap can cause a deadlock; I wonder if that's what's
happening here?
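(A minimal, self-contained sketch of the hazard and the usual fix — not taken from the Parquet or Hadoop code bases; the thread count and key range are illustrative. A plain `HashMap` written from several threads can lose entries or, on older JVMs, spin forever inside `put()` during a concurrent resize — the same symptom as the `HashMap.put` frame in the trace below — whereas `ConcurrentHashMap` is safe:)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class ConcurrentPutDemo {
    public static void main(String[] args) throws InterruptedException {
        // Swap in `new java.util.HashMap<>()` here and this program may
        // lose entries or hang; ConcurrentHashMap makes concurrent puts safe.
        Map<Integer, Integer> map = new ConcurrentHashMap<>();

        int threads = 8, perThread = 10_000;
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, i);  // disjoint keys per thread
                }
                done.countDown();
            }).start();
        }
        done.await();

        // With ConcurrentHashMap every put is retained.
        if (map.size() != threads * perThread) {
            throw new AssertionError("lost entries: " + map.size());
        }
        System.out.println("size=" + map.size());
    }
}
```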
Hadoop has a caching mechanism for FileSystem.get() -- which is synchronized
-- but I don't think the returned FileSystem objects are themselves thread
safe (maybe FileSystem.get() uses a thread local? not sure). So either way,
as far as I know, care has to be taken not to pass fs instances across
threads. It's still important to cache the FS instances, though: in the past
we've tried using FileSystem.newInstance() instead of the cache, but it turns
out that constructing an fs instance is very expensive (similar to
constructing a new Configuration object, it involves parsing XML off disk and
such).
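(The thread-local idea speculated about above can be sketched as follows. This is an illustration only, not Hadoop's actual implementation; `ExpensiveClient` is a hypothetical stand-in for a FileSystem-like object that is costly to construct and not thread-safe. The pattern caches one instance per thread — amortizing construction cost — while never sharing an instance across threads:)

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadConfinedCache {
    // Hypothetical stand-in for a FileSystem-like object: costly to build
    // (think: parsing XML configuration off disk) and not thread-safe.
    static class ExpensiveClient {
        static final AtomicInteger CONSTRUCTIONS = new AtomicInteger();
        ExpensiveClient() { CONSTRUCTIONS.incrementAndGet(); }
    }

    // One instance per thread: cached, but never crosses a thread boundary.
    static final ThreadLocal<ExpensiveClient> CLIENT =
            ThreadLocal.withInitial(ExpensiveClient::new);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            // Repeated calls on the same thread reuse one instance.
            ExpensiveClient a = CLIENT.get();
            ExpensiveClient b = CLIENT.get();
            if (a != b) throw new AssertionError("cache miss within a thread");
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();

        // Two worker threads -> exactly two constructions, no sharing.
        if (ExpensiveClient.CONSTRUCTIONS.get() != 2) {
            throw new AssertionError(
                "expected 2 constructions, got " + ExpensiveClient.CONSTRUCTIONS.get());
        }
        System.out.println("constructions=" + ExpensiveClient.CONSTRUCTIONS.get());
    }
}
```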
> ParquetReader not using FileSystem cache effectively?
> -----------------------------------------------------
>
> Key: PARQUET-328
> URL: https://issues.apache.org/jira/browse/PARQUET-328
> Project: Parquet
> Issue Type: Bug
> Reporter: Tianshuo Deng
>
> We've seen a Spark job stuck with the following trace:
> java.util.HashMap.put(HashMap.java:494)
> org.apache.hadoop.conf.Configuration.set(Configuration.java:1065)
> org.apache.hadoop.conf.Configuration.set(Configuration.java:1035)
> org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.mergeViewFsHdfsMountPoints(HDFSCompatibleViewFileSystem.java:491)
> org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.mergeConfFromDirectory(HDFSCompatibleViewFileSystem.java:413)
> org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.mergeViewFsAndHdfs(HDFSCompatibleViewFileSystem.java:273)
> org.apache.hadoop.fs.viewfs.HDFSCompatibleViewFileSystem.initialize(HDFSCompatibleViewFileSystem.java:190)
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2438)
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2472)
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2454)
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:384)
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
> org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:244)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> org.apache.spark.scheduler.Task.run(Task.scala:64)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)