Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The immediate priority is addressing regressions between these two releases.
On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash <and...@andrewash.com> wrote:
> I'm not sure either of those PRs will fix the concurrent adds to
> Configuration issue I observed. I've got a stack trace and writeup I'll
> share in an hour or two (traveling today).
>
> On Jul 14, 2014 9:50 PM, "scwf" <wangf...@huawei.com> wrote:
>> Hi Cody,
>> I ran into this issue a few days ago and posted a PR for it
>> (https://github.com/apache/spark/pull/1385).
>> It's very strange: if I synchronize on conf it deadlocks, but it is fine
>> when I synchronize on initLocalJobConfFuncOpt.
>>
>>> Here's the entire jstack output.
>>>
>>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>>>
>>> Hey Cody,
>>>
>>> This jstack seems truncated; would you mind giving the entire stack
>>> trace? For the second thread, for instance, we can't see where the
>>> lock is being acquired.
>>>
>>> - Patrick
>>>
>>> On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>>> <cody.koenin...@mediacrossing.com> wrote:
>>> > Hi all, just wanted to give a heads up that we're seeing a reproducible
>>> > deadlock with Spark 1.0.1 with Hadoop 2.3.0-mr1-cdh5.0.2.
>>> >
>>> > If JIRA is a better place for this, apologies in advance. Figured talking
>>> > about it on the mailing list was friendlier than randomly (re)opening
>>> > JIRA tickets.
>>> >
>>> > I know Gary had mentioned some issues with 1.0.1 on the mailing list;
>>> > once we got a thread dump I wanted to follow up.
>>> >
>>> > The thread dump shows the deadlock occurs in the synchronized block of
>>> > code that was changed in HadoopRDD.scala for the SPARK-1097 issue.
>>> >
>>> > Relevant portions of the thread dump are summarized below; we can
>>> > provide the whole dump if it's useful.
>>> >
>>> > Found one Java-level deadlock:
>>> > =============================
>>> > "Executor task launch worker-1":
>>> >   waiting to lock monitor 0x00007f250400c520
>>> >   (object 0x00000000fae7dc30, a org.apache.hadoop.conf.Configuration),
>>> >   which is held by "Executor task launch worker-0"
>>> > "Executor task launch worker-0":
>>> >   waiting to lock monitor 0x00007f2520495620
>>> >   (object 0x00000000faeb4fc8, a java.lang.Class),
>>> >   which is held by "Executor task launch worker-1"
>>> >
>>> > "Executor task launch worker-1":
>>> >   at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>>> >   - waiting to lock <0x00000000fae7dc30> (a org.apache.hadoop.conf.Configuration)
>>> >   at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>>> >   - locked <0x00000000faca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
>>> >   at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>>> >   at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
>>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> >   at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>>> >   at java.lang.Class.newInstance0(Class.java:374)
>>> >   at java.lang.Class.newInstance(Class.java:327)
>>> >   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>>> >   at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>>> >   at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>>> >   - locked <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>> >   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>> >   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>> >   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>> >   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>> >   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>> >   at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>> >   at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>> >
>>> > ...elided...
>>> >
>>> > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>>> > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>>> >   java.lang.Thread.State: BLOCKED (on object monitor)
>>> >   at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>>> >   - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>> >   at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>> >   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>> >   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>> >   at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>> >   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>> >   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>> >   at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>> >   at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>> >   at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>> >   at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>
>> --
>> Best Regards
>> Fei Wang
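Stripped of the Hadoop specifics, the dump shows a classic lock-order inversion: worker-1 holds the FileSystem class lock (0x00000000faeb4fc8) while waiting on the Configuration instance monitor (0x00000000fae7dc30), and worker-0 holds that Configuration monitor while waiting on the FileSystem class lock. The following is a minimal sketch of why a consistent acquisition order removes the cycle; the class name and the two stand-in lock objects are hypothetical, not the real Spark/Hadoop code.

```java
// Hypothetical sketch: two stand-in monitors for the two locks in the dump.
// The deadlock happens when the two threads take them in opposite order;
// here both take them in ONE global order, so no wait cycle can form.
public class LockOrderSketch {
    // Stand-in for the org.apache.hadoop.conf.Configuration instance monitor.
    private static final Object CONF_MONITOR = new Object();
    // Stand-in for the FileSystem class lock taken by loadFileSystems().
    private static final Object FS_CLASS_LOCK = new Object();

    // Every path acquires CONF_MONITOR before FS_CLASS_LOCK, so neither
    // thread can hold one lock while waiting on a holder of the other.
    private static void orderedPath() {
        synchronized (CONF_MONITOR) {
            synchronized (FS_CLASS_LOCK) {
                // ... work that needs both locks, e.g. building a JobConf ...
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker0 = new Thread(LockOrderSketch::orderedPath);
        Thread worker1 = new Thread(LockOrderSketch::orderedPath);
        worker0.start();
        worker1.start();
        worker0.join(5000);
        worker1.join(5000);
        System.out.println(worker0.isAlive() || worker1.isAlive()
                ? "deadlocked" : "completed");
    }
}
```

In the real case the inversion is harder to see because one of the acquisition orders comes from JVM class initialization (`HdfsConfiguration.<clinit>` reached via `ServiceLoader` under the FileSystem class lock), which is why the problem only shows up when two executor workers race through `getJobConf` at the same time.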