better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Hi Spark devs,

I was looking into the memory usage of shuffle, and one annoying thing about
the default compression codec (LZF) is that the implementation we use
allocates buffers pretty generously. In a simple experiment, creating 1000
LZFOutputStreams allocated 198976424 bytes (~190MB). If we have a shuffle
task that uses 10k reducers and 32 threads running concurrently, the memory
used by the LZF streams alone would be ~60GB.

In comparison, Snappy only allocates ~65MB for every 1k SnappyOutputStreams.
However, Snappy's compression ratio is slightly lower than LZF's; in my
experience it leads to a 10 - 20% increase in size. Compression ratio does
matter here because we are sending data across the network.

In future releases we will likely change the shuffle implementation to open
fewer streams. Until that happens, I'm looking for compression codec
implementations that are fast, allocate small buffers, and have a decent
compression ratio.

Does anybody on this list have any suggestions? If not, I will submit a
patch for 1.1 that replaces LZF with Snappy as the default compression
codec to lower memory usage.


allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567
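
For anyone who wants to reproduce a rough version of this locally, here is a
sketch of the experiment (assuming the ning compress-lzf and xerial snappy-java
artifacts are on the classpath; it measures before/after heap rather than
per-allocation-site like the gist, so the numbers will only be approximate):

    import java.io.ByteArrayOutputStream
    import com.ning.compress.lzf.LZFOutputStream
    import org.xerial.snappy.SnappyOutputStream

    object StreamAllocationSketch {
      // Crude used-heap measurement; should roughly show the gap described above.
      def usedHeap(): Long = {
        System.gc()
        Runtime.getRuntime.totalMemory() - Runtime.getRuntime.freeMemory()
      }

      def main(args: Array[String]): Unit = {
        val before = usedHeap()
        val lzf = (1 to 1000).map(_ => new LZFOutputStream(new ByteArrayOutputStream()))
        val afterLzf = usedHeap()
        val snappy = (1 to 1000).map(_ => new SnappyOutputStream(new ByteArrayOutputStream()))
        val afterSnappy = usedHeap()
        println(s"LZF:    ~${(afterLzf - before) / 1024 / 1024} MB for ${lzf.size} streams")
        println(s"Snappy: ~${(afterSnappy - afterLzf) / 1024 / 1024} MB for ${snappy.size} streams")
      }
    }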


Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Patrick Wendell
 1. The first error I met is the different SerializationVersionUID in 
 ExecuterStatus

 I resolved by explicitly declare SerializationVersionUID in 
 ExecuterStatus.scala and recompile branch-0.1-jdbc


I don't think there is a class in Spark named ExecuterStatus (sic) ...
or ExecutorStatus. Is this a class you made?


Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
Ah, sorry, sorry

It's ExecutorState under the deploy package
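
For anyone hitting the same thing, a minimal sketch of the workaround described
above (pinning the serialVersionUID so it doesn't change between builds); the
class here is illustrative, not the actual Spark class (deploy.ExecutorState is
an Enumeration, so the real change may look different):

    // Hypothetical message class used only to illustrate the annotation.
    @SerialVersionUID(1L)
    case class ExecutorStatusMessage(executorId: String, state: String) extends Serializable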

On Monday, July 14, 2014, Patrick Wendell pwend...@gmail.com wrote:

  1. The first error I met is the different SerializationVersionUID in
 ExecuterStatus
 
  I resolved by explicitly declare SerializationVersionUID in
 ExecuterStatus.scala and recompile branch-0.1-jdbc
 

 I don't think there is a class in Spark named ExecuterStatus (sic) ...
 or ExecutorStatus. Is this a class you made?



Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Mridul Muralidharan
We tried a lower block size for LZF, but it barfed all over the place.
Snappy was the way to go for our jobs.


Regards,
Mridul


On Mon, Jul 14, 2014 at 12:31 PM, Reynold Xin r...@databricks.com wrote:
 Hi Spark devs,

 I was looking into the memory usage of shuffle, and one annoying thing about
 the default compression codec (LZF) is that the implementation we use
 allocates buffers pretty generously. In a simple experiment, creating 1000
 LZFOutputStreams allocated 198976424 bytes (~190MB). If we have a shuffle
 task that uses 10k reducers and 32 threads running concurrently, the memory
 used by the LZF streams alone would be ~60GB.

 In comparison, Snappy only allocates ~65MB for every 1k SnappyOutputStreams.
 However, Snappy's compression ratio is slightly lower than LZF's; in my
 experience it leads to a 10 - 20% increase in size. Compression ratio does
 matter here because we are sending data across the network.

 In future releases we will likely change the shuffle implementation to open
 fewer streams. Until that happens, I'm looking for compression codec
 implementations that are fast, allocate small buffers, and have a decent
 compression ratio.

 Does anybody on this list have any suggestions? If not, I will submit a
 patch for 1.1 that replaces LZF with Snappy as the default compression
 codec to lower memory usage.


 allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567


Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Hi all,

I've been evaluating YourKit and would like to profile the heap and CPU usage 
of certain tests from the Spark test suite.  In particular, I'm very interested 
in tracking heap usage by allocation site.  Unfortunately, I get a lot of 
crashes running Spark tests with profiling (and thus allocation-site tracking) 
enabled in YourKit; just using the sampler works fine, but it appears that 
enabling the profiler breaks Utils.getCallSite.

Is there a way to make this combination work?  If not, what are people using to 
understand the memory and CPU behavior of Spark and Spark apps?


thanks,
wb


Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Matei Zaharia
I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and such), 
so maybe there's a problem in YourKit or in your release of the JVM. Otherwise 
I'd suggest increasing the heap size of the unit tests a bit (you can do this 
in the SBT build file). Maybe they are very close to full and profiling pushes 
them over the edge.

Matei
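
For reference, a sketch of what bumping the test heap might look like in an
sbt 0.13-style build definition (the exact place to put this in Spark's
SparkBuild.scala may differ):

    // javaOptions only take effect for forked test JVMs.
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx4g", "-XX:MaxPermSize=512m")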

On Jul 14, 2014, at 9:51 AM, Will Benton wi...@redhat.com wrote:

 Hi all,
 
 I've been evaluating YourKit and would like to profile the heap and CPU usage 
 of certain tests from the Spark test suite.  In particular, I'm very 
 interested in tracking heap usage by allocation site.  Unfortunately, I get a 
 lot of crashes running Spark tests with profiling (and thus allocation-site 
 tracking) enabled in YourKit; just using the sampler works fine, but it 
 appears that enabling the profiler breaks Utils.getCallSite.
 
 Is there a way to make this combination work?  If not, what are people using 
 to understand the memory and CPU behavior of Spark and Spark apps?
 
 
 thanks,
 wb



Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the
logging infrastructure.  However, the dependency should be very small and
thus easy to remove, and I would like catalyst to be usable outside of
Spark.  A pull request to make this possible would be welcome.

Ideally, we'd create some sort of spark common package that has things like
logging.  That way catalyst could depend on that, without pulling in all of
Hadoop, etc.  Maybe others have opinions though, so I'm cc-ing the dev list.
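
As a sketch of what such a shared package might contain (illustrative only, not
an actual Spark module), a minimal slf4j-backed logging trait might be enough
for catalyst's needs:

    import org.slf4j.{Logger, LoggerFactory}

    // A stripped-down Logging trait that a hypothetical spark-common module could expose.
    trait Logging {
      @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)
      protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
      protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
      protected def logError(msg: => String, e: Throwable): Unit = log.error(msg, e)
    }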


On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote:

 Making Catalyst independent of Spark is the goal of Catalyst, but it may
 need time and evolution.
 I noticed that the package org.apache.spark.sql.catalyst.util imports
 org.apache.spark.util.{Utils => SparkUtils},
 so Catalyst has a dependency on Spark core.
 I'm not sure whether it will be replaced by another component independent of
 Spark in a later release.


 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com:

 As per the recent presentation given in Scala days (
 http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it
 was mentioned that Catalyst is independent of Spark. But on inspecting
 pom.xml of sql/catalyst module, it seems it has a dependency on Spark Core.
 Any particular reason for the dependency? I would love to use Catalyst
 outside Spark

 (reposted as previous email bounced. Sorry if this is a duplicate).





Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Cody Koeninger
Hi all, just wanted to give a heads up that we're seeing a reproducible
deadlock with Spark 1.0.1 on Hadoop 2.3.0-mr1-cdh5.0.2.

If jira is a better place for this, apologies in advance - figured talking
about it on the mailing list was friendlier than randomly (re)opening jira
tickets.

I know Gary had mentioned some issues with 1.0.1 on the mailing list; once
we got a thread dump I wanted to follow up.

The thread dump shows the deadlock occurs in the synchronized block of code
that was changed in HadoopRDD.scala for the SPARK-1097 issue.

Relevant portions of the thread dump are summarized below; we can provide
the whole dump if it's useful.

Found one Java-level deadlock:
=============================
Executor task launch worker-1:
  waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, a org.apache.hadoop.conf.Configuration),
  which is held by Executor task launch worker-0
Executor task launch worker-0:
  waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, a java.lang.Class),
  which is held by Executor task launch worker-1


Executor task launch worker-1:
    at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
    - waiting to lock 0xfae7dc30 (a org.apache.hadoop.conf.Configuration)
    at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
    - locked 0xfaca6ff8 (a java.lang.Class for org.apache.hadoop.conf.Configuration)
    at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
    at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
    at java.lang.Class.newInstance0(Class.java:374)
    at java.lang.Class.newInstance(Class.java:327)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
    at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
    - locked 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)


...elided...


Executor task launch worker-0 daemon prio=10 tid=0x01e71800 nid=0x2d97 waiting for monitor entry [0x7f24d2bf1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
    - waiting to lock 0xfaeb4fc8 (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
    at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
    at
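
For readers less used to these dumps, the cycle above is a classic lock-order
inversion: worker-0 holds the Configuration monitor and wants the FileSystem
class lock, while worker-1 holds the FileSystem class lock and wants the
Configuration monitor. A toy, self-contained Scala sketch of that shape (not
Spark code; names are illustrative):

    object LockOrderSketch {
      // Stand-ins for the two monitors in the dump: the shared Configuration
      // instance and the FileSystem class lock.
      private val configLock = new Object
      private val fsClassLock = new Object

      def main(args: Array[String]): Unit = {
        val worker0 = new Thread(new Runnable {
          def run(): Unit = configLock.synchronized {      // like the synchronized JobConf creation
            Thread.sleep(100)
            fsClassLock.synchronized { println("worker-0 done") }  // like FileSystem.loadFileSystems
          }
        })
        val worker1 = new Thread(new Runnable {
          def run(): Unit = fsClassLock.synchronized {     // like FileSystem/HdfsConfiguration class init
            Thread.sleep(100)
            configLock.synchronized { println("worker-1 done") }   // like Configuration.reloadConfiguration
          }
        })
        worker0.start(); worker1.start()
        worker0.join(); worker1.join()   // never returns: each thread waits on the other's lock
      }
    }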

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things.

Matei

On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com wrote:

 Yeah, sadly this dependency was introduced when someone consolidated the 
 logging infrastructure.  However, the dependency should be very small and 
 thus easy to remove, and I would like catalyst to be usable outside of Spark. 
  A pull request to make this possible would be welcome.
 
 Ideally, we'd create some sort of spark common package that has things like 
 logging.  That way catalyst could depend on that, without pulling in all of 
 Hadoop, etc.  Maybe others have opinions though, so I'm cc-ing the dev list.
 
 
 On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote:
 Making Catalyst independent of Spark is the goal of Catalyst, but it may
 need time and evolution.
 I noticed that the package org.apache.spark.sql.catalyst.util imports
 org.apache.spark.util.{Utils => SparkUtils},
 so Catalyst has a dependency on Spark core.
 I'm not sure whether it will be replaced by another component independent of
 Spark in a later release.
 
 
 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com:
 
 As per the recent presentation given in Scala days 
 (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was 
 mentioned that Catalyst is independent of Spark. But on inspecting pom.xml of 
 sql/catalyst module, it seems it has a dependency on Spark Core. Any 
 particular reason for the dependency? I would love to use Catalyst outside 
 Spark
 
 (reposted as previous email bounced. Sorry if this is a duplicate).
 
 



Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Cody,

This Jstack seems truncated, would you mind giving the entire stack
trace? For the second thread, for instance, we can't see where the
lock is being acquired.

- Patrick

On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
cody.koenin...@mediacrossing.com wrote:
 Hi all, just wanted to give a heads up that we're seeing a reproducible
 deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2

 If jira is a better place for this, apologies in advance - figured talking
 about it on the mailing list was friendlier than randomly (re)opening jira
 tickets.

 I know Gary had mentioned some issues with 1.0.1 on the mailing list, once
 we got a thread dump I wanted to follow up.

 The thread dump shows the deadlock occurs in the synchronized block of code
 that was changed in HadoopRDD.scala, for the Spark-1097 issue

 Relevant portions of the thread dump are summarized below, we can provide
 the whole dump if it's useful.


Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Thanks, Matei; I have also had some success with jmap and friends and will 
probably just stick with them!


best,
wb


- Original Message -
 From: Matei Zaharia matei.zaha...@gmail.com
 To: dev@spark.apache.org
 Sent: Monday, July 14, 2014 1:02:04 PM
 Subject: Re: Profiling Spark tests with YourKit (or something else)
 
 I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and
 such), so maybe there's a problem in YourKit or in your release of the JVM.
 Otherwise I'd suggest increasing the heap size of the unit tests a bit (you
 can do this in the SBT build file). Maybe they are very close to full and
 profiling pushes them over the edge.
 
 Matei
 
 On Jul 14, 2014, at 9:51 AM, Will Benton wi...@redhat.com wrote:
 
  Hi all,
  
  I've been evaluating YourKit and would like to profile the heap and CPU
  usage of certain tests from the Spark test suite.  In particular, I'm very
  interested in tracking heap usage by allocation site.  Unfortunately, I
  get a lot of crashes running Spark tests with profiling (and thus
  allocation-site tracking) enabled in YourKit; just using the sampler works
  fine, but it appears that enabling the profiler breaks Utils.getCallSite.
  
  Is there a way to make this combination work?  If not, what are people
  using to understand the memory and CPU behavior of Spark and Spark apps?
  
  
  thanks,
  wb
 
 


Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The full jstack would still be useful, but our current working theory is
that this is due to the fact that Configuration#loadDefaults goes through
every Configuration object that was ever created (via
Configuration.REGISTRY) and locks it, thus introducing a dependency from
new Configuration to old, otherwise unrelated, Configuration objects that
our locking did not anticipate.

I have created https://github.com/apache/spark/pull/1409 to hopefully fix
this bug.
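
For context, a sketch of the general shape such a fix could take (illustrative,
and not necessarily what the PR does): synchronize JobConf creation on a
dedicated JVM-wide lock object rather than on the Configuration itself, so
Spark's locking never interleaves with Hadoop's own monitors:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    object JobConfCreation {
      // Hypothetical JVM-wide lock used only for JobConf construction.
      val CONFIGURATION_INSTANTIATION_LOCK = new Object
    }

    class JobConfCreation(broadcastedConf: Configuration) {
      def getJobConf(): JobConf =
        // Hold only our own lock while Hadoop takes whatever monitors it needs.
        JobConfCreation.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
          new JobConf(broadcastedConf)
        }
    }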


On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Cody,

 This Jstack seems truncated, would you mind giving the entire stack
 trace? For the second thread, for instance, we can't see where the
 lock is being acquired.

 - Patrick

 On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
 cody.koenin...@mediacrossing.com wrote:
  Hi all, just wanted to give a heads up that we're seeing a reproducible
  deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
 
  If jira is a better place for this, apologies in advance - figured
 talking
  about it on the mailing list was friendlier than randomly (re)opening
 jira
  tickets.
 
  I know Gary had mentioned some issues with 1.0.1 on the mailing list,
 once
  we got a thread dump I wanted to follow up.
 
  The thread dump shows the deadlock occurs in the synchronized block of
 code
  that was changed in HadoopRDD.scala, for the Spark-1097 issue
 
  Relevant portions of the thread dump are summarized below, we can provide
  the whole dump if it's useful.
 

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Nishkam Ravi
Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
help. I suspect we will start seeing the ConcurrentModificationException
again. The right fix has gone into Hadoop through HADOOP-10456. Unfortunately,
I don't have any bright ideas on how to synchronize this at the Spark level
without the risk of deadlocks.
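
For readers following along, a toy sketch of the failure mode being discussed
(a plain java.util.HashMap, roughly like the structures backing a Hadoop
Configuration, mutated by one thread while another iterates over it; names
here are illustrative):

    import java.util.{HashMap => JHashMap}

    object ConcurrentModificationSketch {
      def main(args: Array[String]): Unit = {
        val shared = new JHashMap[String, String]()
        (1 to 1000).foreach(i => shared.put("key" + i, "v"))

        val writer = new Thread(new Runnable {
          def run(): Unit = (1 to 100000).foreach(i => shared.put("extra" + i, "v"))
        })
        val reader = new Thread(new Runnable {
          def run(): Unit = {
            val it = shared.entrySet().iterator()
            while (it.hasNext) it.next()  // typically dies with ConcurrentModificationException
          }
        })
        writer.start(); reader.start()
        writer.join(); reader.join()
      }
    }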


On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com wrote:

 The full jstack would still be useful, but our current working theory is
 that this is due to the fact that Configuration#loadDefaults goes through
 every Configuration object that was ever created (via
 Configuration.REGISTRY) and locks it, thus introducing a dependency from
 new Configuration to old, otherwise unrelated, Configuration objects that
 our locking did not anticipate.

 I have created https://github.com/apache/spark/pull/1409 to hopefully fix
 this bug.


 On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Cody,
 
  This Jstack seems truncated, would you mind giving the entire stack
  trace? For the second thread, for instance, we can't see where the
  lock is being acquired.
 
  - Patrick
 
  On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
  cody.koenin...@mediacrossing.com wrote:
   Hi all, just wanted to give a heads up that we're seeing a reproducible
   deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
  
   If jira is a better place for this, apologies in advance - figured
  talking
   about it on the mailing list was friendlier than randomly (re)opening
  jira
   tickets.
  
   I know Gary had mentioned some issues with 1.0.1 on the mailing list,
  once
   we got a thread dump I wanted to follow up.
  
   The thread dump shows the deadlock occurs in the synchronized block of
  code
   that was changed in HadoopRDD.scala, for the Spark-1097 issue
  
   Relevant portions of the thread dump are summarized below, we can
 provide
   the whole dump if it's useful.
  

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Stephen Haberman

Just a comment from the peanut gallery, but these buffers are a real
PITA for us as well. Probably 75% of our non-user-error job failures
are related to them.

Just naively, what about not doing compression on the fly? E.g. during
the shuffle just write straight to disk, uncompressed?

For us, we always have plenty of disk space, and if you're concerned
about network transmission, you could add a separate compress step
after the blocks have been written to disk, but before being sent over
the wire.

Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
see work in this area!

- Stephen



Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Sandy Ryza
Stephen,
Often the shuffle is bound by writes to disk, so even if disks have enough
space to store the uncompressed data, the shuffle can complete faster by
writing less data.

Reynold,
This isn't a big help in the short term, but if we switch to a sort-based
shuffle, we'll only need a single LZFOutputStream per map task.


On Mon, Jul 14, 2014 at 3:30 PM, Stephen Haberman 
stephen.haber...@gmail.com wrote:


 Just a comment from the peanut gallery, but these buffers are a real
 PITA for us as well. Probably 75% of our non-user-error job failures
 are related to them.

 Just naively, what about not doing compression on the fly? E.g. during
 the shuffle just write straight to disk, uncompressed?

 For us, we always have plenty of disk space, and if you're concerned
 about network transmission, you could add a separate compress step
 after the blocks have been written to disk, but before being sent over
 the wire.

 Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
 see work in this area!

 - Stephen




Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Matei Zaharia
You can actually turn off shuffle compression by setting spark.shuffle.compress 
to false. Try that out, there will still be some buffers for the various 
OutputStreams, but they should be smaller.

Matei
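
For completeness, a sketch of setting this programmatically on the SparkConf
(and, alternatively, switching codecs):

    import org.apache.spark.{SparkConf, SparkContext}

    // Turn off shuffle compression entirely, as suggested above...
    val conf = new SparkConf()
      .setAppName("shuffle-compression-example")
      .set("spark.shuffle.compress", "false")
    // ...or keep compression but switch codecs (class name as of the 1.0.x codebase):
    // conf.set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
    val sc = new SparkContext(conf)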

On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com 
wrote:

 
 Just a comment from the peanut gallery, but these buffers are a real
 PITA for us as well. Probably 75% of our non-user-error job failures
 are related to them.
 
 Just naively, what about not doing compression on the fly? E.g. during
 the shuffle just write straight to disk, uncompressed?
 
 For us, we always have plenty of disk space, and if you're concerned
 about network transmission, you could add a separate compress step
 after the blocks have been written to disk, but before being sent over
 the wire.
 
 Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
 see work in this area!
 
 - Stephen
 



Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of
strings when loading / storing data using Parquet and Spark SQL.  Before,
Spark SQL did not support binary data in Parquet, so all binary blobs were
implicitly treated as Strings.  9fe693
https://github.com/apache/spark/commit/9fe693b5b6ed6af34ee1e800ab89c8a11991ea38
fixes this limitation by adding support for binary data.

However, data written out with a prior version of Spark SQL will be missing
the annotation telling us to interpret a given column as a String, so old
string data will now be loaded as binary data.  If you would like to use
the data as a string, you will need to add a CAST to convert the datatype.

New string data written out after this change will correctly be loaded as a
string, since we now include an annotation about the desired type.
Additionally, this should now interoperate correctly with other systems that
write Parquet data (Hive, Thrift, etc.).

Michael
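
For anyone migrating old files, a sketch of the CAST workaround (paths, table
and column names here are hypothetical):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext `sc`
    // Old Parquet data: the `name` column now comes back as binary.
    val oldData = sqlContext.parquetFile("hdfs:///path/to/old-data.parquet")
    oldData.registerAsTable("old_data")   // 1.0.x API; later versions rename this
    // Cast it back to a string explicitly.
    val asStrings = sqlContext.sql("SELECT CAST(name AS STRING) AS name FROM old_data")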


Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Nishkam,

Aaron's fix should prevent two concurrent accesses to getJobConf (and
the Hadoop code therein). But if there is code elsewhere that tries to
mutate the configuration, then I could see how we might still have the
ConcurrentModificationException.

I looked at your patch for HADOOP-10456 and the only example you give
is of the data being accessed inside of getJobConf. Is it accessed
somewhere else too from Spark that you are aware of?

https://issues.apache.org/jira/browse/HADOOP-10456

- Patrick

On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com wrote:
 Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
 help. I suspect we will start seeing the ConcurrentModificationException
 again. The right fix has gone into Hadoop through 10456. Unfortunately, I
 don't have any bright ideas on how to synchronize this at the Spark level
 without the risk of deadlocks.


 On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com wrote:

 The full jstack would still be useful, but our current working theory is
 that this is due to the fact that Configuration#loadDefaults goes through
 every Configuration object that was ever created (via
 Configuration.REGISTRY) and locks it, thus introducing a dependency from
 new Configuration to old, otherwise unrelated, Configuration objects that
 our locking did not anticipate.

 I have created https://github.com/apache/spark/pull/1409 to hopefully fix
 this bug.


 On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Hey Cody,
 
  This Jstack seems truncated, would you mind giving the entire stack
  trace? For the second thread, for instance, we can't see where the
  lock is being acquired.
 
  - Patrick
 
  On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
  cody.koenin...@mediacrossing.com wrote:
   Hi all, just wanted to give a heads up that we're seeing a reproducible
   deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
  
   If jira is a better place for this, apologies in advance - figured
  talking
   about it on the mailing list was friendlier than randomly (re)opening
  jira
   tickets.
  
   I know Gary had mentioned some issues with 1.0.1 on the mailing list,
  once
   we got a thread dump I wanted to follow up.
  
   The thread dump shows the deadlock occurs in the synchronized block of
  code
   that was changed in HadoopRDD.scala, for the Spark-1097 issue
  
   Relevant portions of the thread dump are summarized below, we can
 provide
   the whole dump if it's useful.
  

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We use the Hadoop configuration inside of our code executing on Spark as we
need to list out files in the path.  Maybe that is why it is exposed for us.


On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hey Nishkam,

 Aaron's fix should prevent two concurrent accesses to getJobConf (and
 the Hadoop code therein). But if there is code elsewhere that tries to
 mutate the configuration, then I could see how we might still have the
 ConcurrentModificationException.

 I looked at your patch for HADOOP-10456 and the only example you give
 is of the data being accessed inside of getJobConf. Is it accessed
 somewhere else too from Spark that you are aware of?

 https://issues.apache.org/jira/browse/HADOOP-10456

 - Patrick

 On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com wrote:
  Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
  help. I suspect we will start seeing the ConcurrentModificationException
  again. The right fix has gone into Hadoop through 10456. Unfortunately, I
  don't have any bright ideas on how to synchronize this at the Spark level
  without the risk of deadlocks.
 
 
  On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com
 wrote:
 
  The full jstack would still be useful, but our current working theory is
  that this is due to the fact that Configuration#loadDefaults goes
 through
  every Configuration object that was ever created (via
  Configuration.REGISTRY) and locks it, thus introducing a dependency from
  new Configuration to old, otherwise unrelated, Configuration objects
 that
  our locking did not anticipate.
 
  I have created https://github.com/apache/spark/pull/1409 to hopefully
 fix
  this bug.
 
 
  On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey Cody,
  
   This Jstack seems truncated, would you mind giving the entire stack
   trace? For the second thread, for instance, we can't see where the
   lock is being acquired.
  
   - Patrick
  
   On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
   cody.koenin...@mediacrossing.com wrote:
Hi all, just wanted to give a heads up that we're seeing a
 reproducible
deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
   
If jira is a better place for this, apologies in advance - figured
   talking
about it on the mailing list was friendlier than randomly
 (re)opening
   jira
tickets.
   
I know Gary had mentioned some issues with 1.0.1 on the mailing
 list,
   once
we got a thread dump I wanted to follow up.
   
The thread dump shows the deadlock occurs in the synchronized block
 of
   code
that was changed in HadoopRDD.scala, for the Spark-1097 issue
   
Relevant portions of the thread dump are summarized below, we can
  provide
the whole dump if it's useful.
   

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Reynold Xin
Copying Jon here since he worked on the lzf library at Ning.

Jon - any comments on this topic?


On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 You can actually turn off shuffle compression by setting
 spark.shuffle.compress to false. Try that out, there will still be some
 buffers for the various OutputStreams, but they should be smaller.

 Matei

 On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com
 wrote:

 
  Just a comment from the peanut gallery, but these buffers are a real
  PITA for us as well. Probably 75% of our non-user-error job failures
  are related to them.
 
  Just naively, what about not doing compression on the fly? E.g. during
  the shuffle just write straight to disk, uncompressed?
 
  For us, we always have plenty of disk space, and if you're concerned
  about network transmission, you could add a separate compress step
  after the blocks have been written to disk, but before being sent over
  the wire.
 
  Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
  see work in this area!
 
  - Stephen
 




Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Gary Malouf
We'll try to run a build tomorrow AM.


On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com wrote:

 Andrew and Gary,

 Would you guys be able to test
 https://github.com/apache/spark/pull/1409/files and see if it solves
 your problem?

 - Patrick

 On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash and...@andrewash.com wrote:
  I observed a deadlock here when using the AvroInputFormat as well. The
  short of the issue is that there's one configuration object per JVM, but
  multiple threads, one for each task. If each thread attempts to add a
  configuration option to the Configuration object at once you get issues
  because HashMap isn't thread safe.
 
  More details to come tonight. Thanks!
  On Jul 14, 2014 4:11 PM, Nishkam Ravi nr...@cloudera.com wrote:
 
  HI Patrick, I'm not aware of another place where the access happens, but
  it's possible that it does. The original fix synchronized on the
  broadcastConf object and someone reported the same exception.
 
 
  On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey Nishkam,
  
   Aaron's fix should prevent two concurrent accesses to getJobConf (and
   the Hadoop code therein). But if there is code elsewhere that tries to
   mutate the configuration, then I could see how we might still have the
   ConcurrentModificationException.
  
   I looked at your patch for HADOOP-10456 and the only example you give
   is of the data being accessed inside of getJobConf. Is it accessed
   somewhere else too from Spark that you are aware of?
  
   https://issues.apache.org/jira/browse/HADOOP-10456
  
   - Patrick
  
   On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com
  wrote:
Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object
  would
help. I suspect we will start seeing the
  ConcurrentModificationException
again. The right fix has gone into Hadoop through 10456.
  Unfortunately, I
don't have any bright ideas on how to synchronize this at the Spark
  level
without the risk of deadlocks.
   
   
On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson ilike...@gmail.com
 
   wrote:
   
The full jstack would still be useful, but our current working
 theory
  is
that this is due to the fact that Configuration#loadDefaults goes
   through
every Configuration object that was ever created (via
Configuration.REGISTRY) and locks it, thus introducing a dependency
  from
new Configuration to old, otherwise unrelated, Configuration
 objects
   that
our locking did not anticipate.
   
I have created https://github.com/apache/spark/pull/1409 to
 hopefully
   fix
this bug.
   
   
On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell 
 pwend...@gmail.com
wrote:
   
 Hey Cody,

 This Jstack seems truncated, would you mind giving the entire
 stack
 trace? For the second thread, for instance, we can't see where
 the
 lock is being acquired.

 - Patrick

 On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
 cody.koenin...@mediacrossing.com wrote:
  Hi all, just wanted to give a heads up that we're seeing a
   reproducible
  deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
 
  If jira is a better place for this, apologies in advance -
 figured
 talking
  about it on the mailing list was friendlier than randomly
   (re)opening
 jira
  tickets.
 
  I know Gary had mentioned some issues with 1.0.1 on the mailing
   list,
 once
  we got a thread dump I wanted to follow up.
 
  The thread dump shows the deadlock occurs in the synchronized
  block
   of
 code
  that was changed in HadoopRDD.scala, for the Spark-1097 issue
 
  Relevant portions of the thread dump are summarized below, we
 can
provide
  the whole dump if it's useful.
 

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The patch won't solve the problem where two threads try to add a
configuration option at the same time, but I think there is currently an
issue where two threads can try to initialize the Configuration at the same
time and still run into a ConcurrentModificationException. This at least
reduces (slightly) the scope of the exception, although eliminating it may
not be possible.


On Mon, Jul 14, 2014 at 4:35 PM, Gary Malouf malouf.g...@gmail.com wrote:

 We'll try to run a build tomorrow AM.


 On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell pwend...@gmail.com
 wrote:

  Andrew and Gary,
 
  Would you guys be able to test
  https://github.com/apache/spark/pull/1409/files and see if it solves
  your problem?
 
  - Patrick
 
  On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash and...@andrewash.com
 wrote:
   I observed a deadlock here when using the AvroInputFormat as well. The
   short of the issue is that there's one configuration object per JVM,
 but
   multiple threads, one for each task. If each thread attempts to add a
   configuration option to the Configuration object at once you get issues
   because HashMap isn't thread safe.
  
   More details to come tonight. Thanks!
   On Jul 14, 2014 4:11 PM, Nishkam Ravi nr...@cloudera.com wrote:
  
   HI Patrick, I'm not aware of another place where the access happens,
 but
   it's possible that it does. The original fix synchronized on the
   broadcastConf object and someone reported the same exception.
  
  
   On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell pwend...@gmail.com
   wrote:
  
Hey Nishkam,
   
Aaron's fix should prevent two concurrent accesses to getJobConf
 (and
the Hadoop code therein). But if there is code elsewhere that tries
 to
mutate the configuration, then I could see how we might still have
 the
ConcurrentModificationException.
   
I looked at your patch for HADOOP-10456 and the only example you
 give
is of the data being accessed inside of getJobConf. Is it accessed
somewhere else too from Spark that you are aware of?
   
https://issues.apache.org/jira/browse/HADOOP-10456
   
- Patrick
   
On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi nr...@cloudera.com
   wrote:
 Hi Aaron, I'm not sure if synchronizing on an arbitrary lock
 object
   would
 help. I suspect we will start seeing the
   ConcurrentModificationException
 again. The right fix has gone into Hadoop through 10456.
   Unfortunately, I
 don't have any bright ideas on how to synchronize this at the
 Spark
   level
 without the risk of deadlocks.


 On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson 
 ilike...@gmail.com
  
wrote:

 The full jstack would still be useful, but our current working
  theory
   is
 that this is due to the fact that Configuration#loadDefaults goes
through
 every Configuration object that was ever created (via
 Configuration.REGISTRY) and locks it, thus introducing a
 dependency
   from
 new Configuration to old, otherwise unrelated, Configuration
  objects
that
 our locking did not anticipate.

 I have created https://github.com/apache/spark/pull/1409 to
  hopefully
fix
 this bug.


 On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell 
  pwend...@gmail.com
 wrote:

  Hey Cody,
 
  This Jstack seems truncated, would you mind giving the entire
  stack
  trace? For the second thread, for instance, we can't see where
  the
  lock is being acquired.
 
  - Patrick
 
  On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
  cody.koenin...@mediacrossing.com wrote:
   Hi all, just wanted to give a heads up that we're seeing a
reproducible
   deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
  
   If jira is a better place for this, apologies in advance -
  figured
  talking
   about it on the mailing list was friendlier than randomly
(re)opening
  jira
   tickets.
  
   I know Gary had mentioned some issues with 1.0.1 on the
 mailing
list,
  once
   we got a thread dump I wanted to follow up.
  
   The thread dump shows the deadlock occurs in the synchronized
   block
of
  code
   that was changed in HadoopRDD.scala, for the Spark-1097 issue
  
   Relevant portions of the thread dump are summarized below, we
  can
 provide
   the whole dump if it's useful.
  

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Davies Liu
Maybe we could try LZ4 [1], which has better performance and a smaller footprint
than LZF and Snappy. In fast scan mode, its performance is 1.5 - 2x higher than
LZF's [2], while the memory used is about 10x smaller than LZF's (16k vs 190k).

[1] https://github.com/jpountz/lz4-java
[2] 
http://ning.github.io/jvm-compressor-benchmark/results/calgary/roundtrip-2013-06-06/index.html
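
For a concrete sense of the API, a sketch of wrapping an output stream with
lz4-java's block stream (assuming the net.jpountz.lz4 artifact is on the
classpath); the block size is what bounds the per-stream buffer:

    import java.io.{ByteArrayOutputStream, OutputStream}
    import net.jpountz.lz4.LZ4BlockOutputStream

    // Wrap a sink with LZ4 using a 32 KB block, keeping per-stream buffers small.
    def lz4Wrap(out: OutputStream): OutputStream = new LZ4BlockOutputStream(out, 32 * 1024)

    val sink = new ByteArrayOutputStream()
    val compressed = lz4Wrap(sink)
    compressed.write("hello shuffle".getBytes("UTF-8"))
    compressed.close()
    println(s"compressed size: ${sink.size()} bytes")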


On Mon, Jul 14, 2014 at 12:01 AM, Reynold Xin r...@databricks.com wrote:

 Hi Spark devs,

 I was looking into the memory usage of shuffle, and one annoying thing about
 the default compression codec (LZF) is that the implementation we use
 allocates buffers pretty generously. In a simple experiment, creating 1000
 LZFOutputStreams allocated 198976424 bytes (~190MB). If we have a shuffle
 task that uses 10k reducers and 32 threads running concurrently, the memory
 used by the LZF streams alone would be ~60GB.

 In comparison, Snappy only allocates ~65MB for every 1k SnappyOutputStreams.
 However, Snappy's compression ratio is slightly lower than LZF's; in my
 experience it leads to a 10 - 20% increase in size. Compression ratio does
 matter here because we are sending data across the network.

 In future releases we will likely change the shuffle implementation to open
 fewer streams. Until that happens, I'm looking for compression codec
 implementations that are fast, allocate small buffers, and have a decent
 compression ratio.

 Does anybody on this list have any suggestions? If not, I will submit a
 patch for 1.1 that replaces LZF with Snappy as the default compression
 codec to lower memory usage.


 allocation data here: https://gist.github.com/rxin/ad7217ea60e3fb36c567


SBT gen-idea doesn't work well after merging SPARK-1776

2014-07-14 Thread DB Tsai
I have a clean clone of the Spark master repository, and I generated the
IntelliJ project file with sbt gen-idea as usual. There are two issues
we have after merging SPARK-1776 (read dependencies from Maven).

1) After SPARK-1776, sbt gen-idea downloads the dependencies from the
internet even when those jars are in the local cache. Before the merge,
a second run of gen-idea would not download anything and would just use
the jars in the cache.

2) Tests that use a local Spark context can not be run in IntelliJ;
they fail with the exception shown below.

The current workaround we have is to check out a snapshot from before the
merge, run gen-idea, and then switch back to current master. But this will
not work once master deviates too much from the latest working snapshot.
[ERROR] [07/14/2014 16:27:49.967] [ScalaTest-run] [Remoting] Remoting error: [Startup timed out] [
akka.remote.RemoteTransportException: Startup timed out
at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
at akka.remote.Remoting.start(Remoting.scala:191)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:104)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:153)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:202)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:117)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:132)
at org.apache.spark.mllib.util.LocalSparkContext$class.beforeAll(LocalSparkContext.scala:29)
at org.apache.spark.mllib.optimization.LBFGSSuite.beforeAll(LBFGSSuite.scala:27)
at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
at org.apache.spark.mllib.optimization.LBFGSSuite.beforeAll(LBFGSSuite.scala:27)
at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
at org.apache.spark.mllib.optimization.LBFGSSuite.run(LBFGSSuite.scala:27)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563)
at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044)
at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1043)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:2722)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1043)
at org.scalatest.tools.Runner$.run(Runner.scala:883)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:141)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:173)
... 35 more
]

An exception or error caused a run to abort: Futures timed out after [1 milliseconds]
java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:173)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
at

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Jon Hartlaub
Is the held memory due to just instantiating the LZFOutputStream?  If so,
I'm surprised and I consider that a bug.

I suspect the held memory may be due to a SoftReference - memory will be
released with enough memory pressure.

Finally, is it necessary to keep 1000 (or more) decoders active?  Would it
be possible to keep an object pool of encoders and check them in and out as
needed?  I admit I have not done much homework to determine if this is
viable.
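
For what it's worth, a generic pool along those lines is only a few lines of
code. A minimal sketch (illustrative only: Spark does not currently pool its
compression streams, and the encoder type in the usage comment is
hypothetical):

import java.util.concurrent.ConcurrentLinkedQueue

// Borrow an idle instance if one exists, otherwise create a new one;
// return it to the pool when the caller is done with it.
class ObjectPool[T](create: () => T) {
  private val idle = new ConcurrentLinkedQueue[T]()

  def borrow(): T = Option(idle.poll()).getOrElse(create())

  def release(obj: T): Unit = idle.offer(obj)
}

// Hypothetical usage: pool the expensive encoder objects (and their buffers)
// rather than the streams themselves, e.g.
//   val encoders = new ObjectPool(() => createLzfEncoder())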

-Jon


On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin r...@databricks.com wrote:

 Copying Jon here since he worked on the lzf library at Ning.

 Jon - any comments on this topic?


 On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 You can actually turn off shuffle compression by setting
 spark.shuffle.compress to false. Try that out, there will still be some
 buffers for the various OutputStreams, but they should be smaller.

 Matei

 On Jul 14, 2014, at 3:30 PM, Stephen Haberman stephen.haber...@gmail.com
 wrote:

 
  Just a comment from the peanut gallery, but these buffers are a real
  PITA for us as well. Probably 75% of our non-user-error job failures
  are related to them.
 
  Just naively, what about not doing compression on the fly? E.g. during
  the shuffle just write straight to disk, uncompressed?
 
  For us, we always have plenty of disk space, and if you're concerned
  about network transmission, you could add a separate compress step
  after the blocks have been written to disk, but before being sent over
  the wire.
 
  Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
  see work in this area!
 
  - Stephen
 





Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Aaron Davidson
One of the core problems here is the number of open streams we have, which
is (# cores * # reduce partitions), which can easily climb into the tens of
thousands for large jobs. This is a more general problem that we are
planning on fixing for our largest shuffles, as even moderate buffer sizes
can explode to use huge amounts of memory at that scale.
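
To put rough numbers on that, a back-of-the-envelope sketch (the executor
shape and per-stream buffer size are illustrative round numbers, not
measurements; spark.shuffle.compress is the setting quoted from Matei's reply
below):

import org.apache.spark.SparkConf

object ShuffleBufferEstimate {
  def main(args: Array[String]): Unit = {
    val coresPerExecutor = 16L        // concurrently running tasks
    val reducePartitions = 10000L     // open streams per running task
    val bufferBytesPerStream = 64L * 1024
    val openStreams = coresPerExecutor * reducePartitions
    val gb = (openStreams * bufferBytesPerStream).toDouble / (1024L * 1024 * 1024)
    println(f"$openStreams%,d open streams -> ~$gb%.1f GB of buffers per executor")

    // Escape hatch in the meantime: skip compressing shuffle outputs entirely,
    // trading larger shuffle files for no per-stream compression buffers.
    val conf = new SparkConf().set("spark.shuffle.compress", "false")
    println("spark.shuffle.compress = " + conf.get("spark.shuffle.compress"))
  }
}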


On Mon, Jul 14, 2014 at 4:53 PM, Jon Hartlaub jhartl...@gmail.com wrote:

 Is the held memory due to just instantiating the LZFOutputStream?  If so,
 I'm surprised and I consider that a bug.

 I suspect the held memory may be due to a SoftReference - memory will be
 released with enough memory pressure.

 Finally, is it necessary to keep 1000 (or more) decoders active?  Would it
 be possible to keep an object pool of encoders and check them in and out as
 needed?  I admit I have not done much homework to determine if this is
 viable.

 -Jon


 On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin r...@databricks.com wrote:

  Copying Jon here since he worked on the lzf library at Ning.
 
  Jon - any comments on this topic?
 
 
  On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia matei.zaha...@gmail.com
  wrote:
 
  You can actually turn off shuffle compression by setting
  spark.shuffle.compress to false. Try that out, there will still be some
  buffers for the various OutputStreams, but they should be smaller.
 
  Matei
 
  On Jul 14, 2014, at 3:30 PM, Stephen Haberman 
 stephen.haber...@gmail.com
  wrote:
 
  
   Just a comment from the peanut gallery, but these buffers are a real
   PITA for us as well. Probably 75% of our non-user-error job failures
   are related to them.
  
   Just naively, what about not doing compression on the fly? E.g. during
   the shuffle just write straight to disk, uncompressed?
  
   For us, we always have plenty of disk space, and if you're concerned
   about network transmission, you could add a separate compress step
   after the blocks have been written to disk, but before being sent over
   the wire.
  
   Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
   see work in this area!
  
   - Stephen
  
 
 
 



Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
- Original Message -
 From: Aaron Davidson ilike...@gmail.com
 To: dev@spark.apache.org
 Sent: Monday, July 14, 2014 5:21:10 PM
 Subject: Re: Profiling Spark tests with YourKit (or something else)
 
 Out of curiosity, what problems are you seeing with Utils.getCallSite?

Aaron, if I enable call site tracking or CPU profiling in YourKit, many (but 
not all) Spark test cases will NPE on the line filtering out getStackTrace 
from the stack trace (this is Utils.scala:812 in the current master).  I'm not 
sure if this is a consequence of Thread#getStackTrace including bogus frames 
when running instrumented or if whatever instrumentation YourKit inserts relies 
on assumptions that don't always hold for Scala code.
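
One way to make that line resilient is to treat the stack trace defensively
instead of assuming well-formed frames. A sketch of the idea (not an actual
patch):

object CallSiteSafety {
  // Drop null frames and null method names (which instrumented JVMs can
  // apparently produce) before filtering out the getStackTrace frame itself.
  def safeTrace(): Array[StackTraceElement] = {
    val raw = Option(Thread.currentThread.getStackTrace)
      .getOrElse(Array.empty[StackTraceElement])
    raw.filter { el =>
      el != null && el.getMethodName != null &&
        !el.getMethodName.contains("getStackTrace")
    }
  }
}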


best,
wb


Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Aaron Davidson
Would you mind filing a JIRA for this? That does sound like something bogus
happening on the JVM/YourKit level, but this sort of diagnosis is
sufficiently important that we should be resilient against it.


On Mon, Jul 14, 2014 at 6:01 PM, Will Benton wi...@redhat.com wrote:

 - Original Message -
  From: Aaron Davidson ilike...@gmail.com
  To: dev@spark.apache.org
  Sent: Monday, July 14, 2014 5:21:10 PM
  Subject: Re: Profiling Spark tests with YourKit (or something else)
 
  Out of curiosity, what problems are you seeing with Utils.getCallSite?

 Aaron, if I enable call site tracking or CPU profiling in YourKit, many
 (but not all) Spark test cases will NPE on the line filtering out
 getStackTrace from the stack trace (this is Utils.scala:812 in the
 current master).  I'm not sure if this is a consequence of
 Thread#getStackTrace including bogus frames when running instrumented or if
 whatever instrumentation YourKit inserts relies on assumptions that don't
 always hold for Scala code.


 best,
 wb



ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Just launched an EC2 cluster from git hash
9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
accessing data in S3 yields the following error output.

I understand that NoClassDefFoundError errors may mean something in the
deployment was messed up. Is that correct? When I launch a cluster using
spark-ec2, I expect all critical deployment details to be taken care of by
the script.

So is something in the deployment executed by spark-ec2 borked?

Nick

java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
at 
org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
at 
org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
at 
org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
at 
org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
at 
org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:185)
at 
org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
at org.apache.spark.rdd.RDD.take(RDD.scala:1036)
at $iwC$$iwC$$iwC$$iwC.init(console:26)
at $iwC$$iwC$$iwC.init(console:31)
at $iwC$$iwC.init(console:33)
at $iwC.init(console:35)
at init(console:37)
at .init(console:41)
at .clinit(console)
at .init(console:7)
at .clinit(console)
at $print(console)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread scwf

Hi Cody,
  I met this issue a few days ago and posted a PR for it
(https://github.com/apache/spark/pull/1385).
It's very strange that if I synchronize on conf it will deadlock, but it is OK
when I synchronize on initLocalJobConfFuncOpt.
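
For reference, the synchronized block in question guards copying a shared
Hadoop Configuration into a per-task JobConf. The general shape of the
"one JVM-wide lock around the copy" approach looks roughly like this
(illustrative names, not the exact change in the PR):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

object HadoopConfUtil {
  // Single global lock: Configuration's copy constructor reads the shared
  // conf, so the copy must not race with threads mutating or reloading it.
  private val lock = new Object

  def cloneJobConf(shared: Configuration): JobConf = lock.synchronized {
    new JobConf(shared)
  }
}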



Here's the entire jstack output.


On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Cody,

This Jstack seems truncated, would you mind giving the entire stack
trace? For the second thread, for instance, we can't see where the
lock is being acquired.

- Patrick

On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
cody.koenin...@mediacrossing.com wrote:
  Hi all, just wanted to give a heads up that we're seeing a reproducible
  deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
 
  If jira is a better place for this, apologies in advance - figured 
talking
  about it on the mailing list was friendlier than randomly (re)opening 
jira
  tickets.
 
  I know Gary had mentioned some issues with 1.0.1 on the mailing list, 
once
  we got a thread dump I wanted to follow up.
 
  The thread dump shows the deadlock occurs in the synchronized block of 
code
  that was changed in HadoopRDD.scala, for the Spark-1097 issue
 
  Relevant portions of the thread dump are summarized below, we can provide
  the whole dump if it's useful.
 
  Found one Java-level deadlock:
  =
  Executor task launch worker-1:
waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30, 
a
  org.apache.hadoop.conf.Configuration),
which is held by Executor task launch worker-0
  Executor task launch worker-0:
waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8, 
a
  java.lang.Class),
which is held by Executor task launch worker-1
 
 
  Executor task launch worker-1:
  at
  
org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
  - waiting to lock 0xfae7dc30 (a
  org.apache.hadoop.conf.Configuration)
  at
  
org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
  - locked 0xfaca6ff8 (a java.lang.Class for
  org.apache.hadoop.conf.Configurati
  on)
  at
  
org.apache.hadoop.hdfs.HdfsConfiguration.clinit(HdfsConfiguration.java:34)
  at
  
org.apache.hadoop.hdfs.DistributedFileSystem.clinit(DistributedFileSystem.java:110
  )
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
  Method)
  at
  
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
  java:57)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
  Method)
  at
  
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
  java:57)
  at
  
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
  sorImpl.java:45)
  at 
java.lang.reflect.Constructor.newInstance(Constructor.java:525)
  at java.lang.Class.newInstance0(Class.java:374)
  at java.lang.Class.newInstance(Class.java:327)
  at 
java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
  at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
  at
  org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
  - locked 0xfaeb4fc8 (a java.lang.Class for
  org.apache.hadoop.fs.FileSystem)
  at
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
  at
  org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
  at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
  at
  org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
  at
  
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
  at
  
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
  at
  org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
  at
  org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
  at
  
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
 
 
 
   

Re: how to run the program compiled with spark 1.0.0 in the branch-0.1-jdbc cluster

2014-07-14 Thread Nan Zhu
I resolved the issue by setting up an internal Maven repository to hold the
Spark 1.0.1 jar compiled from branch-0.1-jdbc, and replacing the dependency
on the central repository with our own repository.

I believe there should be a more lightweight way.
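
In sbt terms, pointing an application build at an internal repository looks
roughly like the following (the repository URL and version string are
placeholders, not real coordinates):

// build.sbt
resolvers += "internal-releases" at "http://repo.example.internal/releases"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1-branch-0.1-jdbc"

A lighter-weight route may be to run sbt publishLocal (or mvn install) on the
custom branch, so the application build resolves the jar from the local
Ivy/Maven cache instead of a hosted repository.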

Best, 

-- 
Nan Zhu


On Monday, July 14, 2014 at 6:36 AM, Nan Zhu wrote:

 Ah, sorry, sorry
 
 It's executorState under deploy package
 
 On Monday, July 14, 2014, Patrick Wendell pwend...@gmail.com 
 (mailto:pwend...@gmail.com) wrote:
   1. The first error I met is the different SerializationVersionUID in 
   ExecuterStatus
  
   I resolved by explicitly declare SerializationVersionUID in 
   ExecuterStatus.scala and recompile branch-0.1-jdbc
  
  
  I don't think there is a class in Spark named ExecuterStatus (sic) ...
  or ExecutorStatus. Is this a class you made?



Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Will Benton
Sure thing:

   https://issues.apache.org/jira/browse/SPARK-2486
   https://github.com/apache/spark/pull/1413

best,
wb


- Original Message -
 From: Aaron Davidson ilike...@gmail.com
 To: dev@spark.apache.org
 Sent: Monday, July 14, 2014 8:38:16 PM
 Subject: Re: Profiling Spark tests with YourKit (or something else)
 
 Would you mind filing a JIRA for this? That does sound like something bogus
 happening on the JVM/YourKit level, but this sort of diagnosis is
 sufficiently important that we should be resilient against it.
 
 
 On Mon, Jul 14, 2014 at 6:01 PM, Will Benton wi...@redhat.com wrote:
 
  - Original Message -
   From: Aaron Davidson ilike...@gmail.com
   To: dev@spark.apache.org
   Sent: Monday, July 14, 2014 5:21:10 PM
   Subject: Re: Profiling Spark tests with YourKit (or something else)
  
   Out of curiosity, what problems are you seeing with Utils.getCallSite?
 
  Aaron, if I enable call site tracking or CPU profiling in YourKit, many
  (but not all) Spark test cases will NPE on the line filtering out
  getStackTrace from the stack trace (this is Utils.scala:812 in the
  current master).  I'm not sure if this is a consequence of
  Thread#getStackTrace including bogus frames when running instrumented or if
  whatever instrumentation YourKit inserts relies on assumptions that don't
  always hold for Scala code.
 
 
  best,
  wb
 
 


Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I'm not sure either of those PRs will fix the concurrent adds to
Configuration issue I observed. I've got a stack trace and writeup I'll
share in an hour or two (traveling today).
On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote:

 hi,Cody
   i met this issue days before and i post a PR for this(
 https://github.com/apache/spark/pull/1385)
 it's very strange that if i synchronize conf it will deadlock but it is ok
 when synchronize initLocalJobConfFuncOpt


  Here's the entire jstack output.


 On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:

 Hey Cody,

 This Jstack seems truncated, would you mind giving the entire stack
 trace? For the second thread, for instance, we can't see where the
 lock is being acquired.

 - Patrick

 On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
 cody.koenin...@mediacrossing.com mailto:cody.koeninger@
 mediacrossing.com wrote:
   Hi all, just wanted to give a heads up that we're seeing a
 reproducible
   deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
  
   If jira is a better place for this, apologies in advance - figured
 talking
   about it on the mailing list was friendlier than randomly
 (re)opening jira
   tickets.
  
   I know Gary had mentioned some issues with 1.0.1 on the mailing
 list, once
   we got a thread dump I wanted to follow up.
  
   The thread dump shows the deadlock occurs in the synchronized
 block of code
   that was changed in HadoopRDD.scala, for the Spark-1097 issue
  
   Relevant portions of the thread dump are summarized below, we can
 provide
   the whole dump if it's useful.
  
   Found one Java-level deadlock:
   =
   Executor task launch worker-1:
 waiting to lock monitor 0x7f250400c520 (object
 0xfae7dc30, a
   org.apache.hadoop.co http://org.apache.hadoop.co
   nf.Configuration),
 which is held by Executor task launch worker-0
   Executor task launch worker-0:
 waiting to lock monitor 0x7f2520495620 (object
 0xfaeb4fc8, a
   java.lang.Class),
 which is held by Executor task launch worker-1
  
  
   Executor task launch worker-1:
   at
   org.apache.hadoop.conf.Configuration.reloadConfiguration(
 Configuration.java:791)
   - waiting to lock 0xfae7dc30 (a
   org.apache.hadoop.conf.Configuration)
   at
   org.apache.hadoop.conf.Configuration.addDefaultResource(
 Configuration.java:690)
   - locked 0xfaca6ff8 (a java.lang.Class for
   org.apache.hadoop.conf.Configurati
   on)
   at
   org.apache.hadoop.hdfs.HdfsConfiguration.clinit(
 HdfsConfiguration.java:34)
   at
   org.apache.hadoop.hdfs.DistributedFileSystem.clinit
 (DistributedFileSystem.java:110
   )
   at sun.reflect.NativeConstructorAccessorImpl.
 newInstance0(Native
   Method)
   at
   sun.reflect.NativeConstructorAccessorImpl.newInstance(
 NativeConstructorAccessorImpl.
   java:57)
   at sun.reflect.NativeConstructorAccessorImpl.
 newInstance0(Native
   Method)
   at
   sun.reflect.NativeConstructorAccessorImpl.newInstance(
 NativeConstructorAccessorImpl.
   java:57)
   at
   sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
 DelegatingConstructorAcces
   sorImpl.java:45)
   at java.lang.reflect.Constructor.
 newInstance(Constructor.java:525)
   at java.lang.Class.newInstance0(Class.java:374)
   at java.lang.Class.newInstance(Class.java:327)
   at java.util.ServiceLoader$LazyIterator.next(
 ServiceLoader.java:373)
   at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
   at
   org.apache.hadoop.fs.FileSystem.loadFileSystems(
 FileSystem.java:2364)
   - locked 0xfaeb4fc8 (a java.lang.Class for
   org.apache.hadoop.fs.FileSystem)
   at
   org.apache.hadoop.fs.FileSystem.getFileSystemClass(
 FileSystem.java:2375)
   at
   org.apache.hadoop.fs.FileSystem.createFileSystem(
 FileSystem.java:2392)
   at org.apache.hadoop.fs.FileSystem.access$200(
 FileSystem.java:89)
   at
   org.apache.hadoop.fs.FileSystem$Cache.getInternal(
 FileSystem.java:2431)
   at org.apache.hadoop.fs.FileSystem$Cache.get(
 FileSystem.java:2413)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.
 java:368)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.
 java:167)
   at
   org.apache.hadoop.mapred.JobConf.getWorkingDirectory(
 JobConf.java:587)
   at
   org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
 FileInputFormat.java:315)
   at
   

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The
immediate priority is addressing regressions between these two
releases.

On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote:
 I'm not sure either of those PRs will fix the concurrent adds to
 Configuration issue I observed. I've got a stack trace and writeup I'll
 share in an hour or two (traveling today).
 On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote:

 hi,Cody
   i met this issue days before and i post a PR for this(
 https://github.com/apache/spark/pull/1385)
 it's very strange that if i synchronize conf it will deadlock but it is ok
 when synchronize initLocalJobConfFuncOpt


  Here's the entire jstack output.


 On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:

 Hey Cody,

 This Jstack seems truncated, would you mind giving the entire stack
 trace? For the second thread, for instance, we can't see where the
 lock is being acquired.

 - Patrick

 On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
 cody.koenin...@mediacrossing.com mailto:cody.koeninger@
 mediacrossing.com wrote:
   Hi all, just wanted to give a heads up that we're seeing a
 reproducible
   deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
  
   If jira is a better place for this, apologies in advance - figured
 talking
   about it on the mailing list was friendlier than randomly
 (re)opening jira
   tickets.
  
   I know Gary had mentioned some issues with 1.0.1 on the mailing
 list, once
   we got a thread dump I wanted to follow up.
  
   The thread dump shows the deadlock occurs in the synchronized
 block of code
   that was changed in HadoopRDD.scala, for the Spark-1097 issue
  
   Relevant portions of the thread dump are summarized below, we can
 provide
   the whole dump if it's useful.
  
   Found one Java-level deadlock:
   =
   Executor task launch worker-1:
 waiting to lock monitor 0x7f250400c520 (object
 0xfae7dc30, a
   org.apache.hadoop.co http://org.apache.hadoop.co
   nf.Configuration),
 which is held by Executor task launch worker-0
   Executor task launch worker-0:
 waiting to lock monitor 0x7f2520495620 (object
 0xfaeb4fc8, a
   java.lang.Class),
 which is held by Executor task launch worker-1
  
  
   Executor task launch worker-1:
   at
   org.apache.hadoop.conf.Configuration.reloadConfiguration(
 Configuration.java:791)
   - waiting to lock 0xfae7dc30 (a
   org.apache.hadoop.conf.Configuration)
   at
   org.apache.hadoop.conf.Configuration.addDefaultResource(
 Configuration.java:690)
   - locked 0xfaca6ff8 (a java.lang.Class for
   org.apache.hadoop.conf.Configurati
   on)
   at
   org.apache.hadoop.hdfs.HdfsConfiguration.clinit(
 HdfsConfiguration.java:34)
   at
   org.apache.hadoop.hdfs.DistributedFileSystem.clinit
 (DistributedFileSystem.java:110
   )
   at sun.reflect.NativeConstructorAccessorImpl.
 newInstance0(Native
   Method)
   at
   sun.reflect.NativeConstructorAccessorImpl.newInstance(
 NativeConstructorAccessorImpl.
   java:57)
   at sun.reflect.NativeConstructorAccessorImpl.
 newInstance0(Native
   Method)
   at
   sun.reflect.NativeConstructorAccessorImpl.newInstance(
 NativeConstructorAccessorImpl.
   java:57)
   at
   sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
 DelegatingConstructorAcces
   sorImpl.java:45)
   at java.lang.reflect.Constructor.
 newInstance(Constructor.java:525)
   at java.lang.Class.newInstance0(Class.java:374)
   at java.lang.Class.newInstance(Class.java:327)
   at java.util.ServiceLoader$LazyIterator.next(
 ServiceLoader.java:373)
   at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
   at
   org.apache.hadoop.fs.FileSystem.loadFileSystems(
 FileSystem.java:2364)
   - locked 0xfaeb4fc8 (a java.lang.Class for
   org.apache.hadoop.fs.FileSystem)
   at
   org.apache.hadoop.fs.FileSystem.getFileSystemClass(
 FileSystem.java:2375)
   at
   org.apache.hadoop.fs.FileSystem.createFileSystem(
 FileSystem.java:2392)
   at org.apache.hadoop.fs.FileSystem.access$200(
 FileSystem.java:89)
   at
   org.apache.hadoop.fs.FileSystem$Cache.getInternal(
 FileSystem.java:2431)
   at org.apache.hadoop.fs.FileSystem$Cache.get(
 FileSystem.java:2413)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.
 java:368)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.
 java:167)
   at
   

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case
where a small amount of duplicated code could get rid of the
dependency, that could also be a good short-term option.

- Patrick
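
The piece Catalyst actually pulls from core seems small -- essentially the
consolidated Logging trait plus a Utils helper. A minimal sketch of what a
standalone, slf4j-only logging trait in such a spark-util/common module could
look like (the trait name and helpers are illustrative, not an existing
module):

import org.slf4j.{Logger, LoggerFactory}

trait StandaloneLogging {
  // Lazily resolved logger named after the concrete class, without the
  // trailing '$' that Scala objects carry.
  @transient private var _log: Logger = null

  protected def log: Logger = {
    if (_log == null) {
      _log = LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))
    }
    _log
  }

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
}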

On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Yeah, I'd just add a spark-util that has these things.

 Matei

 On Jul 14, 2014, at 1:04 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Yeah, sadly this dependency was introduced when someone consolidated the
 logging infrastructure.  However, the dependency should be very small and
 thus easy to remove, and I would like catalyst to be usable outside of
 Spark.  A pull request to make this possible would be welcome.

 Ideally, we'd create some sort of spark common package that has things like
 logging.  That way catalyst could depend on that, without pulling in all of
 Hadoop, etc.  Maybe others have opinions though, so I'm cc-ing the dev list.


 On Mon, Jul 14, 2014 at 12:21 AM, Yanbo Liang yanboha...@gmail.com wrote:

 Making Catalyst independent of Spark is a goal of Catalyst, but it may need
 time and evolution.
 I noticed that the package org.apache.spark.sql.catalyst.util imports
 org.apache.spark.util.{Utils => SparkUtils},
 so Catalyst has a dependency on Spark core.
 I'm not sure whether it will be replaced by another component independent of
 Spark in a later release.


 2014-07-14 11:51 GMT+08:00 Aniket Bhatnagar aniket.bhatna...@gmail.com:

 As per the recent presentation given in Scala days
 (http://people.apache.org/~marmbrus/talks/SparkSQLScalaDays2014.pdf), it was
 mentioned that Catalyst is independent of Spark. But on inspecting pom.xml
 of sql/catalyst module, it seems it has a dependency on Spark Core. Any
 particular reason for the dependency? I would love to use Catalyst outside
 Spark

 (reposted as previous email bounced. Sorry if this is a duplicate).






Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Andrew Ash
I don't believe mine is a regression. But it is related to thread safety on
Hadoop Configuration objects. Should I start a new thread?
On Jul 15, 2014 12:55 AM, Patrick Wendell pwend...@gmail.com wrote:

 Andrew is your issue also a regression from 1.0.0 to 1.0.1? The
 immediate priority is addressing regressions between these two
 releases.

 On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote:
  I'm not sure either of those PRs will fix the concurrent adds to
  Configuration issue I observed. I've got a stack trace and writeup I'll
  share in an hour or two (traveling today).
  On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote:
 
  hi,Cody
i met this issue days before and i post a PR for this(
  https://github.com/apache/spark/pull/1385)
  it's very strange that if i synchronize conf it will deadlock but it is
 ok
  when synchronize initLocalJobConfFuncOpt
 
 
   Here's the entire jstack output.
 
 
  On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com
  mailto:pwend...@gmail.com wrote:
 
  Hey Cody,
 
  This Jstack seems truncated, would you mind giving the entire stack
  trace? For the second thread, for instance, we can't see where the
  lock is being acquired.
 
  - Patrick
 
  On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
  cody.koenin...@mediacrossing.com mailto:cody.koeninger@
  mediacrossing.com wrote:
Hi all, just wanted to give a heads up that we're seeing a
  reproducible
deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
   
If jira is a better place for this, apologies in advance -
 figured
  talking
about it on the mailing list was friendlier than randomly
  (re)opening jira
tickets.
   
I know Gary had mentioned some issues with 1.0.1 on the mailing
  list, once
we got a thread dump I wanted to follow up.
   
The thread dump shows the deadlock occurs in the synchronized
  block of code
that was changed in HadoopRDD.scala, for the Spark-1097 issue
   
Relevant portions of the thread dump are summarized below, we
 can
  provide
the whole dump if it's useful.
   
Found one Java-level deadlock:
=
Executor task launch worker-1:
  waiting to lock monitor 0x7f250400c520 (object
  0xfae7dc30, a
org.apache.hadoop.co http://org.apache.hadoop.co
nf.Configuration),
  which is held by Executor task launch worker-0
Executor task launch worker-0:
  waiting to lock monitor 0x7f2520495620 (object
  0xfaeb4fc8, a
java.lang.Class),
  which is held by Executor task launch worker-1
   
   
Executor task launch worker-1:
at
org.apache.hadoop.conf.Configuration.reloadConfiguration(
  Configuration.java:791)
- waiting to lock 0xfae7dc30 (a
org.apache.hadoop.conf.Configuration)
at
org.apache.hadoop.conf.Configuration.addDefaultResource(
  Configuration.java:690)
- locked 0xfaca6ff8 (a java.lang.Class for
org.apache.hadoop.conf.Configurati
on)
at
org.apache.hadoop.hdfs.HdfsConfiguration.clinit(
  HdfsConfiguration.java:34)
at
org.apache.hadoop.hdfs.DistributedFileSystem.clinit
  (DistributedFileSystem.java:110
)
at sun.reflect.NativeConstructorAccessorImpl.
  newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(
  NativeConstructorAccessorImpl.
java:57)
at sun.reflect.NativeConstructorAccessorImpl.
  newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(
  NativeConstructorAccessorImpl.
java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
  DelegatingConstructorAcces
sorImpl.java:45)
at java.lang.reflect.Constructor.
  newInstance(Constructor.java:525)
at java.lang.Class.newInstance0(Class.java:374)
at java.lang.Class.newInstance(Class.java:327)
at java.util.ServiceLoader$LazyIterator.next(
  ServiceLoader.java:373)
at
 java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at
org.apache.hadoop.fs.FileSystem.loadFileSystems(
  FileSystem.java:2364)
- locked 0xfaeb4fc8 (a java.lang.Class for
org.apache.hadoop.fs.FileSystem)
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(
  FileSystem.java:2375)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(
  FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(
  FileSystem.java:89)
at

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Aaron Davidson
This one is typically due to a mismatch between the Hadoop versions --
i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
classpath, or something like that. Not certain why you're seeing this with
spark-ec2, but I'm assuming this is related to the issues you posted in a
separate thread.
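
A quick way to confirm that kind of mismatch from a spark-shell on the
launched cluster is to check what is actually on the classpath (a sketch; the
jets3t class name is taken from the stack trace quoted below):

import scala.util.Try
import org.apache.hadoop.util.VersionInfo

object ClasspathCheck {
  // True if the named class can be loaded from the current classpath.
  def present(className: String): Boolean = Try(Class.forName(className)).isSuccess

  def main(args: Array[String]): Unit = {
    println("Hadoop version on classpath: " + VersionInfo.getVersion)
    println("jets3t S3ServiceException present: " +
      present("org.jets3t.service.S3ServiceException"))
  }
}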


On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Just launched an EC2 cluster from git hash
 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
 accessing data in S3 yields the following error output.

 I understand that NoClassDefFoundError errors may mean something in the
 deployment was messed up. Is that correct? When I launch a cluster using
 spark-ec2, I expect all critical deployment details to be taken care of by
 the script.

 So is something in the deployment executed by spark-ec2 borked?

 Nick

 java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
 at
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
 at
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
 at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
 at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
 at
 org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
 at
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
 at
 org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
 at
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
 at
 org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
 at
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
 at
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
 at
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
 at
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:185)
 at
 org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
 at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
 at
 org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
 at org.apache.spark.rdd.RDD.take(RDD.scala:1036)
 at $iwC$$iwC$$iwC$$iwC.init(console:26)
 at $iwC$$iwC$$iwC.init(console:31)
 at $iwC$$iwC.init(console:33)
 at $iwC.init(console:35)
 at init(console:37)
 at 

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Shivaram Venkataraman
My guess is that this is related to
https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets
excluded from the SBT assembly jar. I am not sure if the assembly jar used
in EC2 is generated using SBT though.

Shivaram


On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com wrote:

 This one is typically due to a mismatch between the Hadoop versions --
 i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
 classpath, or something like that. Not certain why you're seeing this with
 spark-ec2, but I'm assuming this is related to the issues you posted in a
 separate thread.


 On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Just launched an EC2 cluster from git hash
  9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
  accessing data in S3 yields the following error output.
 
  I understand that NoClassDefFoundError errors may mean something in the
  deployment was messed up. Is that correct? When I launch a cluster using
  spark-ec2, I expect all critical deployment details to be taken care of
 by
  the script.
 
  So is something in the deployment executed by spark-ec2 borked?
 
  Nick
 
  java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
  at
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
  at
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
  at
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
  at
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at
 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
  at
  org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
  at
  org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
  at
  org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
  at
 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
  at
  org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
  at
 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
  at
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
  at
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
  at
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:185)
  at
 
 org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
  at
 org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
  at
  org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Patrick Wendell
Yeah - this is likely caused by SPARK-2471.

On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 My guess is that this is related to
 https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library gets
 excluded from the SBT assembly jar. I am not sure if the assembly jar used
 in EC2 is generated using SBT though.

 Shivaram


 On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com wrote:

 This one is typically due to a mismatch between the Hadoop versions --
 i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
 classpath, or something like that. Not certain why you're seeing this with
 spark-ec2, but I'm assuming this is related to the issues you posted in a
 separate thread.


 On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

  Just launched an EC2 cluster from git hash
  9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
  accessing data in S3 yields the following error output.
 
  I understand that NoClassDefFoundError errors may mean something in the
  deployment was messed up. Is that correct? When I launch a cluster using
  spark-ec2, I expect all critical deployment details to be taken care of
 by
  the script.
 
  So is something in the deployment executed by spark-ec2 borked?
 
  Nick
 
  java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
  at
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
  at
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
  at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
  at
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
  at
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at
 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
  at
  org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
  at
  org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
  at
  org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
  at
 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
  at
  org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
  at
 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
  at
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
  at
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
  at
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator.init(CoalescedRDD.scala:185)
  at
 
 org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
  at
 org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
  at
  org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
  at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
  at 

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Patrick Wendell
Hey Andrew,

Yeah, that would be preferable. Definitely worth investigating both,
but the regression is more pressing at the moment.

- Patrick

On Mon, Jul 14, 2014 at 10:02 PM, Andrew Ash and...@andrewash.com wrote:
 I don't believe mine is a regression. But it is related to thread safety on
 Hadoop Configuration objects. Should I start a new thread?
 On Jul 15, 2014 12:55 AM, Patrick Wendell pwend...@gmail.com wrote:

 Andrew is your issue also a regression from 1.0.0 to 1.0.1? The
 immediate priority is addressing regressions between these two
 releases.

 On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash and...@andrewash.com wrote:
  I'm not sure either of those PRs will fix the concurrent adds to
  Configuration issue I observed. I've got a stack trace and writeup I'll
  share in an hour or two (traveling today).
  On Jul 14, 2014 9:50 PM, scwf wangf...@huawei.com wrote:
 
  hi,Cody
i met this issue days before and i post a PR for this(
  https://github.com/apache/spark/pull/1385)
  it's very strange that if i synchronize conf it will deadlock but it is
 ok
  when synchronize initLocalJobConfFuncOpt
 
 
   Here's the entire jstack output.
 
 
  On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell pwend...@gmail.com
  mailto:pwend...@gmail.com wrote:
 
  Hey Cody,
 
  This Jstack seems truncated, would you mind giving the entire stack
  trace? For the second thread, for instance, we can't see where the
  lock is being acquired.
 
  - Patrick
 
  On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
  cody.koenin...@mediacrossing.com mailto:cody.koeninger@
  mediacrossing.com wrote:
Hi all, just wanted to give a heads up that we're seeing a
  reproducible
deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
   
If jira is a better place for this, apologies in advance -
 figured
  talking
about it on the mailing list was friendlier than randomly
  (re)opening jira
tickets.
   
I know Gary had mentioned some issues with 1.0.1 on the mailing
  list, once
we got a thread dump I wanted to follow up.
   
The thread dump shows the deadlock occurs in the synchronized
  block of code
that was changed in HadoopRDD.scala, for the Spark-1097 issue
   
Relevant portions of the thread dump are summarized below, we
 can
  provide
the whole dump if it's useful.
   
Found one Java-level deadlock:
=
Executor task launch worker-1:
  waiting to lock monitor 0x7f250400c520 (object
  0xfae7dc30, a
org.apache.hadoop.co http://org.apache.hadoop.co
nf.Configuration),
  which is held by Executor task launch worker-0
Executor task launch worker-0:
  waiting to lock monitor 0x7f2520495620 (object
  0xfaeb4fc8, a
java.lang.Class),
  which is held by Executor task launch worker-1
   
   
Executor task launch worker-1:
at
org.apache.hadoop.conf.Configuration.reloadConfiguration(
  Configuration.java:791)
- waiting to lock 0xfae7dc30 (a
org.apache.hadoop.conf.Configuration)
at
org.apache.hadoop.conf.Configuration.addDefaultResource(
  Configuration.java:690)
- locked 0xfaca6ff8 (a java.lang.Class for
org.apache.hadoop.conf.Configurati
on)
at
org.apache.hadoop.hdfs.HdfsConfiguration.clinit(
  HdfsConfiguration.java:34)
at
org.apache.hadoop.hdfs.DistributedFileSystem.clinit
  (DistributedFileSystem.java:110
)
at sun.reflect.NativeConstructorAccessorImpl.
  newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(
  NativeConstructorAccessorImpl.
java:57)
at sun.reflect.NativeConstructorAccessorImpl.
  newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(
  NativeConstructorAccessorImpl.
java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
  DelegatingConstructorAcces
sorImpl.java:45)
at java.lang.reflect.Constructor.
  newInstance(Constructor.java:525)
at java.lang.Class.newInstance0(Class.java:374)
at java.lang.Class.newInstance(Class.java:327)
at java.util.ServiceLoader$LazyIterator.next(
  ServiceLoader.java:373)
at
 java.util.ServiceLoader$1.next(ServiceLoader.java:445)
at
org.apache.hadoop.fs.FileSystem.loadFileSystems(
  FileSystem.java:2364)
- locked 0xfaeb4fc8 (a java.lang.Class for
org.apache.hadoop.fs.FileSystem)
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(
  FileSystem.java:2375)

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Nicholas Chammas
Okie doke--added myself as a watcher on that issue.

On a related note, what are the thoughts on automatically spinning up/down
EC2 clusters and running tests against them? It would probably be way too
cumbersome to do that for every build, but perhaps on some schedule it
could help validate that we are still deploying EC2 clusters correctly.

Would something like that be valuable?

Nick


On Tue, Jul 15, 2014 at 1:19 AM, Patrick Wendell pwend...@gmail.com wrote:

 Yeah - this is likely caused by SPARK-2471.

 On Mon, Jul 14, 2014 at 10:11 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  My guess is that this is related to
  https://issues.apache.org/jira/browse/SPARK-2471 where the S3 library
 gets
  excluded from the SBT assembly jar. I am not sure if the assembly jar
 used
  in EC2 is generated using SBT though.
 
  Shivaram
 
 
  On Mon, Jul 14, 2014 at 10:02 PM, Aaron Davidson ilike...@gmail.com
 wrote:
 
  This one is typically due to a mismatch between the Hadoop versions --
  i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
  classpath, or something like that. Not certain why you're seeing this
 with
  spark-ec2, but I'm assuming this is related to the issues you posted in
 a
  separate thread.
 
 
  On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
   Just launched an EC2 cluster from git hash
   9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
   accessing data in S3 yields the following error output.
  
   I understand that NoClassDefFoundError errors may mean something in
 the
   deployment was messed up. Is that correct? When I launch a cluster
 using
   spark-ec2, I expect all critical deployment details to be taken care
 of
  by
   the script.
  
   So is something in the deployment executed by spark-ec2 borked?
  
   Nick
  
   java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
   at
  
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
   at
  
 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
   at
   org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
   at
  
 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
   at
  
 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
   at
 org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
   at
 org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
   at
  
 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
   at
  org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
   at org.apache.spark.ShuffleDependency.init(Dependency.scala:71)
   at
   org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
   at
   org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
   at
   org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
   at
  
 
 org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
   at
   org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
   at
  
 
 org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
   at
  
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
   at
  
 
 org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
   at