How to run specific sparkSQL test with maven

2014-08-01 Thread 田毅
Hi everyone!

Could anyone tell me how to run a specific Spark SQL test with Maven?

For example:

I want to test HiveCompatibilitySuite.

I ran “mvn test -Dtest=HiveCompatibilitySuite”

It did not work. 

BTW, is there any information about how to build a test environment for Spark SQL?

I got this error when I ran the test.

It seems that HiveCompatibilitySuite needs a Hadoop and Hive environment -- am
I right?
 
Relative path in absolute URI: file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1” 







Re:How to run specific sparkSQL test with maven

2014-08-01 Thread witgo
You can try these commands:

./sbt/sbt assembly
./sbt/sbt test-only *.HiveCompatibilitySuite -Phive







Re: Re:How to run specific sparkSQL test with maven

2014-08-01 Thread Jeremy Freeman
With Maven you can run a particular test suite like this:

mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test

See the note here (under "Spark Tests in Maven"):

http://spark.apache.org/docs/latest/building-with-maven.html





Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Heya,
Dunno if these ideas are still in the air or fell in the warp ^^.
However, there is a paper on avocado
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf
that mentions a way of working with their data (sequence reads) in a windowed
manner without either a time or a timestamp field value, using a kind of
internal index as range delimiter -- thus defining their own exotic
continuum and break function.

greetz,

 aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Thu, Jul 17, 2014 at 1:11 AM, andy petrella andy.petre...@gmail.com
wrote:

 Indeed, these two cases are tightly coupled (the first one is a special
 case of the second).

 Actually, these outliers could be handled by a dedicated function that I
 named outliersManager -- I was not very inspired ^^, but we could name
 these outliers outlaws, and thus the function would be sheriff.
 The purpose of this sheriff function would be to create yet another
 distributed collection (RDD, CRDD, ...?) with only the --outliers-- outlaws
 in it.

 Because these problems have a nature which will be as different as the use
 cases will be, it's hard to find a generic way to tackle them. So, you
 know... that's why... I temporarily put them in jail and wait for the judge
 to show them the right path! (okay, it's late in Belgium -- 1AM).

 All in all, it's more or less what we would do in DStream as well, actually.
 Let me expand this reasoning a bit: let's assume that some data points can
 come along with the time, but aren't in sync with it -- f.i., a device that
 wakes up and sends all its data at once.
 The DStream will package them into RDDs mixed up with truly current data
 points; however, the logic of the job will have to take a 'Y' road:
 * to integrate them into a database at the right place
 * to simply drop them because they won't be part of a shown chart
 * etc

 In this case, the 'Y' road would be of the contract ;-), and so left to
 the appreciation of the dev.

 Another way to do it would be to ignore them but log them, but that would be
 very crappy, unprofessional and useful (and of course I'm just kidding).

 my0.002¢



  aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab


 On Thu, Jul 17, 2014 at 12:31 AM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 I think it makes sense, though without a concrete implementation it's hard
 to be sure. Applying sorting on the RDDs according to the timestamps makes
 sense, but I can think of two kinds of fundamental problems.

 1. How do you deal with ordering across RDD boundaries? Say two consecutive
 RDDs in the DStream have the following record timestamps: RDD 1: [1, 2, 3,
 4, 6, 7], RDD 2: [5, 8, 9, 10]. And you want to run a function through
 all these records in timestamp order. I am curious to find how this
 problem can be solved without sacrificing efficiency (e.g. I can imagine
 doing multiple-pass magic).

 2. An even more fundamental question is how you ensure ordering with
 delayed records. If you want to process in order of application time and
 records are delayed, how do you deal with them?

 Any ideas? ;)

 TD
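
A minimal Scala sketch of the per-batch sorting discussed above, assuming
records carry a Long timestamp field (the Event type and function name are
illustrative, not an existing API). It orders records within each RDD of the
DStream, but it leaves both problems above open: ordering across consecutive
RDDs and delayed records.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Illustrative record type; the timestamp field is an assumption.
case class Event(timestamp: Long, payload: String)

// Sort the records inside each batch RDD by application timestamp.
// A 5 arriving in the batch after 6 and 7 (problem 1) is not handled here,
// nor are records that arrive late (problem 2).
def sortWithinBatches(events: DStream[Event]): DStream[Event] =
  events.transform { rdd: RDD[Event] => rdd.sortBy(_.timestamp) }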



 On Wed, Jul 16, 2014 at 2:37 AM, andy petrella andy.petre...@gmail.com
 wrote:

  Heya TD,
 
  Thanks for the detailed answer! Much appreciated.
 
  Regarding order among elements within an RDD, you're definitely right:
  it'd kill the //ism and would require synchronization, which is completely
  avoided in a distributed env.

  That's why I won't push this constraint to the RDDs themselves, actually;
  only the Space is something that *defines* ordered elements, and thus there
  are two functions that will break the RDDs based on a given (extensible,
  pluggable) heuristic, f.i.
  Since the Space is rather decoupled from the data, thus from the source and
  the partitions, it's the responsibility of the CRDD implementation to
  dictate how (if necessary) the elements should be sorted in the RDDs...
  which will require some shuffles :-s -- or the couple (source, space) is
  something intrinsically ordered (like it is for DStream).

  To be more concrete, an RDD would be composed of an un-ordered iterator of
  millions of events for which all timestamps land in the same time
  interval.

  WDYT, would that make sense?
 
  thanks again for the answer!
 
  greetz
 
   aℕdy ℙetrella
  about.me/noootsab
  [image: aℕdy ℙetrella on about.me]
 
  http://about.me/noootsab
 
 
  On Wed, Jul 16, 2014 at 12:33 AM, Tathagata Das 
  tathagata.das1...@gmail.com
   wrote:
 
   Very interesting ideas Andy!

   Conceptually I think it makes sense. In fact, it is true that dealing with
   time-series data, windowing over application time, and windowing over
   number of events are things that DStream does not natively support. The
   real challenge is actually mapping the conceptual windows onto the
   underlying RDD model. One aspect you 

Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Michael Armbrust

 It seems that the HiveCompatibilitySuite need a hadoop and hive
 environment, am I right?

 Relative path in absolute URI:
 file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1”


You should only need Hadoop and Hive if you are creating new tests that we
need to compute the answers for.  Existing tests are run with cached
answers.  There are details about the configuration here:
https://github.com/apache/spark/tree/master/sql


Interested in contributing to GraphX in Python

2014-08-01 Thread Rajiv Abraham
Hi,
I just saw Ankur's GraphX presentation and it looks very exciting! I would
like to contribute to a Python version of GraphX. I checked out JIRA and
GitHub but I did not find much info.

- Are there currently limitations to porting GraphX to Python? (e.g. maybe the
Python Spark RDD API is incomplete or not refactored for GraphX compared
to the Scala version)
- If I had to start, could I take inspiration from the Scala version and
try to emulate it in Python?
- Otherwise, any suggestions of starter tasks regarding GraphX in Python
would be appreciated.



-- 
Take care,
Rajiv


My Spark application had a huge performance regression after Spark git commit: 0441515f221146756800dc583b225bdec8a6c075

2014-08-01 Thread Jin, Zhonghui

I found a huge performance regression (1/20 of the original performance) in my
application after Spark git commit 0441515f221146756800dc583b225bdec8a6c075.

Applying the following patch fixes my issue:

diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala 
b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 214a8c8..ebec21d 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -145,7 +145,7 @@ private[spark] class Executor(
   }
 }
-override def run() {
+override def run() : Unit = SparkHadoopUtil.get.runAsSparkUser { () =>
   val startTime = System.currentTimeMillis()
   SparkEnv.set(env)
   Thread.currentThread.setContextClassLoader(replClassLoader)

runAsSparkUser calls 'UserGroupInformation.doAs()' to execute the task, and with
it my application runs OK; when not going through it, the performance was very
poor. The application hotspot was JNIHandleBlock::alloc_handle (JVM code), with a
very high CPI (cycles per instruction; below 1 is OK, here it was above 10).
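
For context, a rough sketch (not the actual SparkHadoopUtil code; runAs is a
made-up helper) of what wrapping a task body in UserGroupInformation.doAs()
looks like:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// The body executes inside the UGI's security context for the given user;
// this is roughly what runAsSparkUser adds around the executor's task loop.
def runAs[T](user: String)(body: => T): T = {
  val ugi = UserGroupInformation.createRemoteUser(user)
  ugi.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body
  })
}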

My application passes large arrays (80K elements) to native C code through
JNI.

Why does UserGroupInformation.doAs() have such a large impact on performance in
this situation?


Thanks,
Zhonghui



Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Andrew Ash
After several days of debugging, we think the issue is that we have
conflicting versions of Guava.  Our application was running with Guava 14
and the Spark services (Master, Workers, Executors) had Guava 16.  We had
custom Kryo serializers for Guava's ImmutableLists, and commenting out
those register calls did the trick.

Have people had issues with Guava version mismatches in the past?

I've found @srowen's Guava 14-to-11 downgrade PR here
https://github.com/apache/spark/pull/1610 and some extended discussion on
https://issues.apache.org/jira/browse/SPARK-2420 for Hive compatibility
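
For reference, a hypothetical sketch of the kind of registration described
above; the actual application code is not shown in this thread, and
MyImmutableListSerializer is a made-up name:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// A custom registrator is wired in via the spark.kryo.registrator setting.
// Commenting out register calls like the one below is what worked around the
// Guava 14 vs. 16 mismatch described above.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // kryo.register(classOf[com.google.common.collect.ImmutableList[_]],
    //   new MyImmutableListSerializer)
  }
}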


On Thu, Jul 31, 2014 at 10:47 AM, Andrew Ash and...@andrewash.com wrote:

 Hi everyone,

 I'm seeing the below exception coming out of Spark 1.0.1 when I call it
 from my application.  I can't share the source to that application, but the
 quick gist is that it uses Spark's Java APIs to read from Avro files in
 HDFS, do processing, and write back to Avro files.  It does this by
 receiving a REST call, then spinning up a new JVM as the driver application
 that connects to Spark.  I'm using CDH4.4.0 and have enabled Kryo and also
 speculation.  The cluster is running in standalone mode on a 6 node cluster
 in AWS (not using Spark's EC2 scripts though).

 The below stacktraces are reliably reproducible on every run of the job.
 The issue seems to be that on deserialization of a task result on the
 driver, Kryo blows up while reading the ClassManifest.

 I've tried swapping in Kryo 2.23.1 rather than 2.21 (2.22 had some
 backcompat issues) but had the same error.

 Any ideas on what can be done here?

 Thanks!
 Andrew



 In the driver (Kryo exception while deserializing a DirectTaskResult):

 INFO   | jvm 1| 2014/07/30 20:52:52 | 20:52:52.667 [Result resolver
 thread-0] ERROR o.a.spark.scheduler.TaskResultGetter - Exception while
 getting task result
 INFO   | jvm 1| 2014/07/30 20:52:52 |
 com.esotericsoftware.kryo.KryoException: Buffer underflow.
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.io.Input.require(Input.java:156)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:762)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:624) ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:26)
 ~[chill_2.10-0.3.6.jar:0.3.6]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:19)
 ~[chill_2.10-0.3.6.jar:0.3.6]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:147)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:480)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:316)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [na:1.7.0_65]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [na:1.7.0_65]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]


 In the DAGScheduler 

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
Thanks Patrick -- It does look like some maven misconfiguration as

wget
http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom

works for me.

Shivaram



On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:

 This is a Scala bug - I filed something upstream; hopefully they can fix
 it soon and/or we can provide a workaround:

 https://issues.scala-lang.org/browse/SI-8772

 - Patrick


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems;
 however, if you have it in your Ivy cache it should work.


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca
 wrote:

 Me 3


 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:

 I also ran into the same issue. What is the solution?







 --
 Cell : 425-233-8271




 --
 Cell : 425-233-8271





Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread Mayur Rustagi
Interesting. Clickstream data would have its own window concept based on
user sessions; I can imagine windows would change across streams, but
wouldn't they largely be domain-specific in nature?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi




Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Actually, for clickstream data the user space wouldn't be a continuum, unless
the order of users is important or the fact that they arrive in a kind
of order can be used by the algo.
The purpose of the break (or binning) function is to package things into a
cluster for which we know the properties, but we don't know in advance
which or how many elements it will contain.
However, this would need to extend the notion of continuum I thought of to,
indeed, include categorical spaces and thus allow a groupBy-style mapping to
RDDs.
And actually, there would be a way to fall back to a continuum if the
break function were dictated by a trained model that can cluster the
users, and they were shuffled accordingly beforehand to form a
sequence where they arrive in batches.
Just thinking (and struggling to write this on a tablet, man... how
unfriendly are this keyboard and small screen ☺)
Cheers
Andy
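
A minimal sketch of what such a break/binning function could look like as a
type; Space, binOf and breakBy are made-up names, not an existing API:

// Hypothetical sketch only: a "break" function decides, for incoming records,
// which bin (future RDD) each one belongs to. A categorical space just needs
// binOf to map records to discrete keys rather than positions on a continuum.
trait Space[S] {
  def binOf(position: S): Long
}

// Group (position, record) pairs into bins; each bin would back one RDD of the
// continuous collection.
def breakBy[S, A](space: Space[S])(records: Seq[(S, A)]): Map[Long, Seq[A]] =
  records.groupBy { case (pos, _) => space.binOf(pos) }
    .map { case (bin, xs) => bin -> xs.map(_._2) }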


Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Colin McCabe
On Fri, Aug 1, 2014 at 2:45 PM, Andrew Ash and...@andrewash.com wrote:
 After several days of debugging, we think the issue is that we have
 conflicting versions of Guava.  Our application was running with Guava 14
 and the Spark services (Master, Workers, Executors) had Guava 16.  We had
 custom Kryo serializers for Guava's ImmutableLists, and commenting out
 those register calls did the trick.

 Have people had issues with Guava version mismatches in the past?

There's some discussion about dealing with Guava version issues in
Spark in SPARK-2420.

best,
Colin




SparkContext.hadoopConfiguration vs. SparkHadoopUtil.newConfiguration()

2014-08-01 Thread Marcelo Vanzin
Hi all,

While working on some seemingly unrelated code, I ran into this issue
where spark.hadoop.* configs were not making it into the Configuration
objects in some parts of the code. I was trying to do that to avoid
having to do dirty tricks with the classpath while running tests, but
that's a little beside the point.

Since I don't know the history of that code in SparkContext, does
anybody see any issue with moving it up a layer so that all code that
uses SparkHadoopUtil.newConfiguration() does the same thing?

This would also include some code (e.g. in the yarn module) that does
new Configuration() directly instead of going through the wrapper.


-- 
Marcelo
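
For context, a minimal sketch (not the actual Spark code) of the propagation
being discussed: copying spark.hadoop.* entries from the SparkConf into a
freshly created Hadoop Configuration.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Every spark.hadoop.foo=bar entry in the SparkConf becomes foo=bar in the
// returned Hadoop Configuration.
def newHadoopConfiguration(sparkConf: SparkConf): Configuration = {
  val hadoopConf = new Configuration()
  sparkConf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
  }
  hadoopConf
}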


Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Cheng Lian
It’s also useful to set hive.exec.mode.local.auto to true to accelerate the
test.
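
A hedged illustration only, assuming a 1.0-era HiveContext-based test session
(the exact hook depends on how your test sets Hive configuration):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Illustrative setup; adapt to your test harness. Enabling Hive's local mode
// lets small queries skip launching MapReduce jobs.
val sc = new SparkContext("local", "spark-sql-tests")
val hiveContext = new HiveContext(sc)
hiveContext.hql("SET hive.exec.mode.local.auto=true")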

