How to run specific sparkSQL test with maven

2014-08-01 Thread 田毅
Hi everyone!

Could anyone tell me how to run a specific Spark SQL test with Maven?

For example:

I want to test HiveCompatibilitySuite.

I ran “mvn test -Dtest=HiveCompatibilitySuite”

It did not work. 

BTW, is there any information about how to build a test environment for Spark SQL?

I got this error when I ran the test.

It seems that HiveCompatibilitySuite needs a Hadoop and Hive environment -- am
I right?
 
Relative path in absolute URI: file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1” 







Re:How to run specific sparkSQL test with maven

2014-08-01 Thread witgo
You can try these commands:

./sbt/sbt assembly
./sbt/sbt test-only *.HiveCompatibilitySuite -Phive







Re: Re:How to run specific sparkSQL test with maven

2014-08-01 Thread Jeremy Freeman
With Maven you can run a particular test suite like this:

mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test

See the note here (under "Spark Tests in Maven"):

http://spark.apache.org/docs/latest/building-with-maven.html





Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Heya,
Dunno if these ideas are still in the air or fell in the warp ^^.
However, there is a paper on avocado
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf
that mentions a way of working with their data (sequence reads) in a windowed
manner without either a time or a timestamp field value, using a kind of
internal index as range delimiter -- thus defining their own exotic
continuum and break function.

greetz,

 aℕdy ℙetrella
about.me/noootsab
[image: aℕdy ℙetrella on about.me]

http://about.me/noootsab


On Thu, Jul 17, 2014 at 1:11 AM, andy petrella andy.petre...@gmail.com
wrote:

 Indeed, these two cases are tightly coupled (the first one is a special
 case of the second).

 Actually, these outliers could be handled by a dedicated function that I
 named outliersManager -- I was not very inspired ^^, but we could name
 these outliers outlaws, and thus the function would be sheriff.
 The purpose of this sheriff function would be to create yet another
 distributed collection (RDD, CRDD, ...?) with only the --outliers-- outlaws
 in it.

 Because these problems have a nature which will be as different as the use
 cases will be, it's hard to find a generic way to tackle them. So, you
 know... that's why... I temporarily put them in jail and wait for the judge
 to show them the right path! (okay, it's late in Belgium -- 1AM).

 All in all, it's more or less what we would do in DStream as well, actually.
 Let me expand this reasoning a bit: let's assume that some data points can
 come along with the time, but aren't in sync with it -- f.i., a device that
 wakes up and sends all its data at once.
 The DStream will package them into RDDs mixed up with truly current data
 points; however, the logic of the job will have to take a 'Y' road:
 * to integrate them into a database at the right place
 * to simply drop them because they won't be part of a shown chart
 * etc

 In this case, the 'Y' road would be of the contract ;-), and so left to
 the appreciation of the dev.

 Another way to do it would be to ignore them but log them, but that would be
 very crappy, unprofessional and useful (and of course I'm just kidding).

 my0.002¢



  aℕdy ℙetrella
 about.me/noootsab
 [image: aℕdy ℙetrella on about.me]

 http://about.me/noootsab


 On Thu, Jul 17, 2014 at 12:31 AM, Tathagata Das 
 tathagata.das1...@gmail.com wrote:

 I think it makes sense, though without a concrete implementation it's hard
 to be sure. Applying sorting on the RDDs according to the timestamps makes
 sense, but I can think of two kinds of fundamental problems.

 1. How do you deal with ordering across RDD boundaries? Say two consecutive
 RDDs in the DStream have the following record timestamps: RDD 1: [1, 2, 3,
 4, 6, 7], RDD 2: [5, 8, 9, 10]. And you want to run a function through
 all these records in timestamp order. I am curious to find how this
 problem can be solved without sacrificing efficiency (e.g. I can imagine
 doing multiple-pass magic).

 2. An even more fundamental question is how you ensure ordering with
 delayed records. If you want to process in order of application time and
 records are delayed, how do you deal with them?

 Any ideas? ;)

 TD
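
A minimal Scala sketch of the per-batch sorting discussed above, assuming
records carry a Long timestamp field (the Event type and function name are
illustrative, not an existing API). It orders records within each RDD of the
DStream, but it leaves both problems above open: ordering across consecutive
RDDs and delayed records.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Illustrative record type; the timestamp field is an assumption.
case class Event(timestamp: Long, payload: String)

// Sort the records inside each batch RDD by application timestamp.
// A 5 arriving in the batch after 6 and 7 (problem 1) is not handled here,
// nor are records that arrive late (problem 2).
def sortWithinBatches(events: DStream[Event]): DStream[Event] =
  events.transform { rdd: RDD[Event] => rdd.sortBy(_.timestamp) }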



 On Wed, Jul 16, 2014 at 2:37 AM, andy petrella andy.petre...@gmail.com
 wrote:

  Heya TD,
 
  Thanks for the detailed answer! Much appreciated.
 
  Regarding order among elements within an RDD, you're definitely right:
  it'd kill the //ism and would require synchronization, which is completely
  avoided in a distributed env.

  That's why I won't push this constraint to the RDDs themselves, actually;
  only the Space is something that *defines* ordered elements, and thus there
  are two functions that will break the RDDs based on a given (extensible,
  pluggable) heuristic, f.i.
  Since the Space is rather decoupled from the data, thus from the source and
  the partitions, it's the responsibility of the CRDD implementation to
  dictate how (if necessary) the elements should be sorted in the RDDs...
  which will require some shuffles :-s -- or the couple (source, space) is
  something intrinsically ordered (like it is for DStream).

  To be more concrete, an RDD would be composed of an un-ordered iterator of
  millions of events for which all timestamps land in the same time
  interval.

  WDYT, would that make sense?
 
  thanks again for the answer!
 
  greetz
 
   aℕdy ℙetrella
  about.me/noootsab
  [image: aℕdy ℙetrella on about.me]
 
  http://about.me/noootsab
 
 
  On Wed, Jul 16, 2014 at 12:33 AM, Tathagata Das 
  tathagata.das1...@gmail.com
   wrote:
 
   Very interesting ideas Andy!

   Conceptually I think it makes sense. In fact, it is true that dealing with
   time-series data, windowing over application time, and windowing over
   number of events are things that DStream does not natively support. The
   real challenge is actually mapping the conceptual windows onto the
   underlying RDD model. One aspect you 

Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Michael Armbrust

 It seems that the HiveCompatibilitySuite need a hadoop and hive
 environment, am I right?

 Relative path in absolute URI:
 file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1”


You should only need Hadoop and Hive if you are creating new tests that we
need to compute the answers for.  Existing tests are run with cached
answers.  There are details about the configuration here:
https://github.com/apache/spark/tree/master/sql


Interested in contributing to GraphX in Python

2014-08-01 Thread Rajiv Abraham
Hi,
I just saw Ankur's GraphX presentation and it looks very exciting! I would
like to contribute to a Python version of GraphX. I checked out JIRA and
GitHub but I did not find much info.

- Are there currently limitations to porting GraphX to Python? (e.g. maybe the
Python Spark RDD API is incomplete or not refactored for GraphX compared
to the Scala version)
- If I had to start, could I take inspiration from the Scala version and
try to emulate it in Python?
- Otherwise, any suggestions of starter tasks regarding GraphX in Python
would be appreciated.



-- 
Take care,
Rajiv


My Spark application had a huge performance regression after Spark git commit: 0441515f221146756800dc583b225bdec8a6c075

2014-08-01 Thread Jin, Zhonghui

I found a huge performance regression (1/20 of the original performance) in my
application after Spark git commit 0441515f221146756800dc583b225bdec8a6c075.

Applying the following patch fixes my issue:

diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala 
b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 214a8c8..ebec21d 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -145,7 +145,7 @@ private[spark] class Executor(
   }
 }
-override def run() {
+override def run() : Unit = SparkHadoopUtil.get.runAsSparkUser { () =>
   val startTime = System.currentTimeMillis()
   SparkEnv.set(env)
   Thread.currentThread.setContextClassLoader(replClassLoader)

runAsSparkUser calls 'UserGroupInformation.doAs()' to execute the task, and with
it my application runs OK; when not going through it, the performance was very
poor. The application hotspot was JNIHandleBlock::alloc_handle (JVM code), with a
very high CPI (cycles per instruction; below 1 is OK, here it was above 10).
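
For context, a rough sketch (not the actual SparkHadoopUtil code; runAs is a
made-up helper) of what wrapping a task body in UserGroupInformation.doAs()
looks like:

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// The body executes inside the UGI's security context for the given user;
// this is roughly what runAsSparkUser adds around the executor's task loop.
def runAs[T](user: String)(body: => T): T = {
  val ugi = UserGroupInformation.createRemoteUser(user)
  ugi.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body
  })
}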

My application passes large arrays (80K elements) to native C code through
JNI.

Why does UserGroupInformation.doAs() have such a large impact on performance in
this situation?


Thanks,
Zhonghui



Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Andrew Ash
After several days of debugging, we think the issue is that we have
conflicting versions of Guava.  Our application was running with Guava 14
and the Spark services (Master, Workers, Executors) had Guava 16.  We had
custom Kryo serializers for Guava's ImmutableLists, and commenting out
those register calls did the trick.

Have people had issues with Guava version mismatches in the past?

I've found @srowen's Guava 14-to-11 downgrade PR here
https://github.com/apache/spark/pull/1610 and some extended discussion on
https://issues.apache.org/jira/browse/SPARK-2420 for Hive compatibility
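
For reference, a hypothetical sketch of the kind of registration described
above; the actual application code is not shown in this thread, and
MyImmutableListSerializer is a made-up name:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// A custom registrator is wired in via the spark.kryo.registrator setting.
// Commenting out register calls like the one below is what worked around the
// Guava 14 vs. 16 mismatch described above.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // kryo.register(classOf[com.google.common.collect.ImmutableList[_]],
    //   new MyImmutableListSerializer)
  }
}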


On Thu, Jul 31, 2014 at 10:47 AM, Andrew Ash and...@andrewash.com wrote:

 Hi everyone,

 I'm seeing the below exception coming out of Spark 1.0.1 when I call it
 from my application.  I can't share the source to that application, but the
 quick gist is that it uses Spark's Java APIs to read from Avro files in
 HDFS, do processing, and write back to Avro files.  It does this by
 receiving a REST call, then spinning up a new JVM as the driver application
 that connects to Spark.  I'm using CDH4.4.0 and have enabled Kryo and also
 speculation.  The cluster is running in standalone mode on a 6 node cluster
 in AWS (not using Spark's EC2 scripts though).

 The below stacktraces are reliably reproducible on every run of the job.
 The issue seems to be that on deserialization of a task result on the
 driver, Kryo blows up while reading the ClassManifest.

 I've tried swapping in Kryo 2.23.1 rather than 2.21 (2.22 had some
 backcompat issues) but had the same error.

 Any ideas on what can be done here?

 Thanks!
 Andrew



 In the driver (Kryo exception while deserializing a DirectTaskResult):

 INFO   | jvm 1| 2014/07/30 20:52:52 | 20:52:52.667 [Result resolver
 thread-0] ERROR o.a.spark.scheduler.TaskResultGetter - Exception while
 getting task result
 INFO   | jvm 1| 2014/07/30 20:52:52 |
 com.esotericsoftware.kryo.KryoException: Buffer underflow.
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.io.Input.require(Input.java:156)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:762)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:624) ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:26)
 ~[chill_2.10-0.3.6.jar:0.3.6]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:19)
 ~[chill_2.10-0.3.6.jar:0.3.6]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
 ~[kryo-2.21.jar:na]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:147)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:480)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:316)
 ~[spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
 [spark-core_2.10-1.0.1.jar:1.0.1]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 [na:1.7.0_65]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 [na:1.7.0_65]
 INFO   | jvm 1| 2014/07/30 20:52:52 |   at
 java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]


 In the DAGScheduler 

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-01 Thread Shivaram Venkataraman
Thanks Patrick -- It does look like some maven misconfiguration as

wget
http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom

works for me.

Shivaram



On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:

 This is a Scala bug - I filed something upstream; hopefully they can fix
 it soon and/or we can provide a workaround:

 https://issues.scala-lang.org/browse/SI-8772

 - Patrick


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:

 Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems;
 however, if you have it in your Ivy cache it should work.


 On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca
 wrote:

 Me 3


 On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:

 I also ran into the same issue. What is the solution?







 --
 Cell : 425-233-8271




 --
 Cell : 425-233-8271





Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread Mayur Rustagi
Interesting. Clickstream data would have its own window concept based on
user sessions; I can imagine windows would change across streams, but
wouldn't they largely be domain-specific in nature?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi




Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Actually, for clickstream data the user space wouldn't be a continuum, unless
the order of users is important or the fact that they arrive in a kind
of order can be used by the algo.
The purpose of the break (or binning) function is to package things into a
cluster for which we know the properties, but we don't know in advance
which or how many elements it will contain.
However, this would need to extend the notion of continuum I thought of to,
indeed, include categorical spaces and thus allow a groupBy-style mapping to
RDDs.
And actually, there would be a way to fall back to a continuum if the
break function were dictated by a trained model that can cluster the
users, and they were shuffled accordingly beforehand to form a
sequence where they arrive in batches.
Just thinking (and struggling to write this on a tablet, man... how
unfriendly are this keyboard and small screen ☺)
Cheers
Andy
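
A minimal sketch of what such a break/binning function could look like as a
type; Space, binOf and breakBy are made-up names, not an existing API:

// Hypothetical sketch only: a "break" function decides, for incoming records,
// which bin (future RDD) each one belongs to. A categorical space just needs
// binOf to map records to discrete keys rather than positions on a continuum.
trait Space[S] {
  def binOf(position: S): Long
}

// Group (position, record) pairs into bins; each bin would back one RDD of the
// continuous collection.
def breakBy[S, A](space: Space[S])(records: Seq[(S, A)]): Map[Long, Seq[A]] =
  records.groupBy { case (pos, _) => space.binOf(pos) }
    .map { case (bin, xs) => bin -> xs.map(_._2) }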


Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow

2014-08-01 Thread Colin McCabe
On Fri, Aug 1, 2014 at 2:45 PM, Andrew Ash and...@andrewash.com wrote:
 After several days of debugging, we think the issue is that we have
 conflicting versions of Guava.  Our application was running with Guava 14
 and the Spark services (Master, Workers, Executors) had Guava 16.  We had
 custom Kryo serializers for Guava's ImmutableLists, and commenting out
 those register calls did the trick.

 Have people had issues with Guava version mismatches in the past?

There's some discussion about dealing with Guava version issues in
Spark in SPARK-2420.

best,
Colin




SparkContext.hadoopConfiguration vs. SparkHadoopUtil.newConfiguration()

2014-08-01 Thread Marcelo Vanzin
Hi all,

While working on some seemingly unrelated code, I ran into this issue
where spark.hadoop.* configs were not making it into the Configuration
objects in some parts of the code. I was trying to do that to avoid
having to do dirty tricks with the classpath while running tests, but
that's a little beside the point.

Since I don't know the history of that code in SparkContext, does
anybody see any issue with moving it up a layer so that all code that
uses SparkHadoopUtil.newConfiguration() does the same thing?

This would also include some code (e.g. in the yarn module) that does
new Configuration() directly instead of going through the wrapper.


-- 
Marcelo
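
For context, a minimal sketch (not the actual Spark code) of the propagation
being discussed: copying spark.hadoop.* entries from the SparkConf into a
freshly created Hadoop Configuration.

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Every spark.hadoop.foo=bar entry in the SparkConf becomes foo=bar in the
// returned Hadoop Configuration.
def newHadoopConfiguration(sparkConf: SparkConf): Configuration = {
  val hadoopConf = new Configuration()
  sparkConf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.substring("spark.hadoop.".length), value)
    }
  }
  hadoopConf
}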


Re: How to run specific sparkSQL test with maven

2014-08-01 Thread Cheng Lian
It’s also useful to set hive.exec.mode.local.auto to true to accelerate the
test.
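
A hedged illustration only, assuming a 1.0-era HiveContext-based test session
(the exact hook depends on how your test sets Hive configuration):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Illustrative setup; adapt to your test harness. Enabling Hive's local mode
// lets small queries skip launching MapReduce jobs.
val sc = new SparkContext("local", "spark-sql-tests")
val hiveContext = new HiveContext(sc)
hiveContext.hql("SET hive.exec.mode.local.auto=true")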

