How to run a specific Spark SQL test with Maven
Hi everyone! Could anyone tell me how to run a specific Spark SQL test with Maven? For example, I want to run HiveCompatibilitySuite. I ran "mvn test -Dtest=HiveCompatibilitySuite", but it did not work. BTW, is there any information about how to set up a test environment for Spark SQL? I got this error when I ran the test. It seems that HiveCompatibilitySuite needs a Hadoop and Hive environment, am I right?

Relative path in absolute URI: file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1
Re: How to run a specific Spark SQL test with Maven
You can try these commands:

    ./sbt/sbt assembly
    ./sbt/sbt test-only *.HiveCompatibilitySuite -Phive

-- Original --
From: 田毅 <tia...@asiainfo.com>
Date: Fri, Aug 1, 2014 05:00 PM
To: dev <dev@spark.apache.org>
Subject: How to run a specific Spark SQL test with Maven
Re: Re: How to run a specific Spark SQL test with Maven
With Maven you can run a particular test suite like this:

    mvn -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test

See the note (under "Spark Tests in Maven") here: http://spark.apache.org/docs/latest/building-with-maven.html
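If the suite lives behind a Maven profile (as the Hive suites do), the profile flag has to come along. For the suite asked about above, a plausible invocation -- assuming the hive profile and the suite's current package, org.apache.spark.sql.hive.execution -- would be:

    mvn -Phive -DwildcardSuites=org.apache.spark.sql.hive.execution.HiveCompatibilitySuite test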
Re: [brainstorming] Generalization of DStream, a ContinuousRDD?
Heya,

Dunno if these ideas are still in the air or felt in the warp ^^. However, there is a paper on avocado (http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf) that mentions a way of working with their data (sequence reads) in a windowed manner with neither a time nor a timestamp field's value, but a kind of internal index as range delimiter -- thus defining their own exotic continuum and break function.

greetz,
aℕdy ℙetrella
about.me/noootsab

On Thu, Jul 17, 2014 at 1:11 AM, andy petrella <andy.petre...@gmail.com> wrote:

Indeed, these two cases are tightly coupled (the first one is a special case of the second). Actually, these outliers could be handled by a dedicated function, which I named outliersManager -- I was not so much inspired ^^, but we could name these outliers "outlaws" and thus the function would be "sheriff". The purpose of this sheriff function would be to create yet another distributed collection (RDD, CRDD, ...?) with only the outlaws in it. Because these problems have a nature that will be as different as the use cases are, it's hard to find a generic way to tackle them. So, you know... that's why... I put them temporarily in jail and wait for the judge to show them the right path! (Okay, it's late in Belgium -- 1 AM.)

All in all, it's more or less what we would do in DStream as well, actually. Let me expand this reasoning a bit: let's assume that some data points can come along with time but aren't in sync with it -- f.i., a device that wakes up and sends all its data at once. The DStream will package them into RDDs mixed up with truly current data points; however, the logic of the job will have to take a 'Y' road:
* integrate them into a database at the right place
* simply drop them because they won't be part of a shown chart
* etc.
In this case, the 'Y' road would be part of the contract ;-), and so left to the appreciation of the dev. Another way to do it would be to ignore-but-log them, but that would be very crappy, unprofessional and useless (and of course I'm just kidding).

my 0.002¢
aℕdy ℙetrella

On Thu, Jul 17, 2014 at 12:31 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

I think it makes sense, though without a concrete implementation it's hard to be sure. Applying sorting on the RDDs makes sense, but I can think of two kinds of fundamental problems.

1. How do you deal with ordering across RDD boundaries? Say two consecutive RDDs in the DStream have the following record timestamps: RDD 1: [1, 2, 3, 4, 6, 7], RDD 2: [5, 8, 9, 10]. And you want to run a function through all these records in timestamp order. I am curious to find how this problem can be solved without sacrificing efficiency (e.g., I can imagine doing multiple-pass magic).

2. An even more fundamental question is how you ensure ordering with delayed records. If you want to process in order of application time, and records are delayed, how do you deal with them? Any ideas? ;)

TD

On Wed, Jul 16, 2014 at 2:37 AM, andy petrella <andy.petre...@gmail.com> wrote:

Heya TD,

Thanks for the detailed answer! Much appreciated. Regarding order among elements within an RDD, you're definitely right: it'd kill the //ism and would require synchronization, which is completely avoided in a distributed env. That's why I won't push this constraint to the RDDs themselves; only the Space is something that *defines* ordered elements, and thus there are two functions that will break the RDDs based on a given (extensible, pluggable) heuristic, f.i. Since the Space is rather decoupled from the data, and thus from the source and the partitions, it's the responsibility of the CRDD implementation to dictate how (if necessary) the elements should be sorted in the RDDs... which will require some shuffles :-s -- or the couple (source, space) is something intrinsically ordered (like it is for DStream). To be more concrete: an RDD would be composed of an unordered iterator of millions of events for which all timestamps land in the same time interval. WDYT, would that make sense?

thanks again for the answer!
greetz
aℕdy ℙetrella
about.me/noootsab

On Wed, Jul 16, 2014 at 12:33 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

Very interesting ideas Andy! Conceptually I think it makes sense. In fact, it is true that dealing with time-series data, windowing over application time, and windowing over number of events are things that DStream does not natively support. The real challenge is actually mapping the conceptual windows onto the underlying RDD model. One aspect you
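To make TD's first question (ordering across RDD boundaries) concrete, here is a minimal Scala sketch -- assuming an existing SparkContext named sc -- of why a total order across batch boundaries is expensive: the only straightforward answer is to union the batches and sort globally, which forces a shuffle.

    import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

    // Timestamps from TD's example: out of order across the two batches.
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 6, 7))
    val rdd2 = sc.parallelize(Seq(5, 8, 9, 10))

    // Sorting each RDD on its own is not enough (5 belongs between 4 and 6);
    // a global order needs a union plus a full sort, i.e. a shuffle.
    val ordered = rdd1.union(rdd2).map(t => (t, ())).sortByKey().map(_._1)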
Re: How to run a specific Spark SQL test with Maven
> It seems that the HiveCompatibilitySuite needs a Hadoop and Hive environment, am I right?
> Relative path in absolute URI: file:$%7Bsystem:test.tmp.dir%7D/tmp_showcrt1

You should only need Hadoop and Hive if you are creating new tests for which we need to compute the answers. Existing tests are run against cached answers. There are details about the configuration here: https://github.com/apache/spark/tree/master/sql
Interested in contributing to GraphX in Python
Hi,

I just saw Ankur's GraphX presentation and it looks very exciting! I would like to contribute to a Python version of GraphX. I checked JIRA and GitHub but did not find much info.

- Are there currently any limitations to porting GraphX to Python? (e.g., maybe the Python Spark RDD API is incomplete, or not refactored for GraphX compared to the Scala version)
- If I had to start, could I take inspiration from the Scala version and try to emulate it in Python?
- Otherwise, any suggestions of starter tasks regarding GraphX in Python would be appreciated.

Take care,
Rajiv
My Spark application had a huge performance regression after Spark git commit 0441515f221146756800dc583b225bdec8a6c075
I found a huge performance regression (1/20 of original) in my application after Spark git commit 0441515f221146756800dc583b225bdec8a6c075. Applying the following patch fixes my issue:

diff --git a/core/src/main/scala/org/apache/spark/executor/Executor.scala b/core/src/main/scala/org/apache/spark/executor/Executor.scala
index 214a8c8..ebec21d 100644
--- a/core/src/main/scala/org/apache/spark/executor/Executor.scala
+++ b/core/src/main/scala/org/apache/spark/executor/Executor.scala
@@ -145,7 +145,7 @@ private[spark] class Executor(
       }
     }

-    override def run() {
+    override def run(): Unit = SparkHadoopUtil.get.runAsSparkUser { () =>
       val startTime = System.currentTimeMillis()
       SparkEnv.set(env)
       Thread.currentThread.setContextClassLoader(replClassLoader)

With this patch, runAsSparkUser calls UserGroupInformation.doAs() to execute the task and my application runs OK; without it, the performance was very poor. The application hotspot was JNIHandleBlock::alloc_handle (JVM code), with a very high CPI (cycles per instruction; around 1 is OK) of about 10. My application passes large array data (80K length) to native C code through JNI. Why does UserGroupInformation.doAs() impact performance so greatly in this situation?

Thanks,
Zhonghui
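For context, runAsSparkUser roughly does the following (a simplified sketch from memory of the Spark 1.x code, not the exact source):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    def runAsSparkUser(func: () => Unit) {
      // Run the task body inside UserGroupInformation.doAs(), i.e. within
      // a JAAS Subject.doAs security context for the Spark user.
      val user = Option(System.getenv("SPARK_USER")).getOrElse("sparkuser")
      val ugi = UserGroupInformation.createRemoteUser(user)
      ugi.doAs(new PrivilegedExceptionAction[Unit] {
        override def run(): Unit = func()
      })
    }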
Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow
After several days of debugging, we think the issue is that we have conflicting versions of Guava. Our application was running with Guava 14 and the Spark services (Master, Workers, Executors) had Guava 16. We had custom Kryo serializers for Guava's ImmutableLists, and commenting out those register calls did the trick.

Have people had issues with Guava version mismatches in the past? I've found @srowen's Guava 14-to-11 downgrade PR (https://github.com/apache/spark/pull/1610) and some extended discussion on https://issues.apache.org/jira/browse/SPARK-2420 for Hive compatibility.

On Thu, Jul 31, 2014 at 10:47 AM, Andrew Ash <and...@andrewash.com> wrote:

Hi everyone,

I'm seeing the below exception coming out of Spark 1.0.1 when I call it from my application. I can't share the source to that application, but the quick gist is that it uses Spark's Java APIs to read from Avro files in HDFS, do processing, and write back to Avro files. It does this by receiving a REST call, then spinning up a new JVM as the driver application that connects to Spark. I'm using CDH4.4.0 and have enabled Kryo and also speculation. The cluster is running in standalone mode on a 6-node cluster in AWS (not using Spark's EC2 scripts, though). The below stack traces are reliably reproducible on every run of the job.

The issue seems to be that on deserialization of a task result on the driver, Kryo spits up while reading the ClassManifest. I've tried swapping in Kryo 2.23.1 rather than 2.21 (2.22 had some back-compat issues) but had the same error. Any ideas on what can be done here?

Thanks!
Andrew

In the driver (Kryo exception while deserializing a DirectTaskResult):

20:52:52.667 [Result resolver thread-0] ERROR o.a.spark.scheduler.TaskResultGetter - Exception while getting task result
com.esotericsoftware.kryo.KryoException: Buffer underflow.
    at com.esotericsoftware.kryo.io.Input.require(Input.java:156) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:762) ~[kryo-2.21.jar:na]
    at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:624) ~[kryo-2.21.jar:na]
    at com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:26) ~[chill_2.10-0.3.6.jar:0.3.6]
    at com.twitter.chill.ClassManifestSerializer.read(ClassManifestSerializer.scala:19) ~[chill_2.10-0.3.6.jar:0.3.6]
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) ~[kryo-2.21.jar:na]
    at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:147) ~[spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) ~[spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:480) ~[spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:316) ~[spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) [spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) [spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) [spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) [spark-core_2.10-1.0.1.jar:1.0.1]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) [spark-core_2.10-1.0.1.jar:1.0.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]

In the DAGScheduler
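For anyone comparing setups, here is a minimal sketch of the kind of Kryo wiring involved (the registrator class and its fully qualified name are hypothetical, not the application's actual code):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Registering a custom serializer for Guava's ImmutableList is the
        // kind of call that broke across Guava versions; commenting such
        // registrations out (as described above) was the workaround.
        // kryo.register(classOf[com.google.common.collect.ImmutableList[_]],
        //               new OurImmutableListSerializer)  // hypothetical serializer
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")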
Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2
Thanks Patrick -- it does look like some Maven misconfiguration, as

    wget http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom

works for me.

Shivaram

On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell <pwend...@gmail.com> wrote:

This is a Scala bug -- I filed something upstream; hopefully they can fix it soon and/or we can provide a workaround: https://issues.scala-lang.org/browse/SI-8772

- Patrick

On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems; however, if you have it in your ivy cache it should work.

On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

Me 3

On Fri, Aug 1, 2014 at 11:15 AM, nit <nitinp...@gmail.com> wrote:

I also ran into the same issue. What is the solution?
Re: [brainstorming] Generalization of DStream, a ContinuousRDD?
Interesting. Clickstream data would have its own window concept based on a user's session; I can imagine windows changing across streams, but wouldn't they largely be domain-specific in nature?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi (https://twitter.com/mayur_rustagi)

On Fri, Aug 1, 2014 at 9:48 AM, andy petrella <andy.petre...@gmail.com> wrote:
Re: [brainstorming] Generalization of DStream, a ContinuousRDD?
Actually, for clickstream, the user space wouldn't be a continuum, unless the order of users is important or the fact that they come in a kind of order can be used by the algo. The purpose of the break (or binning) function is to package things into a cluster whose properties we know, even though we don't know in advance which or how many elements it will contain. However, this would need the notion of continuum I had in mind to be extended to include categorical spaces, thus allowing a groupBy mapping to RDDs. And actually, there would be a way to fall back to a continuum if the break function were dictated by a trained model that can cluster the users, and they were previously and accordingly shuffled to form a sequence where they come in batches.

Just thinking (and trying hard to use a tablet to write it, man... how unfriendly are this keyboard and small screen ☺)

Cheers,
Andy
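A tiny sketch of that categorical break/binning idea on plain RDDs (the record type and data are made up for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    case class Click(userId: String, url: String, ts: Long)

    val sc = new SparkContext(new SparkConf().setAppName("break-sketch").setMaster("local"))
    val clicks = sc.parallelize(Seq(
      Click("u1", "/a", 1L), Click("u2", "/b", 2L), Click("u1", "/c", 3L)))

    // A categorical break function: bins are user ids, not intervals over
    // an ordered continuum like time, so no ordering assumption is needed.
    val binned = clicks.groupBy(_.userId)  // RDD[(String, Iterable[Click])]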
Re: Exception in Spark 1.0.1: com.esotericsoftware.kryo.KryoException: Buffer underflow
On Fri, Aug 1, 2014 at 2:45 PM, Andrew Ash <and...@andrewash.com> wrote:

> After several days of debugging, we think the issue is that we have conflicting versions of Guava. Our application was running with Guava 14 and the Spark services (Master, Workers, Executors) had Guava 16. We had custom Kryo serializers for Guava's ImmutableLists, and commenting out those register calls did the trick. Have people had issues with Guava version mismatches in the past?

There's some discussion about dealing with Guava version issues in Spark in SPARK-2420.

best,
Colin
SparkContext.hadoopConfiguration vs. SparkHadoopUtil.newConfiguration()
Hi all,

While working on some seemingly unrelated code, I ran into an issue where spark.hadoop.* configs were not making it into the Configuration objects in some parts of the code. I was trying to use them to avoid having to do dirty tricks with the classpath while running tests, but that's a little beside the point.

Since I don't know the history of that code in SparkContext, does anybody see any issue with moving it up a layer, so that all code that uses SparkHadoopUtil.newConfiguration() does the same thing? This would also cover some code (e.g. in the yarn module) that does new Configuration() directly instead of going through the wrapper.

-- Marcelo
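For readers who haven't looked at that code, it roughly does the following (a simplified sketch from memory, not the exact SparkContext source):

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.SparkConf

    def hadoopConfFrom(sparkConf: SparkConf): Configuration = {
      val hadoopConf = new Configuration()
      // Propagate spark.hadoop.foo=bar into the Hadoop conf as foo=bar.
      sparkConf.getAll.foreach { case (key, value) =>
        if (key.startsWith("spark.hadoop.")) {
          hadoopConf.set(key.substring("spark.hadoop.".length), value)
        }
      }
      hadoopConf
    }

Moving this up a layer would mean SparkHadoopUtil.newConfiguration() applies the same propagation everywhere.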
Re: How to run a specific Spark SQL test with Maven
It's also useful to set hive.exec.mode.local.auto to true to speed up the tests.

On Sat, Aug 2, 2014 at 1:36 AM, Michael Armbrust <mich...@databricks.com> wrote:

> You should only need Hadoop and Hive if you are creating new tests for which we need to compute the answers. Existing tests are run against cached answers. There are details about the configuration here: https://github.com/apache/spark/tree/master/sql
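For example -- a sketch assuming you drive the tests through a HiveContext, not necessarily how the suite itself is configured -- the setting can be applied like any other Hive conf variable:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext("local", "hive-compat-tests")
    val hive = new HiveContext(sc)
    hive.hql("SET hive.exec.mode.local.auto=true")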