data locality in spark
Hi guys,

I am running some SQL queries, but all my tasks are reported as either NODE_LOCAL or PROCESS_LOCAL. In the Hadoop world, reduce tasks are rack-local or non-local because they have to aggregate data from multiple hosts. In Spark, however, even the aggregation stages are reported as NODE_LOCAL/PROCESS_LOCAL. Am I missing something, or why are the reduce-like tasks still NODE_LOCAL/PROCESS_LOCAL?

Thanks,
Robert
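(Follow-up note: one way to see exactly which locality level the scheduler assigned to each task is a small SparkListener registered on the driver. The snippet below is only a sketch for the spark-shell, not tested on 1.2.1.)

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print the locality level Spark recorded for every finished task, so the
// shuffle (reduce-like) stages can be compared against the map stages.
val localityListener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    println(s"stage ${taskEnd.stageId} task ${info.taskId}: locality=${info.taskLocality}")
  }
}
sc.addSparkListener(localityListener)   // sc is the spark-shell SparkContext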
Re: counters in spark
Guys, do you have any thoughts on this?

Thanks,
Robert
counters in spark
Hi guys,

I was trying to find some counters in Spark related to the amount of CPU or memory used (in some metric) by a task/stage/job, but I could not find any. Is any such counter available?

Thank you,
Robert
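(Follow-up in case anyone else looks for this: as far as I can tell there is no per-task CPU counter in 1.2, but the listener API does expose run time and spill sizes through TaskMetrics. A rough, untested sketch for the spark-shell, with field names taken from the 1.x TaskMetrics class:)

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Dump per-task run time and memory/disk spill from TaskMetrics as tasks finish.
val metricsListener = new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      println(s"stage ${taskEnd.stageId} task ${taskEnd.taskInfo.taskId}: " +
        s"runTime=${m.executorRunTime}ms " +
        s"memorySpilled=${m.memoryBytesSpilled}B diskSpilled=${m.diskBytesSpilled}B")
    }
  }
}
sc.addSparkListener(metricsListener)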
question regarding the dependency DAG in Spark
Hi guys,

I am trying to get a better understanding of how the DAG for a job is generated in Spark. Ideally, I want to run some SQL query and extract the DAG that Spark generates. By DAG I mean the stages, the dependencies among stages, and the number of tasks in every stage. Could you point me to the code where that happens?

Thank you,
Robert
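(For reference: stage construction happens in org.apache.spark.scheduler.DAGScheduler, which walks the RDD lineage and starts a new stage at every shuffle dependency. To dump what it produces at runtime, something like the following sketch should work from the spark-shell; someRdd is only a placeholder for whatever RDD backs the query.)

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// Print every stage as the DAGScheduler submits it, together with its task count.
val dagListener = new SparkListener {
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    val s = stageSubmitted.stageInfo
    println(s"stage ${s.stageId} (${s.name}): ${s.numTasks} tasks")
  }
}
sc.addSparkListener(dagListener)

// The lineage (with shuffle boundaries shown by indentation) can also be printed
// directly from any RDD; someRdd is hypothetical here.
println(someRdd.toDebugString)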
run spark standalone mode
Hi guys,

I have a stupid question, but I am not sure how to get out of this. I deployed Spark 1.2.1 on a cluster of 30 nodes. Looking at master:8088 I can see all the workers I have created so far (I start the cluster with sbin/start-all.sh).

However, when running a Spark SQL query or even spark-shell, I cannot see any job executing in the master web UI, but the jobs are able to finish. I suspect they are executing locally on the master, but I don't understand why/how, and why not on the slave machines.

My conf/spark-env.sh is as follows:

export SPARK_MASTER_IP=ms0220
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/users/rgrandl/software/spark-1.2.1-bin-hadoop2.4/lib/snappy-java-1.0.4.1.jar
export SPARK_LOCAL_DIRS=/users/rgrandl/software/data/spark/local
export SPARK_WORKER_MEMORY=52000M
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_DIR=/users/rgrandl/software/data/spark/worker
export SPARK_DAEMON_MEMORY=5200M
#export SPARK_DAEMON_JAVA_OPTS=4800M

conf/slaves is populated with the list of machines used as workers. I should mention that the spark-env.sh and slaves files are deployed on all machines.

Thank you,
Robert
Re: run spark standalone mode
Sorry guys for this. It seems that I needed to start the thrift server with the --master spark://ms0220:7077 option, and now I can see applications running in the web UI.

Thanks,
Robert
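(For anyone hitting the same thing, the commands look roughly like this on my setup; the host and port are from my cluster and only illustrative:)

# start the Thrift JDBC server against the standalone master instead of local mode
./sbin/start-thriftserver.sh --master spark://ms0220:7077

# then point beeline at it as before (10000 is the default HiveServer2 port;
# use whatever port your server actually listens on)
./bin/beeline -u jdbc:hive2://ms0220:10000 -n `whoami` -p ignored -f tpch_query10.sql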
Re: run spark standalone mode
I figured it out for spark-shell by passing the --master option. However, I am still troubleshooting how to launch SQL queries. My current command is:

./bin/beeline -u jdbc:hive2://ms0220:1 -n `whoami` -p ignored -f tpch_query10.sql
Spark SQL using Hive metastore
Hi guys,

I am a newbie at running Spark SQL / Spark. My goal is to run some TPC-H queries on top of Spark SQL using the Hive metastore. It looks like the Spark 1.2.1 release has Spark SQL / Hive support. However, I am not able to fully connect all the dots. I did the following:

1. Copied hive-site.xml from hive to spark/conf
2. Copied the mysql connector to spark/lib
3. Started the hive metastore service: hive --service metastore
4. Started ./bin/spark-sql
5. At the spark-sql prompt, typed: show tables;

However, the following error was thrown:

Job 0 failed: collect at SparkPlan.scala:84, took 0.241788 s
15/03/11 15:02:35 ERROR SparkSQLDriver: Failed in [show tables]
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] no native library is found for os.name=Linux and os.arch=aarch64

Do you know what I am doing wrong? I should mention that I have hive-0.14 instead of hive-0.13. And another question: what is the right command to run SQL queries with Spark SQL using the Hive metastore?

Thanks,
Robert
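(In case someone else trips over this: the SnappyError suggests the bundled snappy-java has no native library for aarch64. One workaround I would try, though I have not verified it on this exact setup, is to switch Spark's internal compression codec away from snappy in conf/spark-defaults.conf:)

# conf/spark-defaults.conf -- avoid the native snappy dependency on aarch64
spark.io.compression.codec   lzf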
shark queries failed
Hi guys,

I deployed BlinkDB (built atop Shark), running on Spark 0.9. I tried to run several TPC-DS Shark queries taken from https://github.com/cloudera/impala-tpcds-kit/tree/master/queries-sql92-modified/queries/shark. However, the following exceptions are encountered. Do you have any idea why that might happen?

Thanks,
Robert

2015-02-14 17:58:29,358 WARN util.NativeCodeLoader (NativeCodeLoader.java:clinit(52)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-02-14 17:58:29,360 WARN snappy.LoadSnappy (LoadSnappy.java:clinit(46)) - Snappy native library not loaded
2015-02-14 17:58:34,963 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 6 (task 5.0:2)
2015-02-14 17:58:34,970 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Loss was due to java.lang.ClassCastException
java.lang.ClassCastException: org.apache.hadoop.io.NullWritable cannot be cast to org.apache.hadoop.io.FloatWritable
        at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableFloatObjectInspector.get(WritableFloatObjectInspector.java:35)
        at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:331)
        at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serializeStruct(LazyBinarySerDe.java:257)
        at org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe.serialize(LazyBinarySerDe.java:204)
        at shark.execution.ReduceSinkOperator$$anonfun$processPartitionNoDistinct$1.apply(ReduceSinkOperator.scala:188)
        at shark.execution.ReduceSinkOperator$$anonfun$processPartitionNoDistinct$1.apply(ReduceSinkOperator.scala:153)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.Task.run(Task.scala:53)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
2015-02-14 17:58:34,983 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 8 (task 5.0:4)
2015-02-14 17:58:35,075 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 12 (task 5.0:8)
2015-02-14 17:58:35,119 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 15 (task 5.0:2)
2015-02-14 17:58:35,134 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 9 (task 5.0:5)
2015-02-14 17:58:35,187 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 16 (task 5.0:4)
2015-02-14 17:58:35,203 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 11 (task 5.0:7)
2015-02-14 17:58:35,214 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 13 (task 5.0:9)
2015-02-14 17:58:35,265 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 4 (task 5.0:0)
2015-02-14 17:58:35,274 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 18 (task 5.0:2)
2015-02-14 17:58:35,304 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 17 (task 5.0:8)
2015-02-14 17:58:35,330 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 5 (task 5.0:1)
2015-02-14 17:58:35,354 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 20 (task 5.0:4)
2015-02-14 17:58:35,387 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 19 (task 5.0:5)
2015-02-14 17:58:35,430 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 7 (task 5.0:3)
2015-02-14 17:58:35,432 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 24 (task 5.0:2)
2015-02-14 17:58:35,433 ERROR scheduler.TaskSetManager (Logging.scala:logError(65)) - Task 5.0:2 failed 4 times; aborting job
2015-02-14 17:58:35,438 ERROR ql.Driver (SessionState.java:printError(400)) - FAILED: Execution Error, return code -101 from shark.execution.SparkTask
2015-02-14 17:58:35,552 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Lost TID 30 (task 6.0:0)
2015-02-14 17:58:35,565 WARN scheduler.TaskSetManager (Logging.scala:logWarning(61)) - Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: http://10.200.146.12:46812/broadcast_4 at
Re: shark queries failed
Thanks for the reply, Akhil. I cannot update the Spark version and run Spark SQL, due to some old dependencies and a specific project I want to run. I was wondering if you have any clue why that exception might be triggered, or if you have seen it before.

Thanks,
Robert

On Sunday, February 15, 2015 9:18 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
I'd suggest updating your Spark to the latest version and trying Spark SQL instead of Shark.

Thanks
Best Regards
Spark standalone and HDFS 2.6
Hi guys,

Probably a dumb question, but do you know how to compile Spark 0.9 so it integrates easily with HDFS 2.6.0? I was trying

sbt/sbt -Pyarn -Phadoop-2.6 assembly

or

mvn -Dhadoop.version=2.6.0 -DskipTests clean package

but neither of these approaches succeeded.

Thanks,
Robert
Re: Spark standalone and HDFS 2.6
I am trying to run BlinkDB (https://github.com/sameeragarwal/blinkdb), which seems to work only with Spark 0.9. However, if I want to access HDFS I need to compile Spark against the Hadoop version running on my cluster (2.6.0). Hence the versions problem...

On Friday, February 13, 2015 11:28 AM, Sean Owen so...@cloudera.com wrote:
Oh right, you said Spark 0.9. Those profiles didn't exist back then. I don't even know if Hadoop 2.6 will work with 0.9 as-is. The profiles were introduced later to fix up some compatibility. Why not use 1.2.1?
Re: Spark standalone and HDFS 2.6
Thanks Sean for your prompt response.

I was trying to compile as follows:

mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package

but I got a bunch of errors (see below). Hadoop 2.6.0 compiled correctly, and all the hadoop jars are in the .m2 repository. Do you have any idea what might be happening?

Robert

[WARNING] Class com.google.protobuf.Parser not found - continuing with a stub.
[ERROR] error while loading RpcResponseHeaderProto, class file '/home/rgrandl/.m2/repository/org/apache/hadoop/hadoop-common/2.6.0/hadoop-common-2.6.0.jar(org/apache/hadoop/ipc/protobuf/RpcHeaderProtos$RpcResponseHeaderProto.class)' is broken (class java.lang.NullPointerException/null)
[WARNING] one warning found
[ERROR] one error found
[INFO] Reactor Summary:
[INFO] Spark Project Parent POM .......................... SUCCESS [2.537s]
[INFO] Spark Project Core ................................ FAILURE [25.917s]
[INFO] Spark Project Bagel ............................... SKIPPED
[INFO] Spark Project GraphX .............................. SKIPPED
[INFO] Spark Project ML Library .......................... SKIPPED
[INFO] Spark Project Streaming ........................... SKIPPED
[INFO] Spark Project Tools ............................... SKIPPED
[INFO] Spark Project REPL ................................ SKIPPED
[INFO] Spark Project Assembly ............................ SKIPPED
[INFO] Spark Project External Twitter .................... SKIPPED
[INFO] Spark Project External Kafka ...................... SKIPPED
[INFO] Spark Project External Flume ...................... SKIPPED
[INFO] Spark Project External ZeroMQ ..................... SKIPPED
[INFO] Spark Project External MQTT ....................... SKIPPED
[INFO] Spark Project Examples ............................ SKIPPED
[INFO] BUILD FAILURE
[INFO] Total time: 30.002s
[INFO] Finished at: Fri Feb 13 11:21:36 PST 2015
[INFO] Final Memory: 49M/1226M
[WARNING] The requested profile "hadoop-2.4" could not be activated because it does not exist.
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile (scala-compile-first) on project spark-core_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.1.5:compile failed. CompileFailed -> [Help 1]

On Friday, February 13, 2015 11:16 AM, Sean Owen so...@cloudera.com wrote:
If you just need standalone mode, you don't need -Pyarn. There is no -Phadoop-2.6; you should use -Phadoop-2.4 for 2.4+. Yes, set -Dhadoop.version=2.6.0. That should be it. If that still doesn't work, define "doesn't succeed".
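(For the record, the 0.9-era build did not use the -Phadoop-2.x profiles at all; the Hadoop version was passed to sbt through an environment variable. Something along these lines is what I would expect to have to try, with no guarantee that 2.6.0 actually links against 0.9, per Sean's earlier comment:)

# Spark 0.9-style sbt build: the Hadoop version comes from an environment variable
SPARK_HADOOP_VERSION=2.6.0 sbt/sbt assembly

# if YARN support were needed as well (not required for standalone mode + HDFS):
# SPARK_HADOOP_VERSION=2.6.0 SPARK_YARN=true sbt/sbt assembly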