I have been able to submit a job successfully, but I had to configure my Spark job this way:

  val sparkConf: SparkConf = new SparkConf()
    .setAppName("TwitterPopularTags")
    .setMaster("spark://int-spark-master:7077")
    .setSparkHome("/opt/spark")
    .setJars(Seq("/tmp/spark-test-0.1-SNAPSHOT.jar"))

Now I'm getting this error on my worker:

  14/06/17 17:03:40 WARN TaskSchedulerImpl: Initial job has not accepted any
  resources; check your cluster UI to ensure that workers are registered and
  have sufficient memory
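For what it's worth, that TaskSchedulerImpl warning usually means the application is asking for more cores or memory than any registered worker currently has free. A minimal sketch of one workaround, assuming a standalone cluster and the spark.cores.max / spark.executor.memory properties (the values below are placeholders, not tested against this setup), is to cap the request in the same SparkConf:

  // Cap what the application requests so the standalone scheduler can place it.
  // Values are placeholders; keep them below the workers' free resources.
  val sparkConf: SparkConf = new SparkConf()
    .setAppName("TwitterPopularTags")
    .setMaster("spark://int-spark-master:7077")
    .setSparkHome("/opt/spark")
    .setJars(Seq("/tmp/spark-test-0.1-SNAPSHOT.jar"))
    .set("spark.cores.max", "2")          // total cores the app may claim
    .set("spark.executor.memory", "512m") // per-executor memory request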
>> >> exclude("com.esotericsoftware.minlog", "minlog") >> >> ) >> >> mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) => >> >> { >> case x if x.startsWith("META-INF/ECLIPSEF.RSA") => MergeStrategy.last >> >> case x if x.startsWith("META-INF/mailcap") => MergeStrategy.last >> >> case x if x.startsWith("plugin.properties") => MergeStrategy.last >> >> case x => old(x) >> >> } >> } >> >> >> You can see the "exclude()" has to go around the spark-streaming-kafka >> dependency, >> and I've used a MergeStrategy to solve the "deduplicate: different file >> contents found in the following" errors. >> >> Build the JAR with sbt assembly and use the scripts in bin/ to run the >> examples. >> >> I'm using this same approach to run my Spark Streaming jobs with >> spark-submit and have them managed using Mesos/Marathon >> <http://mesosphere.io/> to handle failures and restarts with long >> running processes. >> >> Good luck! >> >> MC >> >> >> >> >> >> *Michael Cutler* >> Founder, CTO >> >> >> * Mobile: +44 789 990 7847 Email: mich...@tumra.com <mich...@tumra.com> >> Web: tumra.com >> <http://tumra.com/?utm_source=signature&utm_medium=email> * >> *Visit us at our offices in Chiswick Park <http://goo.gl/maps/abBxq>* >> *Registered in England & Wales, 07916412. VAT No. 130595328* >> >> >> This email and any files transmitted with it are confidential and may >> also be privileged. It is intended only for the person to whom it is >> addressed. If you have received this email in error, please inform the >> sender immediately. If you are not the intended recipient you must not >> use, disclose, copy, print, distribute or rely on this email. >> >> >> On 17 June 2014 02:51, Gino Bustelo <lbust...@gmail.com> wrote: >> >>> +1 for this issue. Documentation for spark-submit are misleading. Among >>> many issues, the jar support is bad. HTTP urls do not work. This is because >>> spark is using hadoop's FileSystem class. You have to specify the jars >>> twice to get things to work. Once for the DriverWrapper to laid your >>> classes and a 2nd time in the Context to distribute to workers. >>> >>> I would like to see some contrib response to this issue. >>> >>> Gino B. >>> >>> On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez < >>> langel.gro...@gmail.com> wrote: >>> >>> Did you manage to make it work? I'm facing similar problems and this a >>> serious blocker issue. spark-submit seems kind of broken to me if you can >>> use it for spark-streaming. >>> >>> Regards, >>> >>> Luis >>> >>> >>> 2014-06-11 1:48 GMT+01:00 lannyripple <lanny.rip...@gmail.com>: >>> >>>> I am using Spark 1.0.0 compiled with Hadoop 1.2.1. >>>> >>>> I have a toy spark-streaming-kafka program. It reads from a kafka >>>> queue and >>>> does >>>> >>>> stream >>>> .map {case (k, v) => (v, 1)} >>>> .reduceByKey(_ + _) >>>> .print() >>>> >>>> using a 1 second interval on the stream. >>>> >>>> The docs say to make Spark and Hadoop jars 'provided' but this breaks >>>> for >>>> spark-streaming. Including spark-streaming (and spark-streaming-kafka) >>>> as >>>> 'compile' to sweep them into our assembly gives collisions on javax.* >>>> classes. To work around this I modified >>>> $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming, >>>> spark-streaming-kafka, and zkclient. (Note that kafka is included as >>>> 'compile' in my project and picked up in the assembly.) >>>> >>>> I have set up conf/spark-env.sh as needed. I have copied my assembly to >>>> /tmp/myjar.jar on all spark hosts and to my hdfs /tmp/jars directory. 
>> On 17 June 2014 02:51, Gino Bustelo <lbust...@gmail.com> wrote:
>>
>>> +1 for this issue. The documentation for spark-submit is misleading.
>>> Among many issues, the jar support is bad: HTTP URLs do not work, because
>>> Spark is using Hadoop's FileSystem class. You have to specify the jars
>>> twice to get things to work: once for the DriverWrapper to load your
>>> classes, and a second time on the context to distribute them to the
>>> workers.
>>>
>>> I would like to see some contributor response to this issue.
>>>
>>> Gino B.
>>>
>>> On Jun 16, 2014, at 1:49 PM, Luis Ángel Vicente Sánchez
>>> <langel.gro...@gmail.com> wrote:
>>>
>>> Did you manage to make it work? I'm facing similar problems, and this is
>>> a serious blocker issue. spark-submit seems kind of broken to me if you
>>> can't use it for spark-streaming.
>>>
>>> Regards,
>>>
>>> Luis
>>>
>>> 2014-06-11 1:48 GMT+01:00 lannyripple <lanny.rip...@gmail.com>:
>>>
>>>> I am using Spark 1.0.0 compiled with Hadoop 1.2.1.
>>>>
>>>> I have a toy spark-streaming-kafka program. It reads from a Kafka queue
>>>> and does
>>>>
>>>>   stream
>>>>     .map { case (k, v) => (v, 1) }
>>>>     .reduceByKey(_ + _)
>>>>     .print()
>>>>
>>>> using a 1-second batch interval on the stream.
>>>>
>>>> The docs say to make the Spark and Hadoop jars 'provided', but this
>>>> breaks for spark-streaming. Including spark-streaming (and
>>>> spark-streaming-kafka) as 'compile' to sweep them into our assembly
>>>> gives collisions on javax.* classes. To work around this I modified
>>>> $SPARK_HOME/bin/compute-classpath.sh to include spark-streaming,
>>>> spark-streaming-kafka, and zkclient. (Note that kafka is included as
>>>> 'compile' in my project and picked up in the assembly.)
>>>>
>>>> I have set up conf/spark-env.sh as needed. I have copied my assembly to
>>>> /tmp/myjar.jar on all spark hosts and to my HDFS /tmp/jars directory.
>>>> I am running spark-submit from my spark master, guided by the
>>>> information here:
>>>> https://spark.apache.org/docs/latest/submitting-applications.html
>>>>
>>>> Well, at this point I was going to detail all the ways spark-submit
>>>> fails to follow its own documentation. If I do not invoke
>>>> sparkContext.setJars() then it simply fails to find the driver class.
>>>> This is using various combinations of absolute path, file:, hdfs:
>>>> ("Warning: Skip remote jar"?), and local: prefixes on the
>>>> application-jar and --jars arguments.
>>>>
>>>> If I invoke sparkContext.setJars() and include my assembly jar, I get
>>>> further. At this point I get a failure from
>>>> kafka.consumer.ConsumerConnector not being found. I suspect this is
>>>> because spark-streaming-kafka needs the Kafka dependency, but my
>>>> assembly jar is too late in the classpath.
>>>>
>>>> At this point I tried setting spark.files.userClassPathFirst to 'true',
>>>> but this causes more things to blow up.
>>>>
>>>> I finally found something that works, namely setting the environment
>>>> variable SPARK_CLASSPATH=/tmp/myjar.jar. But silly me, this is
>>>> deprecated, and I'm helpfully informed to
>>>>
>>>>   Please instead use:
>>>>    - ./spark-submit with --driver-class-path to augment the driver classpath
>>>>    - spark.executor.extraClassPath to augment the executor classpath
>>>>
>>>> which, when put into a file and introduced with --properties-file, does
>>>> not work. (I also tried spark.files.userClassPathFirst here.) These
>>>> fail with the kafka.consumer.ConsumerConnector error.
>>>>
>>>> At a guess, what's going on is that using SPARK_CLASSPATH I have my
>>>> assembly jar in the classpath at SparkSubmit invocation:
>>>>
>>>>   Spark Command: java -cp /tmp/myjar.jar::/opt/spark/conf:/opt/spark/lib/spark-assembly-1.0.0-hadoop1.2.1.jar:/opt/spark/lib/spark-streaming_2.10-1.0.0.jar:/opt/spark/lib/spark-streaming-kafka_2.10-1.0.0.jar:/opt/spark/lib/zkclient-0.4.jar -XX:MaxPermSize=128m -Djava.library.path= -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class me.KafkaStreamingWC /tmp/myjar.jar
>>>>
>>>> but using --properties-file, the assembly is not available for
>>>> SparkSubmit.
>>>>
>>>> I think the root cause is either spark-submit not handling the
>>>> spark-streaming libraries so they can be 'provided', or the inclusion
>>>> of org.eclipse.jetty.orbit in the streaming libraries, which causes
>>>>
>>>>   [error] (*:assembly) deduplicate: different file contents found in the following:
>>>>   [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.transaction/orbits/javax.transaction-1.1.1.v201105210645.jar:META-INF/ECLIPSEF.RSA
>>>>   [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.servlet/orbits/javax.servlet-3.0.0.v201112011016.jar:META-INF/ECLIPSEF.RSA
>>>>   [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.mail.glassfish/orbits/javax.mail.glassfish-1.4.1.v201005082020.jar:META-INF/ECLIPSEF.RSA
>>>>   [error] /Users/lanny/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA
>>>>
>>>> I've tried applying mergeStrategy in assembly for my assembly.sbt, but
>>>> then I get
>>>>
>>>>   Invalid signature file digest for Manifest main attributes
>>>>
>>>> If anyone knows the magic to get this working, a reply would be greatly
>>>> appreciated.
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-kafka-SPARK-CLASSPATH-tp7356.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
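A note on the "Invalid signature file digest for Manifest main attributes" error above: it typically means signed jars survived into the fat JAR, so the JVM verifies signatures that no longer match the merged contents. A sketch of the usual sbt-assembly remedy, written in the same old-style mergeStrategy syntax as the build.sbt above (untested against this exact build), is to discard the META-INF signature files rather than keep one copy of them:

  mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
    {
      // Signed-jar metadata cannot stay valid after merging, so drop it.
      case x if x.startsWith("META-INF/") &&
                (x.endsWith(".SF") || x.endsWith(".DSA") || x.endsWith(".RSA")) =>
        MergeStrategy.discard
      case x => old(x)
    }
  }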
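Finally, for anyone reproducing the toy job described in the last message: a minimal self-contained version, assuming Spark 1.0.0's KafkaUtils.createStream(ssc, zkQuorum, groupId, topics) API (the ZooKeeper address, consumer group and topic map are placeholders), might look like this:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  // In Spark 1.0 the pair-DStream operations (reduceByKey) need this import
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.kafka.KafkaUtils

  object KafkaStreamingWC {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("KafkaStreamingWC")
      // 1-second batch interval, as in the original description
      val ssc = new StreamingContext(conf, Seconds(1))

      // ZooKeeper quorum, consumer group and topic map are placeholders
      val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "wc-group", Map("my-topic" -> 1))

      stream
        .map { case (k, v) => (v, 1) }
        .reduceByKey(_ + _)
        .print()

      ssc.start()
      ssc.awaitTermination()
    }
  }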