Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
When you say the driver is running on Mesos, can you explain how you are doing that?

> On Mar 10, 2016, at 4:44 PM, Eran Chinthaka Withana 
>  wrote:
> 
> Yanling I'm already running the driver on mesos (through docker). FYI, I'm 
> running this on cluster mode with MesosClusterDispatcher.
> 
> Mac (client) > MesosClusterDispatcher > Driver running on Mesos --> 
> Workers running on Mesos
> 
> My next step is to run MesosClusterDispatcher in mesos through marathon. 
> 
> Thanks,
> Eran Chinthaka Withana
> 
>> On Thu, Mar 10, 2016 at 11:11 AM, yanlin wang  wrote:
>> How do you make the driver container reachable from the Spark workers?
>> 
>> Would you share your driver Dockerfile? I am trying to put only the driver in Docker, 
>> with Spark running on YARN outside of a container, and I don't want to use 
>> --net=host.
>> 
>> Thx
>> Yanlin
>> 
>>> On Mar 10, 2016, at 11:06 AM, Guillaume Eynard Bontemps 
>>>  wrote:
>>> 
>>> Glad to hear it. Thanks all  for sharing your  solutions.
>>> 
>>> 
>>> On Thu, Mar 10, 2016 at 19:19, Eran Chinthaka Withana 
>>>  wrote:
 Phew, it worked. All I had to do was to add export 
 SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6"
  before calling spark-submit. Guillaume, thanks for the pointer. 
 
 Timothy, thanks for looking into this. Looking forward to seeing a fix soon. 
 
 Thanks,
 Eran Chinthaka Withana
 
> On Thu, Mar 10, 2016 at 10:10 AM, Tim Chen  wrote:
> Hi Eran,
> 
> I need to investigate, but perhaps that's true: we're using 
> SPARK_JAVA_OPTS to pass all the options and not --conf.
> 
> I'll take a look at the bug, but in the meantime can you try the workaround 
> and see if that fixes your problem?
> 
> Tim
> 
>> On Thu, Mar 10, 2016 at 10:08 AM, Eran Chinthaka Withana 
>>  wrote:
>> Hi Timothy
>> 
>>> What version of spark are you guys running?
>> 
>> I'm using Spark 1.6.0. You can see the Dockerfile I used here: 
>> https://github.com/echinthaka/spark-mesos-docker/blob/master/docker/mesos-spark/Dockerfile
>>  
>>  
>>> And also did you set the working dir in your image to be spark home?
>> 
>> Yes I did. You can see it here: https://goo.gl/8PxtV8
>> 
>> Can it be because of this: 
>> https://issues.apache.org/jira/browse/SPARK-13258 as Guillaume pointed 
>> out above? As you can see, I'm passing in the docker image URI through 
>> spark-submit (--conf 
>> spark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6)
>> 
>> Thanks,
>> Eran
> 


Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
Hi Tim ,

Can you please share your Dockerfiles and configuration, as it will help a
lot? I am planning to publish a blog post on the same topic.

Ashish

On Thu, Mar 10, 2016 at 10:34 AM, Timothy Chen <t...@mesosphere.io> wrote:

> No, you don't need to install Spark on each slave; we have been running
> this setup at Mesosphere without any problem so far. I think it is most
> likely a configuration problem, though there is a chance something is missing
> in the code to handle some cases.
>
> What version of spark are you guys running? And also did you set the
> working dir in your image to be spark home?
>
> Tim
>
>
> On Mar 10, 2016, at 3:11 AM, Ashish Soni <asoni.le...@gmail.com> wrote:
>
> You need to install spark on each mesos slave and then while starting
> container make a workdir to your spark home so that it can find the spark
> class.
>
> Ashish
>
> On Mar 10, 2016, at 5:22 AM, Guillaume Eynard Bontemps <
> g.eynard.bonte...@gmail.com> wrote:
>
> For an answer to my question see this:
> http://stackoverflow.com/a/35660466?noredirect=1.
>
> But for your problem, did you define the spark.mesos.docker home property or
> something like that?
>
> On Thu, Mar 10, 2016 at 04:26, Eran Chinthaka Withana <
> eran.chinth...@gmail.com> wrote:
>
>> Hi
>>
>> I'm also having this issue and can not get the tasks to work inside mesos.
>>
>> In my case, the spark-submit command is the following.
>>
>> $SPARK_HOME/bin/spark-submit \
>>  --class com.mycompany.SparkStarter \
>>  --master mesos://mesos-dispatcher:7077 \
>>  --name SparkStarterJob \
>> --driver-memory 1G \
>>  --executor-memory 4G \
>> --deploy-mode cluster \
>>  --total-executor-cores 1 \
>>  --conf 
>> spark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6 \
>>  http://abc.com/spark-starter.jar
>>
>>
>> And the error I'm getting is the following.
>>
>> I0310 03:13:11.417009 131594 exec.cpp:132] Version: 0.23.1
>> I0310 03:13:11.419452 131601 exec.cpp:206] Executor registered on slave 
>> 20160223-000314-3439362570-5050-631-S0
>> sh: 1: /usr/spark-1.6.0-bin-hadoop2.6/bin/spark-class: not found
>>
>>
>> (Looked into Spark JIRA and I found that
>> https://issues.apache.org/jira/browse/SPARK-11759 is marked as closed
>> since https://issues.apache.org/jira/browse/SPARK-12345 is marked as
>> resolved)
>>
>> Really appreciate if I can get some help here.
>>
>> Thanks,
>> Eran Chinthaka Withana
>>
>> On Wed, Feb 17, 2016 at 2:00 PM, g.eynard.bonte...@gmail.com <
>> g.eynard.bonte...@gmail.com> wrote:
>>
>>> Hi everybody,
>>>
>>> I am testing the use of Docker for executing Spark algorithms on MESOS. I
>>> managed to execute Spark in client mode with executors inside Docker,
>>> but I
>>> wanted to go further and have also my Driver running into a Docker
>>> Container. Here I ran into a behavior that I'm not sure is normal, let me
>>> try to explain.
>>>
>>> I submit my spark application through MesosClusterDispatcher using a
>>> command
>>> like:
>>> $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> mesos://spark-master-1:7077 --deploy-mode cluster --conf
>>> spark.mesos.executor.docker.image=myuser/myimage:0.0.2
>>>
>>> https://storage.googleapis.com/some-bucket/spark-examples-1.5.2-hadoop2.6.0.jar
>>> 10
>>>
>>> My driver is running fine, inside its docker container, but the executors
>>> fail:
>>> "sh: /some/spark/home/bin/spark-class: No such file or directory"
>>>
>>> Looking on MESOS slaves log, I think that the executors do not run inside
>>> docker: "docker.cpp:775] No container info found, skipping launch". As my
>>> Mesos slaves do not have spark installed, it fails.
>>>
>>> *It seems that the spark conf that I gave in the first spark-submit is
>>> not
>>> transmitted to the Driver submitted conf*, when launched in the docker
>>> container. The only workaround I found is to modify my Docker image in
>>> order
>>> to define inside its spark conf the spark.mesos.executor.docker.image
>>> property. This way, my executors get the conf well and are launched
>>> inside
>>> docker on Mesos. This seems a little complicated to me, and I feel the
>>> configuration passed to the early spark-submit should be transmitted to
>>> the
>>> Driver submit...
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-mixing-MESOS-Cluster-Mode-and-Docker-task-execution-tp26258.html
>>>
>>>
>>


Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ashish Soni
You need to install spark on each mesos slave and then while starting container 
make a workdir to your spark home so that it can find the spark class.

Ashish

> On Mar 10, 2016, at 5:22 AM, Guillaume Eynard Bontemps 
>  wrote:
> 
> For an answer to my question see this: 
> http://stackoverflow.com/a/35660466?noredirect=1.
> 
> But for your problem, did you define the spark.mesos.docker home property or 
> something like that?
> 
> 
> On Thu, Mar 10, 2016 at 04:26, Eran Chinthaka Withana  
> wrote:
>> Hi
>> 
>> I'm also having this issue and can not get the tasks to work inside mesos.
>> 
>> In my case, the spark-submit command is the following.
>> 
>> $SPARK_HOME/bin/spark-submit \
>>  --class com.mycompany.SparkStarter \
>>  --master mesos://mesos-dispatcher:7077 \
>>  --name SparkStarterJob \
>> --driver-memory 1G \
>>  --executor-memory 4G \
>> --deploy-mode cluster \
>>  --total-executor-cores 1 \
>>  --conf 
>> spark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6 \
>>  http://abc.com/spark-starter.jar
>> 
>> And the error I'm getting is the following.
>> 
>> I0310 03:13:11.417009 131594 exec.cpp:132] Version: 0.23.1
>> I0310 03:13:11.419452 131601 exec.cpp:206] Executor registered on slave 
>> 20160223-000314-3439362570-5050-631-S0
>> sh: 1: /usr/spark-1.6.0-bin-hadoop2.6/bin/spark-class: not found
>> 
>> (Looked into Spark JIRA and I found that 
>> https://issues.apache.org/jira/browse/SPARK-11759 is marked as closed since 
>> https://issues.apache.org/jira/browse/SPARK-12345 is marked as resolved)
>> 
>> Really appreciate if I can get some help here.
>> 
>> Thanks,
>> Eran Chinthaka Withana
>> 
>>> On Wed, Feb 17, 2016 at 2:00 PM, g.eynard.bonte...@gmail.com 
>>>  wrote:
>>> Hi everybody,
>>> 
>>> I am testing the use of Docker for executing Spark algorithms on MESOS. I
>>> managed to execute Spark in client mode with executors inside Docker, but I
>>> wanted to go further and have also my Driver running into a Docker
>>> Container. Here I ran into a behavior that I'm not sure is normal, let me
>>> try to explain.
>>> 
>>> I submit my spark application through MesosClusterDispatcher using a command
>>> like:
>>> $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
>>> mesos://spark-master-1:7077 --deploy-mode cluster --conf
>>> spark.mesos.executor.docker.image=myuser/myimage:0.0.2
>>> https://storage.googleapis.com/some-bucket/spark-examples-1.5.2-hadoop2.6.0.jar
>>> 10
>>> 
>>> My driver is running fine, inside its docker container, but the executors
>>> fail:
>>> "sh: /some/spark/home/bin/spark-class: No such file or directory"
>>> 
>>> Looking on MESOS slaves log, I think that the executors do not run inside
>>> docker: "docker.cpp:775] No container info found, skipping launch". As my
>>> Mesos slaves do not have spark installed, it fails.
>>> 
>>> *It seems that the spark conf that I gave in the first spark-submit is not
>>> transmitted to the Driver submitted conf*, when launched in the docker
>>> container. The only workaround I found is to modify my Docker image in order
>>> to define inside its spark conf the spark.mesos.executor.docker.image
>>> property. This way, my executors get the conf well and are launched inside
>>> docker on Mesos. This seems a little complicated to me, and I feel the
>>> configuration passed to the early spark-submit should be transmitted to the
>>> Driver submit...
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-mixing-MESOS-Cluster-Mode-and-Docker-task-execution-tp26258.html


Re: Spark 1.5 on Mesos

2016-03-04 Thread Ashish Soni
It did not help; same error. Is this the issue I am running into:
https://issues.apache.org/jira/browse/SPARK-11638

*Warning: Local jar /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar
does not exist, skipping.*
java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi

On Thu, Mar 3, 2016 at 4:12 PM, Tim Chen <t...@mesosphere.io> wrote:

> Ah I see, I think it's because you've launched the Mesos slave in a docker
> container, and when you also launch the executor in a container it's not
> able to mount the sandbox into the other container since the slave is in a
> chroot.
>
> Can you try mounting in a volume from the host when you launch the slave
> for your slave's workdir?
> docker run -v /tmp/mesos/slave:/tmp/mesos/slave mesos_image mesos-slave
> --work_dir=/tmp/mesos/slave 
>
> Tim
>
> On Thu, Mar 3, 2016 at 4:42 AM, Ashish Soni <asoni.le...@gmail.com> wrote:
>
>> Hi Tim ,
>>
>>
>> I think I know the problem but I do not have a solution. *The Mesos
>> slave is supposed to download the jars from the specified URI and place them in
>> the $MESOS_SANDBOX location, but it is not downloading them; not sure why* .. see
>> the logs below
>>
>> My command looks like below
>>
>> docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
>> SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
>> --deploy-mode cluster --class org.apache.spark.examples.SparkPi
>> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>>
>> [root@Mindstorm spark-1.6.0]# docker logs d22d8e897b79
>> *Warning: Local jar
>> /mnt/mesos/sandbox/spark-examples-1.6.0-hadoop2.6.0.jar does not exist,
>> skipping.*
>> java.lang.ClassNotFoundException: org.apache.spark.examples.SparkPi
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:278)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> When i do docker inspect i see below command gets issued
>>
>> "Cmd": [
>> "-c",
>> "./bin/spark-submit --name org.apache.spark.examples.SparkPi
>> --master mesos://10.0.2.15:5050 --driver-cores 1.0 --driver-memory 1024M
>> --class org.apache.spark.examples.SparkPi 
>> $*MESOS_SANDBOX*/spark-examples-1.6.0-hadoop2.6.0.jar
>> "
>>
>>
>>
>> On Thu, Mar 3, 2016 at 12:09 AM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> You shouldn't need to specify --jars at all since you only have one jar.
>>>
>>> The error is pretty odd as it suggests it's trying to load
>>> /opt/spark/Example but that doesn't really seem to be anywhere in your
>>> image or command.
>>>
>>> Can you paste your stdout from the driver task launched by the cluster
>>> dispatcher, that shows you the spark-submit command it eventually ran?
>>>
>>>
>>> Tim
>>>
>>>
>>>
>>> On Wed, Mar 2, 2016 at 5:42 PM, Ashish Soni <asoni.le...@gmail.com>
>>> wrote:
>>>
>>>> See below, and attached is the Dockerfile to build the spark image
>>>> (by the way, I just upgraded to 1.6)
>>>>
>>>> I am running below setup -
>>>>
>>>> Mesos Master - Docker Container
>>>> Mesos Slave 1 - Docker Container
>>>> Mesos Slave 2 - Docker Container
>>>> Marathon - Docker Container
>>>> Spark MESOS Dispatcher - Docker Container
>>>>
>>>> when i submit the Spark PI Example Job using below command
>>>>
>>>> *docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077
>>>> <http://10.0.2.15:7077>"  -e SPARK_IMAGE="spark_driver:**latest"
>>>> spark_driver:latest ./bin/spa

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
See below, and attached is the Dockerfile to build the spark image (by the way,
I just upgraded to 1.6)

I am running below setup -

Mesos Master - Docker Container
Mesos Slave 1 - Docker Container
Mesos Slave 2 - Docker Container
Marathon - Docker Container
Spark MESOS Dispatcher - Docker Container

when i submit the Spark PI Example Job using below command

*docker run -it --rm -m 2g -e SPARK_MASTER="mesos://10.0.2.15:7077
<http://10.0.2.15:7077>"  -e SPARK_IMAGE="spark_driver:**latest"
spark_driver:latest ./bin/spark-submit  --deploy-mode cluster --name "PI
Example" --class org.apache.spark.examples.**SparkPi
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
<http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar> --jars
/opt/spark/lib/spark-examples-**1.6.0-hadoop2.6.0.jar --verbose*

Below is the ERROR
Error: Cannot load main class from JAR file:/opt/spark/Example
Run with --help for usage help or --verbose for debug output


When I docker-inspect the stopped/dead container I see the output below. What is
interesting is that my original command was replaced (by someone, or by the
executor) with the one highlighted below, and I do not see the executor downloading
the JAR -- is this a bug I am hitting, or is it supposed to
work this way and I am missing some configuration?

"Env": [
"SPARK_IMAGE=spark_driver:latest",
"SPARK_SCALA_VERSION=2.10",
"SPARK_VERSION=1.6.0",
"SPARK_EXECUTOR_URI=
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz;,
"MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos-0.25.0.so",
"SPARK_MASTER=mesos://10.0.2.15:7077",

"SPARK_EXECUTOR_OPTS=-Dspark.executorEnv.MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/
libmesos-0.25.0.so -Dspark.jars=
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
-Dspark.mesos.mesosExecutor.cores=0.1 -Dspark.driver.supervise=false -
Dspark.app.name=PI Example -Dspark.mesos.uris=
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
-Dspark.mesos.executor.docker.image=spark_driver:latest
-Dspark.submit.deployMode=cluster -Dspark.master=mesos://10.0.2.15:7077
-Dspark.driver.extraClassPath=/opt/spark/custom/lib/*
-Dspark.executor.extraClassPath=/opt/spark/custom/lib/*
-Dspark.executor.uri=
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
-Dspark.mesos.executor.home=/opt/spark",
"MESOS_SANDBOX=/mnt/mesos/sandbox",

"MESOS_CONTAINER_NAME=mesos-e47f8d4c-5ee1-4d01-ad07-0d9a03ced62d-S1.43c08f82-e508-4d57-8c0b-fa05bee77fd6",

"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"HADOOP_VERSION=2.6",
"SPARK_HOME=/opt/spark"
],
"Cmd": [
"-c",
   * "./bin/spark-submit --name PI Example --master
mesos://10.0.2.15:5050 <http://10.0.2.15:5050> --driver-cores 1.0
--driver-memory 1024M --class org.apache.spark.examples.SparkPi
$MESOS_SANDBOX/spark-examples-1.6.0-hadoop2.6.0.jar --jars
/opt/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar --verbose"*
],
"Image": "spark_driver:latest",












On Wed, Mar 2, 2016 at 5:49 PM, Charles Allen <charles.al...@metamarkets.com
> wrote:

> @Tim yes, this is asking about 1.5 though
>
> On Wed, Mar 2, 2016 at 2:35 PM Tim Chen <t...@mesosphere.io> wrote:
>
>> Hi Charles,
>>
>> I thought that's fixed with your patch in latest master now right?
>>
>> Ashish, yes please give me your docker image name (if it's in the public
>> registry) and what you've tried and I can see what's wrong. I think it's
>> most likely just the configuration of where the Spark home folder is in the
>> image.
>>
>> Tim
>>
>> On Wed, Mar 2, 2016 at 2:28 PM, Charles Allen <
>> charles.al...@metamarkets.com> wrote:
>>
>>> Re: Spark on Mesos Warning regarding disk space:
>>> https://issues.apache.org/jira/browse/SPARK-12330
>>>
>>> That's a spark flaw I encountered on a very regular basis on mesos. That
>>> and a few other annoyances are fixed in
>>> https://github.com/metamx/spark/tree/v1.5.2-mmx
>>>
>>> Here's another mild annoyance I've encountered:
>>> https://issues.apache.org/jira/browse/SPARK-11714
>>>
>>> On Wed, Mar 2, 2016 at 1:31 PM Ashish Soni <asoni.le...@gmail.com>
>>> wrote:
>>>
>>>> I have had no luck, and I would like to ask the Spark committers: will
>>>> this ever be designed to run on Mesos?
>>>>
>>>> A Spark app as a docker container is not working at all on Mesos; if any
>>>> one would like the code I can send it over to have a look.

Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I have had no luck, and I would like to ask the Spark committers: will
this ever be designed to run on Mesos?

A Spark app as a docker container is not working at all on Mesos; if anyone
would like the code I can send it over to have a look.

Ashish

On Wed, Mar 2, 2016 at 12:23 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Try passing jar using --jars option
>
> On Wed, Mar 2, 2016 at 10:17 AM Ashish Soni <asoni.le...@gmail.com> wrote:
>
>> I made some progress but now I am stuck at this point. Please help, as
>> it looks like I am close to getting it working.
>>
>> I have everything running in docker container including mesos slave and
>> master
>>
>> When i try to submit the pi example i get below error
>> *Error: Cannot load main class from JAR file:/opt/spark/Example*
>>
>> Below is the command i use to submit as a docker container
>>
>> docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
>> SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
>> --deploy-mode cluster --name "PI Example" --class
>> org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory
>> 512m --executor-cores 1
>> http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar
>>
>>
>> On Tue, Mar 1, 2016 at 2:59 PM, Timothy Chen <t...@mesosphere.io> wrote:
>>
>>> Can you go through the Mesos UI and look at the driver/executor log from
>>> the stderr file and see what the problem is?
>>>
>>> Tim
>>>
>>> On Mar 1, 2016, at 8:05 AM, Ashish Soni <asoni.le...@gmail.com> wrote:
>>>
>>> Not sure what is the issue but i am getting below error  when i try to
>>> run spark PI example
>>>
>>> Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
>>>due to too many failures; is Spark installed on it?
>>> WARN TaskSchedulerImpl: Initial job has not accepted any resources; 
>>> check your cluster UI to ensure that workers are registered and have 
>>> sufficient resources
>>>
>>>
>>> On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
>>> vsathishkuma...@gmail.com> wrote:
>>>
>>>> May be the Mesos executor couldn't find spark image or the constraints
>>>> are not satisfied. Check your Mesos UI if you see Spark application in the
>>>> Frameworks tab
>>>>
>>>> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni <asoni.le...@gmail.com>
>>>> wrote:
>>>>
>>>>> What is the Best practice , I have everything running as docker
>>>>> container in single host ( mesos and marathon also as docker container )
>>>>>  and everything comes up fine but when i try to launch the spark shell i
>>>>> get below error
>>>>>
>>>>>
>>>>> SQL context available as sqlContext.
>>>>>
>>>>> scala> val data = sc.parallelize(1 to 100)
>>>>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>>>>> parallelize at :27
>>>>>
>>>>> scala> data.count
>>>>> [Stage 0:>  (0
>>>>> + 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>>>>> accepted any resources; check your cluster UI to ensure that workers are
>>>>> registered and have sufficient resources
>>>>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>>>>> any resources; check your cluster UI to ensure that workers are registered
>>>>> and have sufficient resources
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>>
>>>>>> No you don't have to run Mesos in docker containers to run Spark in
>>>>>> docker containers.
>>>>>>
>>>>>> Once you have Mesos cluster running you can then specfiy the Spark
>>>>>> configurations in your Spark job (i.e: 
>>>>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>>>>> and Mesos will automatically launch docker containers for you.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni <asoni.le...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Yes i read that and not much details here.
>>>>>>>
>>>>>>> Is it true that we need to have spark installed on each mesos docker
>>>>>>> container ( master and slave ) ...
>>>>>>>
>>>>>>> Ashish
>>>>>>>
>>>>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>>>>
>>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should
>>>>>>>> be the best source, what problems were you running into?
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>
>>>>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang <yy201...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Have you read this ?
>>>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>>>>
>>>>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <
>>>>>>>>> asoni.le...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All ,
>>>>>>>>>>
>>>>>>>>>> Is there any proper documentation as how to run spark on mesos ,
>>>>>>>>>> I am trying from the last few days and not able to make it work.
>>>>>>>>>>
>>>>>>>>>> Please help
>>>>>>>>>>
>>>>>>>>>> Ashish
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>


Re: Spark 1.5 on Mesos

2016-03-02 Thread Ashish Soni
I made some progress but now I am stuck at this point. Please help, as
it looks like I am close to getting it working.

I have everything running in docker container including mesos slave and
master

When i try to submit the pi example i get below error
*Error: Cannot load main class from JAR file:/opt/spark/Example*

Below is the command i use to submit as a docker container

docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:7077"  -e
SPARK_IMAGE="spark_driver:latest" spark_driver:latest ./bin/spark-submit
--deploy-mode cluster --name "PI Example" --class
org.apache.spark.examples.SparkPi --driver-memory 512m --executor-memory
512m --executor-cores 1
http://10.0.2.15/spark-examples-1.6.0-hadoop2.6.0.jar


On Tue, Mar 1, 2016 at 2:59 PM, Timothy Chen <t...@mesosphere.io> wrote:

> Can you go through the Mesos UI and look at the driver/executor log from
> the stderr file and see what the problem is?
>
> Tim
>
> On Mar 1, 2016, at 8:05 AM, Ashish Soni <asoni.le...@gmail.com> wrote:
>
> Not sure what is the issue but i am getting below error  when i try to run
> spark PI example
>
> Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
>due to too many failures; is Spark installed on it?
> WARN TaskSchedulerImpl: Initial job has not accepted any resources; check 
> your cluster UI to ensure that workers are registered and have sufficient 
> resources
>
>
> On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> May be the Mesos executor couldn't find spark image or the constraints
>> are not satisfied. Check your Mesos UI if you see Spark application in the
>> Frameworks tab
>>
>> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni <asoni.le...@gmail.com>
>> wrote:
>>
>>> What is the Best practice , I have everything running as docker
>>> container in single host ( mesos and marathon also as docker container )
>>>  and everything comes up fine but when i try to launch the spark shell i
>>> get below error
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val data = sc.parallelize(1 to 100)
>>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>>> parallelize at :27
>>>
>>> scala> data.count
>>> [Stage 0:>  (0 +
>>> 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>>> accepted any resources; check your cluster UI to ensure that workers are
>>> registered and have sufficient resources
>>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>>> any resources; check your cluster UI to ensure that workers are registered
>>> and have sufficient resources
>>>
>>>
>>>
>>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>
>>>> No you don't have to run Mesos in docker containers to run Spark in
>>>> docker containers.
>>>>
>>>> Once you have Mesos cluster running you can then specfiy the Spark
>>>> configurations in your Spark job (i.e: 
>>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>>> and Mesos will automatically launch docker containers for you.
>>>>
>>>> Tim
>>>>
>>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni <asoni.le...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes i read that and not much details here.
>>>>>
>>>>> Is it true that we need to have spark installed on each mesos docker
>>>>> container ( master and slave ) ...
>>>>>
>>>>> Ashish
>>>>>
>>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>>
>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>>>>> the best source, what problems were you running into?
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang <yy201...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Have you read this ?
>>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>>
>>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <asoni.le...@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi All ,
>>>>>>>>
>>>>>>>> Is there any proper documentation as how to run spark on mesos , I
>>>>>>>> am trying from the last few days and not able to make it work.
>>>>>>>>
>>>>>>>> Please help
>>>>>>>>
>>>>>>>> Ashish
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>


Spark Submit using Convert to Marthon REST API

2016-03-01 Thread Ashish Soni
Hi All ,

Can someone please help me translate the below spark-submit command into a
Marathon JSON request?

docker run -it --rm -e SPARK_MASTER="mesos://10.0.2.15:5050"  -e
SPARK_IMAGE="spark_driver:latest" spark_driver:latest
/opt/spark/bin/spark-submit  --name "PI Example" --class
org.apache.spark.examples.SparkPi --driver-memory 1g --executor-memory 1g
--executor-cores 1 /opt/spark/lib/spark-examples-1.6.0-hadoop2.6.0.jar

Thanks,


Re: Spark 1.5 on Mesos

2016-03-01 Thread Ashish Soni
Not sure what the issue is, but I am getting the below error when I try to run
the Spark Pi example:

Blacklisting Mesos slave value: "5345asdasdasdkas234234asdasdasdasd"
   due to too many failures; is Spark installed on it?
WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered
and have sufficient resources


On Mon, Feb 29, 2016 at 1:39 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> May be the Mesos executor couldn't find spark image or the constraints are
> not satisfied. Check your Mesos UI if you see Spark application in the
> Frameworks tab
>
> On Mon, Feb 29, 2016 at 12:23 PM Ashish Soni <asoni.le...@gmail.com>
> wrote:
>
>> What is the Best practice , I have everything running as docker container
>> in single host ( mesos and marathon also as docker container )  and
>> everything comes up fine but when i try to launch the spark shell i get
>> below error
>>
>>
>> SQL context available as sqlContext.
>>
>> scala> val data = sc.parallelize(1 to 100)
>> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
>> parallelize at :27
>>
>> scala> data.count
>> [Stage 0:>  (0 +
>> 0) / 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not
>> accepted any resources; check your cluster UI to ensure that workers are
>> registered and have sufficient resources
>> 16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted
>> any resources; check your cluster UI to ensure that workers are registered
>> and have sufficient resources
>>
>>
>>
>> On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> No you don't have to run Mesos in docker containers to run Spark in
>>> docker containers.
>>>
>>> Once you have Mesos cluster running you can then specfiy the Spark
>>> configurations in your Spark job (i.e: 
>>> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
>>> and Mesos will automatically launch docker containers for you.
>>>
>>> Tim
>>>
>>> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni <asoni.le...@gmail.com>
>>> wrote:
>>>
>>>> Yes i read that and not much details here.
>>>>
>>>> Is it true that we need to have spark installed on each mesos docker
>>>> container ( master and slave ) ...
>>>>
>>>> Ashish
>>>>
>>>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen <t...@mesosphere.io> wrote:
>>>>
>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>>>> the best source, what problems were you running into?
>>>>>
>>>>> Tim
>>>>>
>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang <yy201...@gmail.com> wrote:
>>>>>
>>>>>> Have you read this ?
>>>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <asoni.le...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All ,
>>>>>>>
>>>>>>> Is there any proper documentation as how to run spark on mesos , I
>>>>>>> am trying from the last few days and not able to make it work.
>>>>>>>
>>>>>>> Please help
>>>>>>>
>>>>>>> Ashish
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
What is the best practice? I have everything running as docker containers
on a single host (mesos and marathon also as docker containers), and
everything comes up fine, but when I try to launch the spark shell I get
the error below.


SQL context available as sqlContext.

scala> val data = sc.parallelize(1 to 100)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize at <console>:27

scala> data.count
[Stage 0:>  (0 + 0)
/ 2]16/02/29 18:21:12 WARN TaskSchedulerImpl: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered
and have sufficient resources
16/02/29 18:21:27 WARN TaskSchedulerImpl: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources



On Mon, Feb 29, 2016 at 12:04 PM, Tim Chen <t...@mesosphere.io> wrote:

> No you don't have to run Mesos in docker containers to run Spark in docker
> containers.
>
> Once you have Mesos cluster running you can then specfiy the Spark
> configurations in your Spark job (i.e: 
> spark.mesos.executor.docker.image=mesosphere/spark:1.6)
> and Mesos will automatically launch docker containers for you.
>
> Tim
>
> On Mon, Feb 29, 2016 at 7:36 AM, Ashish Soni <asoni.le...@gmail.com>
> wrote:
>
>> Yes i read that and not much details here.
>>
>> Is it true that we need to have spark installed on each mesos docker
>> container ( master and slave ) ...
>>
>> Ashish
>>
>> On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen <t...@mesosphere.io> wrote:
>>
>>> https://spark.apache.org/docs/latest/running-on-mesos.html should be
>>> the best source, what problems were you running into?
>>>
>>> Tim
>>>
>>> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang <yy201...@gmail.com> wrote:
>>>
>>>> Have you read this ?
>>>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>>>
>>>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <asoni.le...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All ,
>>>>>
>>>>> Is there any proper documentation as how to run spark on mesos , I am
>>>>> trying from the last few days and not able to make it work.
>>>>>
>>>>> Please help
>>>>>
>>>>> Ashish
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark 1.5 on Mesos

2016-02-29 Thread Ashish Soni
Yes, I read that, but there are not many details there.

Is it true that we need to have spark installed on each mesos docker
container ( master and slave ) ...

Ashish

On Fri, Feb 26, 2016 at 2:14 PM, Tim Chen <t...@mesosphere.io> wrote:

> https://spark.apache.org/docs/latest/running-on-mesos.html should be the
> best source, what problems were you running into?
>
> Tim
>
> On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang <yy201...@gmail.com> wrote:
>
>> Have you read this ?
>> https://spark.apache.org/docs/latest/running-on-mesos.html
>>
>> On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni <asoni.le...@gmail.com>
>> wrote:
>>
>>> Hi All ,
>>>
>>> Is there any proper documentation as how to run spark on mesos , I am
>>> trying from the last few days and not able to make it work.
>>>
>>> Please help
>>>
>>> Ashish
>>>
>>
>>
>


Spark 1.5 on Mesos

2016-02-26 Thread Ashish Soni
Hi All ,

Is there any proper documentation on how to run Spark on Mesos? I have been
trying for the last few days and am not able to make it work.

Please help

Ashish


SPARK-9559

2016-02-18 Thread Ashish Soni
Hi All ,

Just wanted to know if there is any workaround or resolution for the below
issue in standalone mode:

https://issues.apache.org/jira/browse/SPARK-9559

Ashish


Seperate Log4j.xml for Spark and Application JAR ( Application vs Spark )

2016-02-12 Thread Ashish Soni
Hi All ,

As per my best understanding we can have only one log4j configuration for both
Spark and the application, as whichever comes first on the classpath takes precedence.
Is there any way we can keep one in the application and one in the Spark conf
folder? Is that possible?

Thanks


Re: Spark Submit

2016-02-12 Thread Ashish Soni
it works as below

spark-submit --conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml" --conf
spark.executor.memory=512m

Thanks all for the quick help.



On Fri, Feb 12, 2016 at 10:59 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:

> Try
> spark-submit  --conf "spark.executor.memory=512m" --conf
> "spark.executor.extraJavaOptions=x" --conf "Dlog4j.configuration=log4j.xml"
>
> Sent from Samsung Mobile.
>
>
>  Original message 
> From: Ted Yu <yuzhih...@gmail.com>
> Date:12/02/2016 21:24 (GMT+05:30)
> To: Ashish Soni <asoni.le...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: Spark Submit
>
> Have you tried specifying multiple '--conf key=value' ?
>
> Cheers
>
> On Fri, Feb 12, 2016 at 7:44 AM, Ashish Soni <asoni.le...@gmail.com>
> wrote:
>
>> Hi All ,
>>
>> How do i pass multiple configuration parameter while spark submit
>>
>> Please help i am trying as below
>>
>> spark-submit  --conf "spark.executor.memory=512m
>> spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"
>>
>> Thanks,
>>
>
>


Spark Submit

2016-02-12 Thread Ashish Soni
Hi All ,

How do I pass multiple configuration parameters with spark-submit?

Please help; I am trying as below:

spark-submit  --conf "spark.executor.memory=512m
spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.xml"

Thanks,


Dynamically Change Log Level Spark Streaming

2016-02-08 Thread Ashish Soni
Hi All ,

How do I change the log level for a running Spark Streaming job? Any help
will be appreciated.

Thanks,


Example of onEnvironmentUpdate Listener

2016-02-08 Thread Ashish Soni
Are there any examples of how to implement the onEnvironmentUpdate method for a
custom listener?

Thanks,


Redirect Spark Logs to Kafka

2016-02-01 Thread Ashish Soni
Hi All ,

Please let me know how we can redirect the Spark log files, or tell Spark to
log to a Kafka queue instead of files.

Ashish


Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Hi All ,

What is the best way to tell a Spark Streaming job the number of partitions for
a given topic?

Should that be provided as a parameter or command-line argument,
or
should we connect to Kafka in the driver program and query it?

Map<TopicAndPartition, Long> fromOffsets = new HashMap<TopicAndPartition, Long>();
fromOffsets.put(new TopicAndPartition(driverArgs.inputTopic, 0), 0L);

Thanks,
Ashish


Re: Determine Topic MetaData Spark Streaming Job

2016-01-25 Thread Ashish Soni
Correct. What I am trying to achieve is that before the streaming job starts,
it queries the topic metadata from Kafka, determines all the partitions, and
provides those to the direct API.

So my question is: should I consider passing all the partitions from the command
line, or query Kafka to find and provide them? What is the correct approach?

Ashish

On Mon, Jan 25, 2016 at 11:38 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> What are you trying to achieve?
>
> Looks like you want to provide offsets but you're not managing them
> and I'm assuming you're using the direct stream approach.
>
> In that case, use the simpler constructor that takes the kafka config and
> the topics. Let it figure it out the offsets (it will contact kafka and
> request the partitions for the topics provided)
>
> KafkaUtils.createDirectStream[...](ssc, kafkaConfig, topics)
>
>  -kr, Gerard
>
> On Mon, Jan 25, 2016 at 5:31 PM, Ashish Soni <asoni.le...@gmail.com>
> wrote:
>
>> Hi All ,
>>
>> What is the best way to tell spark streaming job for the no of partition
>> to to a given topic -
>>
>> Should that be provided as a parameter or command line argument
>> or
>> We should connect to kafka in the driver program and query it
>>
>> Map<TopicAndPartition, Long> fromOffsets = new HashMap<TopicAndPartition,
>> Long>();
>> fromOffsets.put(new TopicAndPartition(driverArgs.inputTopic, 0), 0L);
>>
>> Thanks,
>> Ashish
>>
>
>


How to change the no of cores assigned for a Submitted Job

2016-01-12 Thread Ashish Soni
Hi ,

I am seeing strange behavior when creating a standalone Spark container using
docker.
Not sure why, by default, it assigns 4 cores to the first job submitted,
and then all the other jobs are in a wait state. Please suggest if there is
a setting to change this.

I tried --executor-cores 1 but it has no effect.

[image: Inline image 1]


Re: question on make multiple external calls within each partition

2015-10-05 Thread Ashish Soni
Need more details, but you might want to filter the data first (create multiple
RDDs) and then process them.


> On Oct 5, 2015, at 8:35 PM, Chen Song  wrote:
> 
> We have a use case with the following design in Spark Streaming.
> 
> Within each batch,
> * data is read and partitioned by some key
> * forEachPartition is used to process the entire partition
> * within each partition, there are several REST clients created to connect to 
> different REST services
> * for the list of records within each partition, it will call these services, 
> each service call is independent of others; records are just pre-partitioned 
> to make these calls more efficiently.
> 
> I have a question
> * Since each call is time taking and to prevent the calls to be executed 
> sequentially, how can I parallelize the service calls within processing of 
> each partition? Can I just use Scala future within forEachPartition(or 
> mapPartitions)?
> 
> Any suggestions greatly appreciated.
> 
> Chen
> 
> 




Re: DStream Transformation to save JSON in Cassandra 2.1

2015-10-05 Thread Ashish Soni
Try this:

You can use dstream.map to convert it to a JavaDStream containing only the data you
are interested in (probably returning a POJO of your JSON),

and then call foreachRDD and inside that call the line below:

javaFunctions(rdd).writerBuilder("keyspace", "table",
mapToRow(YourPojo.class)).saveToCassandra();
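
For reference, a minimal Java-flavored sketch of that approach (not from the thread):
Coordinate and its fromJson parser are hypothetical stand-ins, and the keyspace/table
names are taken from the schema in the quoted question below.

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.api.java.JavaDStream;

public class SaveCoordinates {
    // Coordinate is a hypothetical serializable POJO whose fields
    // (id, ax, ay, az, oa, ob, og) mirror the columns of iotdata.coordinate.
    public static void save(JavaDStream<String> jsonLines) {
        // 1) Map each raw JSON string to the POJO we actually want to store.
        JavaDStream<Coordinate> coords = jsonLines.map(new Function<String, Coordinate>() {
            @Override
            public Coordinate call(String json) {
                return Coordinate.fromJson(json); // hypothetical JSON parsing helper
            }
        });

        // 2) For every micro-batch, write the RDD of POJOs to Cassandra.
        //    Note the argument order: keyspace first, then table.
        coords.foreachRDD(new Function<JavaRDD<Coordinate>, Void>() {
            @Override
            public Void call(JavaRDD<Coordinate> rdd) {
                javaFunctions(rdd)
                        .writerBuilder("iotdata", "coordinate", mapToRow(Coordinate.class))
                        .saveToCassandra();
                return null;
            }
        });
    }
}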

On Mon, Oct 5, 2015 at 10:14 AM, Prateek .  wrote:

> Hi,
>
> I am beginner in Spark , this is sample data I get from Kafka stream:
>
> {"id":
> "9f5ccb3d5f4f421392fb98978a6b368f","coordinate":{"ax":"1.20","ay":"3.80","az":"9.90","oa":"8.03","ob":"8.8","og":"9.97"}}
>
>   val lines = KafkaUtils.createStream(ssc, zkQuorum, group,
> topicMap).map(_._2)
>   val jsonf =
> lines.map(JSON.parseFull(_)).map(_.get.asInstanceOf[scala.collection.immutable.Map[String,
> Any]])
>
>   I am getting a, DSTream[Map[String,Any]]. I need to store each
> coordinate values in the below Cassandra schema
>
> CREATE TABLE iotdata.coordinate (
> id text PRIMARY KEY, ax double, ay double, az double, oa double, ob
> double, oz double
> )
>
> For this what transformations I need to apply before I execute
> saveToCassandra().
>
> Thank You,
> Prateek
>
>
>


Re: Spark Streaming Log4j Inside Eclipse

2015-09-29 Thread Ashish Soni
I am using the Java streaming context and it doesn't have a setLogLevel method.
I have also tried passing the VM argument in Eclipse and it doesn't work.

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));

Ashish
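
For what it's worth, a hedged sketch of two ways this is commonly handled from Java
code (an assumption on my part, not something confirmed in this thread; option 1 needs
Spark 1.4+ and option 2 needs plain log4j 1.x on the classpath):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("LogLevelDemo");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

// Option 1: set the level on the JavaSparkContext that the streaming context wraps.
jssc.sparkContext().setLogLevel("WARN");

// Option 2: tune the log4j loggers directly, independent of any properties file.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN);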

On Tue, Sep 29, 2015 at 7:23 AM, Adrian Tanase <atan...@adobe.com> wrote:

> You should set exta java options for your app via Eclipse project and
> specify something like
>
>  -Dlog4j.configuration=file:/tmp/log4j.properties
>
> Sent from my iPhone
>
> On 28 Sep 2015, at 18:52, Shixiong Zhu <zsxw...@gmail.com> wrote:
>
> You can use JavaSparkContext.setLogLevel to set the log level in your
> codes.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-28 22:55 GMT+08:00 Ashish Soni <asoni.le...@gmail.com>:
>
>> I am not running it using spark submit , i am running locally inside
>> Eclipse IDE , how i set this using JAVA Code
>>
>> Ashish
>>
>> On Mon, Sep 28, 2015 at 10:42 AM, Adrian Tanase <atan...@adobe.com>
>> wrote:
>>
>>> You also need to provide it as parameter to spark submit
>>>
>>> http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver
>>>
>>> From: Ashish Soni
>>> Date: Monday, September 28, 2015 at 5:18 PM
>>> To: user
>>> Subject: Spark Streaming Log4j Inside Eclipse
>>>
>>> I need to turn off the verbose logging of Spark Streaming Code when i am
>>> running inside eclipse i tried creating a log4j.properties file and placed
>>> inside /src/main/resources but i do not see it getting any effect , Please
>>> help as not sure what else needs to be done to change the log at DEBUG or
>>> WARN
>>>
>>
>>
>


Re: Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Ashish Soni
I am not running it using spark-submit; I am running it locally inside the
Eclipse IDE. How do I set this using Java code?

Ashish

On Mon, Sep 28, 2015 at 10:42 AM, Adrian Tanase <atan...@adobe.com> wrote:

> You also need to provide it as parameter to spark submit
>
> http://stackoverflow.com/questions/28840438/how-to-override-sparks-log4j-properties-per-driver
>
> From: Ashish Soni
> Date: Monday, September 28, 2015 at 5:18 PM
> To: user
> Subject: Spark Streaming Log4j Inside Eclipse
>
> I need to turn off the verbose logging of Spark Streaming Code when i am
> running inside eclipse i tried creating a log4j.properties file and placed
> inside /src/main/resources but i do not see it getting any effect , Please
> help as not sure what else needs to be done to change the log at DEBUG or
> WARN
>


Spark Streaming Log4j Inside Eclipse

2015-09-28 Thread Ashish Soni
Hi All ,

I need to turn off the verbose logging of the Spark Streaming code when I am
running inside Eclipse. I tried creating a log4j.properties file and placed it
inside /src/main/resources, but I do not see it having any effect. Please
help, as I am not sure what else needs to be done to change the log level to DEBUG or
WARN.

Ashish


Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Ashish Soni
Hi All ,

Just wanted to find out if there are any benefits to installing Kafka
brokers and Spark nodes on the same machines.

Is it possible for Spark to pull data from Kafka when it is local to the
node, i.e. the broker or partition is on the same machine?

Thanks,
Ashish


Spark Cassandra Filtering

2015-09-16 Thread Ashish Soni
Hi ,

How can I pass a dynamic value to the filter in the function below instead of a
hardcoded one?
I have an existing RDD and I would like to use data in it for the filter, so
instead of doing .where("name=?", "Anna") I want to do
.where("name=?", someobject.value).

Please help

JavaRDD<String> rdd3 = javaFunctions(sc).cassandraTable("test", "people",
        mapRowTo(Person.class))
        .where("name=?", "Anna").map(new Function<Person, String>()
{
    @Override
    public String call(Person person) throws Exception {
        return person.toString();
    }
});


Dynamic Workflow Execution using Spark

2015-09-15 Thread Ashish Soni
Hi All ,

Are there any frameworks which can be used to execute workflows within
Spark? Or is it possible to use an ML Pipeline for workflow execution without
doing ML?

Thanks,
Ashish


Re: FlatMap Explanation

2015-09-03 Thread Ashish Soni
Thanks a lot everyone.
Very Helpful.

Ashish

On Thu, Sep 3, 2015 at 2:19 AM, Zalzberg, Idan (Agoda) <
idan.zalzb...@agoda.com> wrote:

> Hi,
>
> Yes, I can explain
>
>
>
> 1 to 3 -> 1,2,3
>
> 2 to 3 -> 2,3
>
> 3 to 3 -> 3
>
> 3 to 3 -> 3
>
>
>
> flatMap then concatenates the results, so you get
>
>
>
> 1,2,3, 2,3, 3,3
>
>
>
> You should get the same with any scala collection
>
>
>
> Cheers
>
>
>
> *From:* Ashish Soni [mailto:asoni.le...@gmail.com]
> *Sent:* Thursday, September 03, 2015 9:06 AM
> *To:* user <user@spark.apache.org>
> *Subject:* FlatMap Explanation
>
>
>
> Hi ,
>
> Can some one please explain the output of the flat map
>
> data in RDD as below
>
> {1, 2, 3, 3}
>
> rdd.flatMap(x => x.to(3))
>
> output as below
>
> {1, 2, 3, 2, 3, 3, 3}
>
> i am not able to understand how the output came as above.
>
> Thanks,
>
>
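
For readers more comfortable with the Java API, a small runnable sketch of the same
example (assuming Spark 1.x, where FlatMapFunction.call returns an Iterable):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class FlatMapExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "FlatMapExample");
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 3));

        // For each element x, emit every value from x up to 3 (inclusive);
        // flatMap then flattens all the per-element lists into one RDD.
        JavaRDD<Integer> flat = rdd.flatMap(new FlatMapFunction<Integer, Integer>() {
            @Override
            public Iterable<Integer> call(Integer x) {
                List<Integer> out = new ArrayList<Integer>();
                for (int i = x; i <= 3; i++) {
                    out.add(i);
                }
                return out;
            }
        });

        System.out.println(flat.collect()); // [1, 2, 3, 2, 3, 3, 3]
        sc.stop();
    }
}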


FlatMap Explanation

2015-09-02 Thread Ashish Soni
Hi ,

Can someone please explain the output of the flatMap below?
The data in the RDD is:
{1, 2, 3, 3}

rdd.flatMap(x => x.to(3))

output as below

{1, 2, 3, 2, 3, 3, 3}
I am not able to understand how the output above was produced.

Thanks,


Java Streaming Context - File Stream use

2015-08-10 Thread Ashish Soni
Please help; I am not sure what is incorrect with the below code, as it gives me a
compilation error in Eclipse.

 SparkConf sparkConf = new
SparkConf().setMaster("local[4]").setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));

*jssc.fileStream("/home/", String.class, String.class,
TextInputFormat.class);*
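
One likely cause of the compilation error (a guess on my part, not something confirmed
in the thread): the key/value classes passed to fileStream must match the generic types
of the InputFormat, and TextInputFormat produces LongWritable offsets and Text lines
rather than Strings. A sketch that should compile against the Spark 1.x Java API:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;

// Key/value classes must match what TextInputFormat emits.
JavaPairInputDStream<LongWritable, Text> lines =
        jssc.fileStream("/home/", LongWritable.class, Text.class, TextInputFormat.class);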


Class Loading Issue - Spark Assembly and Application Provided

2015-07-21 Thread Ashish Soni
Hi All ,

I am having a class-loading issue: the Spark assembly uses Google Guice
internally, and one of the jars I am using uses sisu-guice-3.1.0-no_aop.jar. How
do I load my classes first so that this doesn't result in an error, and tell Spark
to load its assembly later?
Ashish


XML Parsing

2015-07-19 Thread Ashish Soni
Hi All ,

I have an XML file with the same tag repeated multiple times, as below. Please
suggest what would be the best way to process this data inside Spark.

How can I extract each opening and closing tag and process them, or how can I
combine multiple lines into a single line?

<review>
</review>
<review>
</review>
...
..
..

Thanks,


BroadCast on Interval ( eg every 10 min )

2015-07-16 Thread Ashish Soni
Hi All ,
How can I broadcast a data change to all the executors every 10 minutes or
every 1 minute?

Ashish


How Will Spark Execute below Code - Driver and Executors

2015-07-06 Thread Ashish Soni
Hi All ,

If someone can help me understand which portion of the below code gets
executed on the driver and which portion will be executed on the executors,
it would be a great help.

I have to load data from 10 tables and then use that data in various
manipulations, and I am using Spark SQL for that. Please let me know if the below
code will be executed on the driver or in each executor
node.

And if I do a join on the data frames, will it happen on the executors or the driver?

options.put("dbtable", "(select * from t_table1) as t_table1");
DataFrame t_gsubmember = sqlContext.read().format("jdbc").options(options).load();
t_table1.cache();

options.put("dbtable", "(select * from t_table2) as t_table2");
DataFrame t_sub = sqlContext.read().format("jdbc").options(options).load();
t_table2.cache();

options.put("dbtable", "(select * from t_table3) as t_table3");
DataFrame t_pi = sqlContext.read().format("jdbc").options(options).load();
t_table3.cache();

... and so on

Thanks


Spark SQL and Streaming - How to execute JDBC Query only once

2015-07-02 Thread Ashish Soni
Hi All  ,

I have a stream of events coming in and I want to fetch some additional
data from the database based on the values in the incoming data. For example,
below is the incoming data:

loginName
Email
address
city

Now for each login name I need to go to the Oracle database and get the userId
*but I do not want to hit the database again and again;
instead I want to load the complete table in memory and then find the user
id based on the incoming data*

JavaRDD<Charge> rdd = sc.textFile("/home/spark/workspace/data.csv").map(new
        Function<String, String>() {
    @Override
    public Charge call(String s) {
        String str[] = s.split(",");
        *//How to load the complete table in memory and use it as
        when i do outside the loop i get stage failure error *
        *   DataFrame dbRdd =
        sqlContext.read().format("jdbc").options(options).load();*

        System.out.println(dbRdd.filter("ogin_nm='" + str[0] + "'").count());

        return str[0];
    }
});


How can I achieve this? Please suggest.

Thanks


BroadCast Multiple DataFrame ( JDBC Tables )

2015-07-01 Thread Ashish Soni
Hi ,

I need to load 10 tables in memory and have them available to all the
workers. Please let me know the best way to broadcast them;

sc.broadcast(df) allows only one at a time.

Thanks,


Convert CSV lines to List of Objects

2015-07-01 Thread Ashish Soni
Hi ,

How can I use the map function in Java to convert all the lines of a CSV file
into a list of objects? Can someone please help...

JavaRDD<List<Charge>> rdd = sc.textFile("data.csv").map(new
        Function<String, List<Charge>>() {
    @Override
    public List<Charge> call(String s) {

    }
});

Thanks,
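
A minimal sketch of one way to fill in the call body, mapping each CSV line to a
single object (so JavaRDD<Charge> rather than JavaRDD<List<Charge>>); the Charge
fields and setters are assumptions, adjust them to the real class and CSV layout:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

JavaRDD<Charge> charges = sc.textFile("data.csv").map(new Function<String, Charge>() {
    @Override
    public Charge call(String line) {
        String[] cols = line.split(",");
        Charge c = new Charge();                   // hypothetical POJO
        c.setId(cols[0]);                          // assumed column order
        c.setAmount(Double.parseDouble(cols[1]));  // assumed numeric column
        return c;
    }
});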


DataFrame Find/Filter Based on Input - Inside Map function

2015-07-01 Thread Ashish Soni
Hi All ,

I have a DataFrame created as below:

options.put("dbtable", "(select * from user) as account");
DataFrame accountRdd =
sqlContext.read().format("jdbc").options(options).load();

and I have another RDD which contains login names, and I want to find the
userid from the above DataFrame and return it.

I am not sure how I can do that, as when I apply a map function and call filter on
the DataFrame I get a NullPointerException.

Please help.


Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Ashish Soni
Thanks. So if I load some static data from the database and then need to use
it in my map function to filter records, what would be the best way to do
it?

Ashish

On Wed, Jul 1, 2015 at 10:45 PM, Raghavendra Pandey 
raghavendra.pan...@gmail.com wrote:

 You cannot refer to one RDD inside another RDD's map function...
 The RDD object is not serializable. Whatever objects you use inside the map
 function should be serializable, as they get transferred to the executor nodes.
 On Jul 2, 2015 6:13 AM, Ashish Soni asoni.le...@gmail.com wrote:

 Hi All  ,

 I am not sure what is the wrong with below code as it give below error
 when i access inside the map but it works outside

 JavaRDD<Charge> rdd2 = rdd.map(new Function<Charge, Charge>() {

     @Override
     public Charge call(Charge ch) throws Exception {

* DataFrame df = accountRdd.filter("login=test");*

         return ch;
     }

 });

 5/07/01 20:38:08 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.lang.NullPointerException
 at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
 at org.apache.spark.sql.DataFrame.org
 $apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
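
One pattern that fits the question above (a hedged sketch, not something spelled out
in the thread): collect the small lookup table once on the driver, broadcast a plain
serializable Map, and read from that inside the map function instead of touching the
DataFrame on the executors. The column names login and user_id are assumptions.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class BroadcastLookupExample {
    public static JavaRDD<Long> lookupUserIds(JavaSparkContext sc, SQLContext sqlContext,
                                              Map<String, String> jdbcOptions,
                                              JavaRDD<String> logins) {
        DataFrame accounts = sqlContext.read().format("jdbc").options(jdbcOptions).load();

        // Only safe if the table is small enough to fit on the driver.
        Map<String, Long> loginToUserId = new HashMap<String, Long>();
        List<Row> rows = accounts.select("login", "user_id").collectAsList();
        for (Row r : rows) {
            loginToUserId.put(r.getString(0), r.getLong(1)); // assumes a bigint user_id
        }

        final Broadcast<Map<String, Long>> lookup = sc.broadcast(loginToUserId);

        return logins.map(new Function<String, Long>() {
            @Override
            public Long call(String login) {
                return lookup.value().get(login); // no DataFrame access on executors
            }
        });
    }
}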




DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Ashish Soni
Hi All  ,

I am not sure what is wrong with the below code, as it gives the below error when
I access the DataFrame inside the map, but it works outside:

JavaRDD<Charge> rdd2 = rdd.map(new Function<Charge, Charge>() {

    @Override
    public Charge call(Charge ch) throws Exception {

       * DataFrame df = accountRdd.filter("login=test");*

        return ch;
    }

});

5/07/01 20:38:08 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame.org
$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)


Load Multiple DB Table - Spark SQL

2015-06-29 Thread Ashish Soni
Hi All ,

What is the best possible way to load multiple database tables using Spark SQL?

MapString, String options = new HashMap();
options.put(driver, MYSQLDR);
options.put(url, MYSQL_CN_URL);
options.put(dbtable,(select * from courses);




Can I add multiple tables to the options map, e.g.
options.put("dbtable1", "(select * from test1)"); options.put("dbtable2", "(select * from test2)"); ?


DataFrame jdbcDF = sqlContext.load("jdbc", options);


Thanks,
Ashish
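
One way that is sometimes suggested (a sketch only, not the only option): the options map
carries a single dbtable entry, so each table gets its own load() call. MYSQLDR and
MYSQL_CN_URL are the existing constants from the snippet above, and sqlContext is assumed
to be in scope:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

Map<String, DataFrame> tables = new HashMap<String, DataFrame>();
for (String name : new String[] { "courses", "test1", "test2" }) {
    Map<String, String> options = new HashMap<String, String>();
    options.put("driver", MYSQLDR);
    options.put("url", MYSQL_CN_URL);
    options.put("dbtable", "(select * from " + name + ") as " + name);
    DataFrame df = sqlContext.read().format("jdbc").options(options).load();
    df.registerTempTable(name);   // optional: lets you join the tables with plain SQL afterwards
    tables.put(name, df);
}

After this you can refer to each table by name, e.g. tables.get("courses"), or join them
through sqlContext.sql(...).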


Spark-Submit / Spark-Shell Error Standalone cluster

2015-06-27 Thread Ashish Soni
Not sure what the issue is, but when I run spark-submit or spark-shell I
am getting the below error:

/usr/bin/spark-class: line 24: /usr/bin/load-spark-env.sh: No such file or
directory

Can some one please help

Thanks,


Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
Hi ,

If i have a below data format , how can i use kafka direct stream to
de-serialize as i am not able to understand all the parameter i need to
pass , Can some one explain what will be the arguments as i am not clear
about this

JavaPairInputDStream<K, V> org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(
    JavaStreamingContext jssc,
    Class<K> keyClass, Class<V> valueClass,
    Class<KD> keyDecoderClass, Class<VD> valueDecoderClass,
    Map<String, String> kafkaParams, Set<String> topics)

Re: Kafka Direct Stream - Custom Serialization and Deserilization

2015-06-26 Thread Ashish Soni
My question is: why are there two similar parameters, String.class and
StringDecoder.class? What is the difference between them?

Ashish

On Fri, Jun 26, 2015 at 8:53 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 JavaPairInputDStream<String, String> messages =
     KafkaUtils.createDirectStream(
         jssc,
         String.class,
         String.class,
         StringDecoder.class,
         StringDecoder.class,
         kafkaParams,
         topicsSet
     );

 Here:

 jssc = JavaStreamingContext
 String.class = Key , Value classes
 StringDecoder = Key, Value decoder classes
 KafkaParams = Map in which you specify all the kafka details (like
 brokers, offset etc)
 topicSet = Set of topics from which you want to consume data.​

 Here's a sample program
 https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaDirectKafkaWordCount.java
 for you to start.



 Thanks
 Best Regards

 On Fri, Jun 26, 2015 at 6:09 PM, Ashish Soni asoni.le...@gmail.com
 wrote:

 Hi ,

 If i have a below data format , how can i use kafka direct stream to
 de-serialize as i am not able to understand all the parameter i need to
 pass , Can some one explain what will be the arguments as i am not clear
 about this

 JavaPairInputDStream<K, V> org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(
     JavaStreamingContext arg0, Class<K> arg1, Class<V> arg2, Class<KD> arg3,
     Class<VD> arg4, Map<String, String> arg5, Set<String> arg6)

 ID | Name | Unit | Rate | Duration
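
On the custom-format part of the question: the Class arguments are the key/value types, and
the Decoder arguments are the classes Kafka uses to turn the raw bytes into those types. A
minimal sketch of a custom value decoder; MyEvent and its field layout are assumptions about
the "ID, Name, Unit, Rate, Duration" data above, while kafka.serializer.Decoder and
VerifiableProperties are the real Kafka APIs:

import java.nio.charset.StandardCharsets;

import kafka.serializer.Decoder;
import kafka.serializer.StringDecoder;
import kafka.utils.VerifiableProperties;

// Hypothetical value type matching the ID, Name, Unit, Rate, Duration layout.
public class MyEventDecoder implements Decoder<MyEvent> {
    // Kafka instantiates decoders reflectively through this constructor.
    public MyEventDecoder(VerifiableProperties props) { }

    @Override
    public MyEvent fromBytes(byte[] bytes) {
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(",");
        return new MyEvent(parts[0], parts[1], parts[2], parts[3], parts[4]);
    }
}

// Then the stream would be created with the custom decoder as the value decoder:
// KafkaUtils.createDirectStream(jssc, String.class, MyEvent.class,
//                               StringDecoder.class, MyEventDecoder.class,
//                               kafkaParams, topicsSet);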





WorkFlow Processing - Spark

2015-06-24 Thread Ashish Soni
Hi All ,

We are looking to use Spark as our stream processing framework, and it would
be helpful if experts can weigh in on whether we made the right choice given the
below requirement.

Given a stream of data, we need to take those events through multiple stages (
pipeline processing ), and in those stages the customer will define their own
logic, i.e. custom code which we need to load inside a driver program ...

Any idea of the best way to do this ...

Ashish


How Spark Execute chaining vs no chaining statements

2015-06-23 Thread Ashish Soni
Hi All ,

What is the difference between the two forms below, in terms of execution on a cluster
with 1 or more worker nodes?

rdd.map(...).map(...)...map(..)

vs

val rdd1 = rdd.map(...)
val rdd2 = rdd1.map(...)
val rdd3 = rdd2.map(...)

Thanks,
Ashish
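
For what it's worth, transformations are lazy, so both forms build the same lineage and
execute identically once an action runs; naming the intermediates only matters if you want
to reuse or cache them. A quick way to see this (a Java sketch, assuming an existing
JavaSparkContext sc and a file data.txt):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

Function<String, Integer> toLength = new Function<String, Integer>() {
    public Integer call(String s) { return s.length(); }
};
Function<Integer, Integer> addOne = new Function<Integer, Integer>() {
    public Integer call(Integer n) { return n + 1; }
};

JavaRDD<String> base = sc.textFile("data.txt");

JavaRDD<Integer> chained = base.map(toLength).map(addOne);   // rdd.map(...).map(...)

JavaRDD<Integer> step1 = base.map(toLength);                 // val rdd1 = rdd.map(...)
JavaRDD<Integer> step2 = step1.map(addOne);                  // val rdd2 = rdd1.map(...)

// Both print an equivalent lineage; nothing executes until an action (count, collect, ...) runs.
System.out.println(chained.toDebugString());
System.out.println(step2.toDebugString());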


Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Ashish Soni
Hi All  ,

What is the best way to install a Spark cluster alongside a Hadoop
cluster? Any recommendation on the deployment topologies below would be a great
help.

Also, is it necessary to put the Spark Workers on the DataNodes, so that when a worker
reads a block from HDFS it is local to that server? Or can I put the
Workers on other nodes, and if I do that, will it affect the performance
of the Spark data processing?

Hadoop Option 1

Server 1 - NameNode     Spark Master
Server 2 - DataNode 1   Spark Worker
Server 3 - DataNode 2   Spark Worker
Server 4 - DataNode 3   Spark Worker

Hadoop Option 2


Server 1 - NameNode
Server 2 - Spark Master
Server 2 - DataNode 1
Server 3 - DataNode 2
Server 4 - DataNode 3
Server 5 - Spark Worker 1
Server 6 - Spark Worker 2
Server 7 - Spark Worker 3

Thanks.


Spark 1.4 History Server - HDP 2.2

2015-06-20 Thread Ashish Soni
Can anyone help? I am getting the below error when I try to start the History
Server.
I do not see any org.apache.spark.deploy.yarn.history package inside the
assembly jar, and I am not sure how to get that.

java.lang.ClassNotFoundException:
org.apache.spark.deploy.yarn.history.YarnHistoryProvider


Thanks,
Ashish
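
If it helps: YarnHistoryProvider comes from the HDP add-on jars rather than the stock Spark
1.4 assembly, so the class being missing is expected with a plain Apache build. One
workaround (an assumption based on the error, not a verified HDP procedure) is to point the
history server back at Spark's built-in provider in conf/spark-defaults.conf, or remove the
HDP-added spark.history.provider line entirely:

spark.history.provider  org.apache.spark.deploy.history.FsHistoryProvider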


Re: RE: Spark or Storm

2015-06-19 Thread Ashish Soni
My understanding of exactly-once semantics is that it is handled inside the
framework itself, but it is not very clear from the documentation. I
believe the documentation needs to be updated with a simple example so that it
is clear to the end user. This is very critical to the decision when someone is
evaluating the framework and does not have enough time to validate all the
use cases but has to rely on the documentation.

Ashish

On Fri, Jun 19, 2015 at 7:10 AM, bit1...@163.com bit1...@163.com wrote:


 I think your observation is correct: you have to take care of these
 replayed data at your end, e.g. each message has a unique id or something else.

 I am using "I think" in the above sentence because I am not sure, and I
 also have a related question:
 I am wondering how direct stream + kafka is implemented when the Driver
 is down and restarted: will it always first replay the checkpointed failed
 batch, or will it honor Kafka's offset reset policy (auto.offset.reset)? If
 it honors the reset policy and it is set as smallest, then it is at-least-once
 semantics; if it is set as largest, then it will be at-most-once
 semantics?


 --
 bit1...@163.com


 *From:* Haopu Wang hw...@qilinsoft.com
 *Date:* 2015-06-19 18:47
 *To:* Enno Shioji eshi...@gmail.com; Tathagata Das t...@databricks.com
 *CC:* prajod.vettiyat...@wipro.com; Cody Koeninger c...@koeninger.org;
 bit1...@163.com; Jordan Pilat jrpi...@gmail.com; Will Briggs
 wrbri...@gmail.com; Ashish Soni asoni.le...@gmail.com; ayan guha
 guha.a...@gmail.com; user@spark.apache.org; Sateesh Kavuri
 sateesh.kav...@gmail.com; Spark Enthusiast sparkenthusi...@yahoo.in; 
 Sabarish
 Sasidharan sabarish.sasidha...@manthan.com
 *Subject:* RE: RE: Spark or Storm

 My question is not directly related: about the exactly-once semantic,
 the document (copied below) said spark streaming gives exactly-once
 semantic, but actually from my test result, with check-point enabled, the
 application always re-process the files in last batch after gracefully
 restart.



 ==
 *Semantics of Received Data*

 Different input sources provide different guarantees, ranging from *at-least
 once* to *exactly once*. Read for more details.
 *With Files*

 If all of the input data is already present in a fault-tolerant files
 system like HDFS, Spark Streaming can always recover from any failure and
 process all the data. This gives *exactly-once* semantics, that all the
 data will be processed exactly once no matter what fails.




  --

 *From:* Enno Shioji [mailto:eshi...@gmail.com]
 *Sent:* Friday, June 19, 2015 5:29 PM
 *To:* Tathagata Das
 *Cc:* prajod.vettiyat...@wipro.com; Cody Koeninger; bit1...@163.com;
 Jordan Pilat; Will Briggs; Ashish Soni; ayan guha; user@spark.apache.org;
 Sateesh Kavuri; Spark Enthusiast; Sabarish Sasidharan
 *Subject:* Re: RE: Spark or Storm



 Fair enough, on second thought, just saying that it should be idempotent
 is indeed more confusing.



 I guess the crux of the confusion comes from the fact that people tend to
 assume the work you described (store batch id and skip etc.) is handled by
 the framework, perhaps partly because Storm Trident does handle it (you
 just need to let Storm know if the output operation has succeeded or not,
 and it handles the batch id storing  skipping business). Whenever I
 explain people that one needs to do this additional work you described to
 get end-to-end exactly-once semantics, it usually takes a while to convince
 them. In my limited experience, they tend to interpret transactional in
 that sentence to mean that you just have to write to a transactional
 storage like ACID RDB. Pointing them to Semantics of output operations is
 usually sufficient though.



 Maybe others like @Ashish can weigh on this; did you interpret it in this
 way?



 What if we change the statement into:

 end-to-end exactly-once semantics (if your updates to downstream systems
 are idempotent or transactional). To learn how to make your updates
 idempotent or transactional, see the Semantics of output operations
 section in this chapter
 https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
 



 That way, it's clear that it's not sufficient to merely write to a
 transactional storage like ACID store.















 On Fri, Jun 19, 2015 at 9:08 AM, Tathagata Das t...@databricks.com
 wrote:

 If the current documentation is confusing, we can definitely improve the
 documentation. However, I dont not understand why is the term
 transactional confusing. If your output operation has to add 5, then the
 user has to implement the following mechanism



 1. If the unique id of the batch of data is already present in the store,
 then skip the update

 2. Otherwise atomically do both, the update operation as well as store the
 unique id of the batch. This is pretty much the definition of a
 transaction. The user has to be aware of the transactional semantics of the
 data

Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Ashish Soni
Hi ,

Is anyone able to install Spark 1.4 on HDP 2.2? Please let me know how
I can do the same.

Ashish


Re: Spark 1.4 on HortonWork HDP 2.2

2015-06-19 Thread Ashish Soni
I do not know where to start, as Spark 1.2 comes bundled with HDP 2.2, but I want
to use 1.4 and I do not know how to upgrade it to 1.4.

Ashish

On Fri, Jun 19, 2015 at 8:26 AM, ayan guha guha.a...@gmail.com wrote:

 what problem are you facing? are you trying to build it yourself or
 getting the pre-built version?

 On Fri, Jun 19, 2015 at 10:22 PM, Ashish Soni asoni.le...@gmail.com
 wrote:

 Hi ,

 Is any one able to install Spark 1.4 on HDP 2.2 , Please let me know how
 can i do the same ?

 Ashish




 --
 Best Regards,
 Ayan Guha



Spark on Yarn - How to configure

2015-06-19 Thread Ashish Soni
Can someone please let me know what I need to configure to have Spark
run using YARN?

There is a lot of documentation, but none of it says how and which files
need to be changed.

Let's say I have 4 nodes for Spark - SparkMaster, SparkSlave1, SparkSlave2,
SparkSlave3.

Now, on which node do which files need to be changed to make sure my master node
is SparkMaster and the slave nodes are 1, 2 and 3, and how do I tell / configure YARN?

Ashish
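
For what it's worth, when Spark runs on YARN there are no Spark master or worker daemons to
configure at all: YARN's ResourceManager and NodeManagers take those roles, so the main
requirement is that the machine you submit from can see the Hadoop client configuration.
A minimal sketch (the paths, class name and jar are placeholders for your own):

export HADOOP_CONF_DIR=/etc/hadoop/conf   # directory containing core-site.xml and yarn-site.xml
./bin/spark-submit --master yarn-cluster \
  --num-executors 3 --executor-memory 2g \
  --class com.example.MyApp myapp.jar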


Re: Spark or Storm

2015-06-17 Thread Ashish Soni
My Use case is below

We are going to receive a lot of events as a stream ( basically a Kafka stream )
and then we need to process and compute.

Consider you have a phone contract with AT&T, and every call / sms / data
usage you make is an event; then it needs to calculate your bill on a real-time
basis, so when you log in to your account you can see all those variables:
how much you used, how much is left and what your bill is to date.
Also there are different rules which need to be considered when you
calculate the total bill; one simple rule will be 0-500 min is free but
above that it is $1 a min.

How do I maintain a shared state ( total amount, total min, total data,
etc. ) so that I know how much I have accumulated at any given point, as events
for the same phone can go to any node / executor?

Can someone please tell me how I can achieve this in Spark? As in Storm, I
can have a bolt which can do this.

Thanks,
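
One minimal sketch of keeping such running totals with Spark Streaming's updateStateByKey
(an illustration only, not a recommendation over Storm: PhoneUsage and BillState are
hypothetical classes for your events and accumulated totals, events is an assumed
JavaPairDStream keyed by phone number, and the exactly-once caveats discussed in this
thread still apply):

import java.util.List;

import com.google.common.base.Optional;   // Spark 1.x uses Guava's Optional here

import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Requires jssc.checkpoint(...) to be set, since the state is checkpointed.
JavaPairDStream<String, BillState> bills = events.updateStateByKey(
    new Function2<List<PhoneUsage>, Optional<BillState>, Optional<BillState>>() {
        @Override
        public Optional<BillState> call(List<PhoneUsage> newEvents, Optional<BillState> state) {
            BillState current = state.or(new BillState());   // start from zero for a new phone
            for (PhoneUsage u : newEvents) {
                current.addMinutes(u.getMinutes());
                current.addData(u.getBytes());
                current.recomputeAmount();   // e.g. free up to 500 min, $1/min afterwards
            }
            return Optional.of(current);
        }
    });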



On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com wrote:

 I guess both. In terms of syntax, I was comparing it with Trident.

 If you are joining, Spark Streaming actually does offer windowed join out
 of the box. We couldn't use this though as our event stream can grow
 out-of-sync, so we had to implement something on top of Storm. If your
 event streams don't become out of sync, you may find the built-in join in
 Spark Streaming useful. Storm also has a join keyword but its semantics are
 different.


  Also, what do you mean by No Back Pressure ?

 So when a topology is overloaded, Storm is designed so that it will stop
 reading from the source. Spark on the other hand, will keep reading from
 the source and spilling it internally. This maybe fine, in fairness, but it
 does mean you have to worry about the persistent store usage in the
 processing cluster, whereas with Storm you don't have to worry because the
 messages just remain in the data store.

 Spark came up with the idea of rate limiting, but I don't feel this is as
 nice as back pressure because it's very difficult to tune it such that you
 don't cap the cluster's processing power but yet so that it will prevent
 the persistent storage to get used up.


 On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast 
 sparkenthusi...@yahoo.in wrote:

 When you say Storm, did you mean Storm with Trident or Storm?

 My use case does not have simple transformation. There are complex events
 that need to be generated by joining the incoming event stream.

 Also, what do you mean by No Back PRessure ?





   On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com
 wrote:


 We've evaluated Spark Streaming vs. Storm and ended up sticking with
 Storm.

 Some of the important draw backs are:
 Spark has no back pressure (receiver rate limit can alleviate this to a
 certain point, but it's far from ideal)
 There is also no exactly-once semantics. (updateStateByKey can achieve
 this semantics, but is not practical if you have any significant amount of
 state because it does so by dumping the entire state on every checkpointing)

 There are also some minor drawbacks that I'm sure will be fixed quickly,
 like no task timeout, not being able to read from Kafka using multiple
 nodes, data loss hazard with Kafka.

 It's also not possible to attain very low latency in Spark, if that's
 what you need.

 The pos for Spark is the concise and IMO more intuitive syntax,
 especially if you compare it with Storm's Java API.

 I admit I might be a bit biased towards Storm tho as I'm more familiar
 with it.

 Also, you can do some processing with Kinesis. If all you need to do is
 straight forward transformation and you are reading from Kinesis to begin
 with, it might be an easier option to just do the transformation in Kinesis.





 On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Whatever you write in bolts would be the logic you want to apply on your
 events. In Spark, that logic would be coded in map() or similar such
 transformations and/or actions. Spark doesn't enforce a structure for
 capturing your processing logic like Storm does.
 Regards
 Sab
 Probably overloading the question a bit.

 In Storm, Bolts have the functionality of getting triggered on events. Is
 that kind of functionality possible with Spark streaming? During each phase
 of the data processing, the transformed data is stored to the database and
 this transformed data should then be sent to a new pipeline for further
 processing

 How can this be achieved using Spark?



 On Wed, Jun 17, 2015 at 10:10 AM, Spark Enthusiast 
 sparkenthusi...@yahoo.in wrote:

 I have a use-case where a stream of Incoming events have to be aggregated
 and joined to create Complex events. The aggregation will have to happen at
 an interval of 1 minute (or less).

 The pipeline is :
   send events
  enrich event
 Upstream services --- KAFKA - event Stream
 Processor  Complex 

Twitter Heron: Stream Processing at Scale - Does Spark Address all the issues

2015-06-17 Thread Ashish Soni
Hi Sparkers ,

https://dl.acm.org/citation.cfm?id=2742788


Recently Twitter released a paper on Heron as a replacement for Apache Storm,
and I would like to know whether Apache Spark currently suffers from the
same issues they have outlined.


Any input / thought will be helpful.

Thanks,
Ashish


Re: Spark or Storm

2015-06-17 Thread Ashish Soni
A stream can also be processed in micro-batches, which is the main
reason behind Spark Streaming, so what is the difference?

Ashish

On Wed, Jun 17, 2015 at 9:04 AM, Enno Shioji eshi...@gmail.com wrote:

 PS just to elaborate on my first sentence, the reason Spark (not
 streaming) can offer exactly once semantics is because its update operation
 is idempotent. This is easy to do in a batch context because the input is
 finite, but it's harder in streaming context.

 On Wed, Jun 17, 2015 at 2:00 PM, Enno Shioji eshi...@gmail.com wrote:

 So Spark (not streaming) does offer exactly once. Spark Streaming
 however, can only do exactly once semantics *if the update operation is
 idempotent*. updateStateByKey's update operation is idempotent, because
 it completely replaces the previous state.

 So as long as you use Spark streaming, you must somehow make the update
 operation idempotent. Replacing the entire state is the easiest way to do
 it, but it's obviously expensive.

 The alternative is to do something similar to what Storm does. At that
 point, you'll have to ask though if just using Storm is easier than that.





 On Wed, Jun 17, 2015 at 1:50 PM, Ashish Soni asoni.le...@gmail.com
 wrote:

 As per my Best Understanding Spark Streaming offer Exactly once
 processing , is this achieve only through updateStateByKey or there is
 another way to do the same.

 Ashish

 On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi...@gmail.com wrote:

 In that case I assume you need exactly once semantics. There's no
 out-of-the-box way to do that in Spark. There is updateStateByKey, but it's
 not practical with your use case as the state is too large (it'll try to
 dump the entire intermediate state on every checkpoint, which would be
 prohibitively expensive).

 So either you have to implement something yourself, or you can use
 Storm Trident (or transactional low-level API).

 On Wed, Jun 17, 2015 at 1:26 PM, Ashish Soni asoni.le...@gmail.com
 wrote:

 My Use case is below

 We are going to receive lot of event as stream ( basically Kafka
 Stream ) and then we need to process and compute

 Consider you have a phone contract with ATT and every call / sms /
 data useage you do is an event and then it needs  to calculate your bill 
 on
 real time basis so when you login to your account you can see all those
 variable as how much you used and how much is left and what is your bill
 till date ,Also there are different rules which need to be considered when
 you calculate the total bill one simple rule will be 0-500 min it is free
 but above it is $1 a min.

 How do i maintain a shared state  ( total amount , total min , total
 data etc ) so that i know how much i accumulated at any given point as
 events for same phone can go to any node / executor.

 Can some one please tell me how can i achieve this is spark as in
 storm i can have a bolt which can do this ?

 Thanks,



 On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com
 wrote:

 I guess both. In terms of syntax, I was comparing it with Trident.

 If you are joining, Spark Streaming actually does offer windowed join
 out of the box. We couldn't use this though as our event stream can grow
 out-of-sync, so we had to implement something on top of Storm. If your
 event streams don't become out of sync, you may find the built-in join in
 Spark Streaming useful. Storm also has a join keyword but its semantics 
 are
 different.


  Also, what do you mean by No Back Pressure ?

 So when a topology is overloaded, Storm is designed so that it will
 stop reading from the source. Spark on the other hand, will keep reading
 from the source and spilling it internally. This maybe fine, in fairness,
 but it does mean you have to worry about the persistent store usage in 
 the
 processing cluster, whereas with Storm you don't have to worry because 
 the
 messages just remain in the data store.

 Spark came up with the idea of rate limiting, but I don't feel this
 is as nice as back pressure because it's very difficult to tune it such
 that you don't cap the cluster's processing power but yet so that it will
 prevent the persistent storage to get used up.


 On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast 
 sparkenthusi...@yahoo.in wrote:

 When you say Storm, did you mean Storm with Trident or Storm?

 My use case does not have simple transformation. There are complex
 events that need to be generated by joining the incoming event stream.

 Also, what do you mean by No Back PRessure ?





   On Wednesday, 17 June 2015 11:57 AM, Enno Shioji 
 eshi...@gmail.com wrote:


 We've evaluated Spark Streaming vs. Storm and ended up sticking with
 Storm.

 Some of the important draw backs are:
 Spark has no back pressure (receiver rate limit can alleviate this
 to a certain point, but it's far from ideal)
 There is also no exactly-once semantics. (updateStateByKey can
 achieve this semantics, but is not practical if you have any significant
 amount of state because it does

Re: Spark or Storm

2015-06-17 Thread Ashish Soni
As per my best understanding, Spark Streaming offers exactly-once processing;
is this achieved only through updateStateByKey, or is there another way to
do the same?

Ashish

On Wed, Jun 17, 2015 at 8:48 AM, Enno Shioji eshi...@gmail.com wrote:

 In that case I assume you need exactly once semantics. There's no
 out-of-the-box way to do that in Spark. There is updateStateByKey, but it's
 not practical with your use case as the state is too large (it'll try to
 dump the entire intermediate state on every checkpoint, which would be
 prohibitively expensive).

 So either you have to implement something yourself, or you can use Storm
 Trident (or transactional low-level API).

 On Wed, Jun 17, 2015 at 1:26 PM, Ashish Soni asoni.le...@gmail.com
 wrote:

 My Use case is below

 We are going to receive lot of event as stream ( basically Kafka Stream )
 and then we need to process and compute

 Consider you have a phone contract with ATT and every call / sms / data
 useage you do is an event and then it needs  to calculate your bill on real
 time basis so when you login to your account you can see all those variable
 as how much you used and how much is left and what is your bill till date
 ,Also there are different rules which need to be considered when you
 calculate the total bill one simple rule will be 0-500 min it is free but
 above it is $1 a min.

 How do i maintain a shared state  ( total amount , total min , total data
 etc ) so that i know how much i accumulated at any given point as events
 for same phone can go to any node / executor.

 Can some one please tell me how can i achieve this is spark as in storm i
 can have a bolt which can do this ?

 Thanks,



 On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com wrote:

 I guess both. In terms of syntax, I was comparing it with Trident.

 If you are joining, Spark Streaming actually does offer windowed join
 out of the box. We couldn't use this though as our event stream can grow
 out-of-sync, so we had to implement something on top of Storm. If your
 event streams don't become out of sync, you may find the built-in join in
 Spark Streaming useful. Storm also has a join keyword but its semantics are
 different.


  Also, what do you mean by No Back Pressure ?

 So when a topology is overloaded, Storm is designed so that it will stop
 reading from the source. Spark on the other hand, will keep reading from
 the source and spilling it internally. This maybe fine, in fairness, but it
 does mean you have to worry about the persistent store usage in the
 processing cluster, whereas with Storm you don't have to worry because the
 messages just remain in the data store.

 Spark came up with the idea of rate limiting, but I don't feel this is
 as nice as back pressure because it's very difficult to tune it such that
 you don't cap the cluster's processing power but yet so that it will
 prevent the persistent storage to get used up.


 On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast 
 sparkenthusi...@yahoo.in wrote:

 When you say Storm, did you mean Storm with Trident or Storm?

 My use case does not have simple transformation. There are complex
 events that need to be generated by joining the incoming event stream.

 Also, what do you mean by No Back PRessure ?





   On Wednesday, 17 June 2015 11:57 AM, Enno Shioji eshi...@gmail.com
 wrote:


 We've evaluated Spark Streaming vs. Storm and ended up sticking with
 Storm.

 Some of the important draw backs are:
 Spark has no back pressure (receiver rate limit can alleviate this to a
 certain point, but it's far from ideal)
 There is also no exactly-once semantics. (updateStateByKey can achieve
 this semantics, but is not practical if you have any significant amount of
 state because it does so by dumping the entire state on every 
 checkpointing)

 There are also some minor drawbacks that I'm sure will be fixed
 quickly, like no task timeout, not being able to read from Kafka using
 multiple nodes, data loss hazard with Kafka.

 It's also not possible to attain very low latency in Spark, if that's
 what you need.

 The pos for Spark is the concise and IMO more intuitive syntax,
 especially if you compare it with Storm's Java API.

 I admit I might be a bit biased towards Storm tho as I'm more familiar
 with it.

 Also, you can do some processing with Kinesis. If all you need to do is
 straight forward transformation and you are reading from Kinesis to begin
 with, it might be an easier option to just do the transformation in 
 Kinesis.





 On Wed, Jun 17, 2015 at 7:15 AM, Sabarish Sasidharan 
 sabarish.sasidha...@manthan.com wrote:

 Whatever you write in bolts would be the logic you want to apply on
 your events. In Spark, that logic would be coded in map() or similar such
 transformations and/or actions. Spark doesn't enforce a structure for
 capturing your processing logic like Storm does.
 Regards
 Sab
 Probably overloading the question a bit.

 In Storm, Bolts have the functionality of getting

Re: Spark or Storm

2015-06-17 Thread Ashish Soni
@Enno
As per the latest version and documentation, Spark Streaming does offer
exactly-once semantics using the improved Kafka integration. Note: I have not
tested it yet.

Any feedback will be helpful if anyone has tried the same.

http://koeninger.github.io/kafka-exactly-once/#7

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html



On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com wrote:

 AFAIK KCL is *supposed* to provide fault tolerance and load balancing
 (plus additionally, elastic scaling unlike Storm), Kinesis providing the
 coordination. My understanding is that it's like a naked Storm worker
 process that can consequently only do map.

 I haven't really used it tho, so can't really comment how it compares to
 Spark/Storm. Maybe somebody else will be able to comment.



 On Wed, Jun 17, 2015 at 3:13 PM, ayan guha guha.a...@gmail.com wrote:

 Thanks for this. It's a KCL-based Kinesis application. But because it's just
 a Java application, we are thinking of using Spark on EMR or Storm for fault
 tolerance and load balancing. Is that a correct approach?
 On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com wrote:

 Hi Ayan,

 Admittedly I haven't done much with Kinesis, but if I'm not mistaken you
 should be able to use their processor interface for that. In this
 example, it's incrementing a counter:
 https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java

 Instead of incrementing a counter, you could do your transformation and
 send it to HBase.






 On Wed, Jun 17, 2015 at 1:40 PM, ayan guha guha.a...@gmail.com wrote:

 Great discussion!!

 One qs about some comment: Also, you can do some processing with
 Kinesis. If all you need to do is straight forward transformation and you
 are reading from Kinesis to begin with, it might be an easier option to
 just do the transformation in Kinesis

 - Do you mean KCL application? Or some kind of processing withinKineis?

 Can you kindly share a link? I would definitely pursue this route as
 our transformations are really simple.

 Best

 On Wed, Jun 17, 2015 at 10:26 PM, Ashish Soni asoni.le...@gmail.com
 wrote:

 My Use case is below

 We are going to receive lot of event as stream ( basically Kafka
 Stream ) and then we need to process and compute

 Consider you have a phone contract with ATT and every call / sms /
 data useage you do is an event and then it needs  to calculate your bill 
 on
 real time basis so when you login to your account you can see all those
 variable as how much you used and how much is left and what is your bill
 till date ,Also there are different rules which need to be considered when
 you calculate the total bill one simple rule will be 0-500 min it is free
 but above it is $1 a min.

 How do i maintain a shared state  ( total amount , total min , total
 data etc ) so that i know how much i accumulated at any given point as
 events for same phone can go to any node / executor.

 Can some one please tell me how can i achieve this is spark as in
 storm i can have a bolt which can do this ?

 Thanks,



 On Wed, Jun 17, 2015 at 4:52 AM, Enno Shioji eshi...@gmail.com
 wrote:

 I guess both. In terms of syntax, I was comparing it with Trident.

 If you are joining, Spark Streaming actually does offer windowed join
 out of the box. We couldn't use this though as our event stream can grow
 out-of-sync, so we had to implement something on top of Storm. If your
 event streams don't become out of sync, you may find the built-in join in
 Spark Streaming useful. Storm also has a join keyword but its semantics 
 are
 different.


  Also, what do you mean by No Back Pressure ?

 So when a topology is overloaded, Storm is designed so that it will
 stop reading from the source. Spark on the other hand, will keep reading
 from the source and spilling it internally. This maybe fine, in fairness,
 but it does mean you have to worry about the persistent store usage in 
 the
 processing cluster, whereas with Storm you don't have to worry because 
 the
 messages just remain in the data store.

 Spark came up with the idea of rate limiting, but I don't feel this
 is as nice as back pressure because it's very difficult to tune it such
 that you don't cap the cluster's processing power but yet so that it will
 prevent the persistent storage to get used up.


 On Wed, Jun 17, 2015 at 9:33 AM, Spark Enthusiast 
 sparkenthusi...@yahoo.in wrote:

 When you say Storm, did you mean Storm with Trident or Storm?

 My use case does not have simple transformation. There are complex
 events that need to be generated by joining the incoming event stream.

 Also, what do you mean by No Back PRessure ?





   On Wednesday, 17 June 2015 11:57 AM, Enno Shioji 
 eshi...@gmail.com wrote:


 We've evaluated Spark Streaming vs. Storm and ended up sticking with
 Storm.

 Some of the important draw backs