Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-26 Thread Akhil Das
If you open up the driver UI (running on port 4040), you can see multiple tasks per stage which will be happening concurrently. If it is a single task and you want to increase the parallelism, then you can simply do a repartition. Thanks Best Regards On Tue, May 26, 2015 at 8:27 AM,
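A minimal Scala sketch of that suggestion, assuming an existing RDD named rdd and the default SparkContext sc; the multiplier is arbitrary:

    // bump the partition count so the next stage runs with more tasks
    val repartitioned = rdd.repartition(sc.defaultParallelism * 2)
    repartitioned.count()  // this action now shows multiple concurrent tasks in the UI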

Remove COMPLETED applications and shuffle data

2015-05-26 Thread sayantini
Hi All, Please help me with the 2 issues below: *Environment:* I am running my spark cluster in standalone mode. I am initializing the spark context from inside my tomcat server. I am setting the below properties in environment.sh in the $SPARK_HOME/conf directory

Re: HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-05-26 Thread Mohammad Islam
I got a similar problem. I'm not sure if your problem is already resolved. For the record, I solved this type of error by calling sc.setMaster("yarn-cluster"). If you find the solution, please let us know. Regards, Mohammad On Friday, March 6, 2015 2:47 PM, nitinkak001

Re: Remove COMPLETED applications and shuffle data

2015-05-26 Thread Akhil Das
Try these: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size").set("spark.executor.logs.rolling.size.maxBytes", "1024").set("spark.executor.logs.rolling.maxRetainedFiles", "3") You can also look into

Caching parquet table (with GZIP) on Spark 1.3.1

2015-05-26 Thread shshann
We tried to cache a table through hiveCtx = HiveContext(sc); hiveCtx.cacheTable("table name") as described in Spark 1.3.1's documentation. We're on CDH 5.3.0 with Spark 1.3.1 built with Hadoop 2.6. The following error message occurs if we try to cache a table in Parquet format with GZIP, though we're not

Re: Using Log4j for logging messages inside lambda functions

2015-05-26 Thread Spico Florin
Hello! Thank you all for your answers. Akhil's proposed solution works fine. Thanks. Florin On Tue, May 26, 2015 at 3:08 AM, Wesley Miao wesley.mi...@gmail.com wrote: The reason it didn't work for you is that the function you registered with someRdd.map will be running on the
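The thread doesn't show Akhil's exact answer; one common pattern for executor-side logging, sketched here assuming a Scala RDD named someRdd, is to keep the logger behind a lazy val in an object so it is created on each executor rather than serialized with the closure:

    import org.apache.log4j.Logger

    object ExecutorSideLogging {
      // created lazily in each executor JVM, never shipped from the driver
      lazy val log: Logger = Logger.getLogger("ExecutorSideLogging")
    }

    val doubled = someRdd.map { x =>
      ExecutorSideLogging.log.info(s"processing $x")  // runs on the executor
      x * 2
    }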

RE: Websphere MQ as a data source for Apache Spark Streaming

2015-05-26 Thread Chaudhary, Umesh
Thanks for the suggestion, I will try and post the outcome. From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com] Sent: Monday, May 25, 2015 12:24 PM To: Chaudhary, Umesh; user@spark.apache.org Subject: Re: Websphere MQ as a data source for Apache Spark Streaming Hi Umesh, You can write a

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
An Executor is a JVM instance spawned and running on a Cluster Node (Server machine). Task is essentially a JVM Thread – you can have as many Threads as you want per JVM. You will also hear about “Executor Slots” – these are essentially the CPU Cores available on the machine and granted for use

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
This is the first time I hear that “one can specify the RAM per task” – the RAM is granted per Executor (JVM). On the other hand each Task operates on ONE RDD Partition – so you can say that this is “the RAM allocated to the Task to process” – but it is still within the boundaries allocated to
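A configuration sketch of the distinction Evo describes; the values are placeholders, and whether spark.executor.cores applies depends on the cluster manager:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g") // heap for the whole executor JVM, shared by its task threads
      .set("spark.executor.cores", "4")   // "executor slots": concurrent task threads per executor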

Collabrative Filtering

2015-05-26 Thread Yasemin Kaya
Hi, In CF String path = "data/mllib/als/test.data"; JavaRDD<String> data = sc.textFile(path); JavaRDD<Rating> ratings = data.map(new Function<String, Rating>() { public Rating call(String s) { String[] sarray = s.split(","); return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]),
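A Scala sketch of the same parsing step carried through to model training; the data path matches the MLlib example, and the ALS parameters (rank, iterations, lambda) are placeholders:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    val data = sc.textFile("data/mllib/als/test.data")
    val ratings = data.map { line =>
      val Array(user, product, rating) = line.split(",")
      Rating(user.toInt, product.toInt, rating.toDouble)
    }
    val model = ALS.train(ratings, 10, 10, 0.01)  // rank, iterations, lambda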

Parallel parameter tuning: distributed execution of MLlib algorithms

2015-05-26 Thread hugof
Hi, I am currently experimenting with linear regression (SGD) (Spark + MLlib, ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I do this (for now) by an exhaustive grid search of the step size and the number of iterations. Currently I am on a dual core that acts as a
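A sketch of such an exhaustive grid search with MLlib's SGD-based linear regression; trainingData and validationData are assumed RDD[LabeledPoint] splits, and the candidate values are examples only:

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    def gridSearch(trainingData: RDD[LabeledPoint],
                   validationData: RDD[LabeledPoint]): (Double, Int, Double) = {
      val candidates = for {
        stepSize      <- Seq(0.001, 0.01, 0.1, 1.0)
        numIterations <- Seq(50, 100, 200)
      } yield {
        val model = LinearRegressionWithSGD.train(trainingData, numIterations, stepSize)
        val mse = validationData.map { p =>
          val err = model.predict(p.features) - p.label
          err * err
        }.mean()
        (stepSize, numIterations, mse)
      }
      candidates.minBy(_._3)  // (stepSize, numIterations, mse) of the best model
    }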

spark-streaming-kafka_2.11 not available yet?

2015-05-26 Thread Petr Novak
Hello, I would like to switch from Scala 2.10 to 2.11 for Spark app development. It seems that the only thing blocking me is a missing spark-streaming-kafka_2.11 maven package. Any plan to add it or am I just blind? Many thanks, Vladimir

How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
Since spark can run multiple tasks in one executor, I am curious how spark manages memory across these tasks. Say one executor takes 1GB of memory; if this executor can run 10 tasks simultaneously, then each task can consume 100MB on average. Do I understand it correctly? It

Re: Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-26 Thread luohui20001
Thanks Akhil, I checked the job UI again; my app is running concurrently in all the executors. But some of the tasks got an I/O exception. I will continue inspecting this. java.io.IOException: Failed to create local dir in

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
Yes, I know that one task represents a JVM thread. This is what confuses me. Usually users want to specify memory at the task level, so how can I do that if a task is thread-level and multiple tasks run in the same executor? And I don't even know how many threads there will be. Besides that, if one

Re: How many executors can I acquire in standalone mode ?

2015-05-26 Thread Arush Kharbanda
I believe you would be restricted by the number of cores you have in your cluster. Having a worker running without a core is useless. On Tue, May 26, 2015 at 3:04 PM, canan chen ccn...@gmail.com wrote: In spark standalone mode, there will be one executor per worker. I am wondering how many
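In standalone mode an app is greedy by default and grabs every free core; spark.cores.max caps the total it will acquire. A sketch, with a placeholder master URL:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("spark://master:7077")   // placeholder
      .setAppName("capped-app")
      .set("spark.cores.max", "8")        // take at most 8 cores across the cluster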

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
I think the concept of a task in Spark should be on the same level as a task in MR. Usually in MR, we need to specify the memory for each mapper/reducer task. And I believe an executor is not a user-facing concept; it's a Spark-internal concept. Spark users don't need to know the concept of

Apache Spark application deployment best practices

2015-05-26 Thread lucas1000001
Hi, I have a couple of use cases for Apache Spark applications/scripts, generally of the following form: *General ETL use case* - more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families.

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-05-26 Thread Iulian Dragoș
On Tue, May 26, 2015 at 10:09 AM, algermissen1971 algermissen1...@icloud.com wrote: Hi, I am setting up a project that requires Kafka support and I wonder what the roadmap is for Scala 2.11 Support (including Kafka). Can we expect to see 2.11 support anytime soon? The upcoming 1.4

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Arush Kharbanda
Hi Evo, Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you would be able to run multiple executors on the same JVM/worker. https://issues.apache.org/jira/browse/SPARK-1706. Thanks Arush On Tue, May 26, 2015 at 2:54 PM, canan chen ccn...@gmail.com wrote: I think the

How many executors can I acquire in standalone mode ?

2015-05-26 Thread canan chen
In spark standalone mode, there will be one executor per worker. I am wondering how many executors I can acquire when I submit an app. Is it greedy mode (as many as I can acquire)?

Re: HiveContext test, Spark Context did not initialize after waiting 10000ms

2015-05-26 Thread Nitin kak
That is a much better solution than how I resolved it. I got around it by placing comma separated jar paths for all the hive related jars in --jars clause. I will try your solution. Thanks for sharing it. On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam misla...@yahoo.com wrote: I got a similar

Re: Implementing custom RDD in Java

2015-05-26 Thread Alex Robbins
I know it isn't exactly what you are asking for, but you could solve it like this: Driver program queries dynamo for the s3 file keys. sc.textFile each of the file keys and .union them all together to make your RDD. You could wrap that up in a function and it wouldn't be too painful to reuse. I
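A sketch of that approach in Scala; the S3 bucket is made up, and the keys would come from the DynamoDB query run in the driver:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def rddFromS3Keys(sc: SparkContext, keys: Seq[String]): RDD[String] = {
      val parts = keys.map(key => sc.textFile(s"s3n://my-bucket/$key"))
      sc.union(parts)  // one logical RDD backed by all the files
    }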

Re: Tasks randomly stall when running on mesos

2015-05-26 Thread Reinis Vicups
Hi, I just configured my cluster to run with 1.4.0-rc2; alas, the dependency jungle does not let one just download, configure and start. Instead one will have to fiddle with sbt settings for the upcoming couple of nights: 2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor -

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
The link you sent says multiple executors per node. A Worker is just a daemon process launching Executors / JVMs so it can execute tasks - it does that by cooperating with the master and the driver. There is a one-to-one mapping between Executor and JVM.

Re: spark-streaming-kafka_2.11 not available yet?

2015-05-26 Thread Cody Koeninger
It's being added in 1.4 https://repository.apache.org/content/repositories/orgapachespark-1104/org/apache/spark/spark-streaming-kafka_2.11/1.4.0-rc2/ On Tue, May 26, 2015 at 3:14 AM, Petr Novak oss.mli...@gmail.com wrote: Hello, I would like to switch from Scala 2.10 to 2.11 for Spark app
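For reference, the sbt coordinate this implies once 1.4.0 is released (assuming the artifact keeps the same id; with scalaVersion 2.11 the %% expands to spark-streaming-kafka_2.11):

    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.4.0"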

Accumulators in Spark Streaming on UI

2015-05-26 Thread Snehal Nagmote
Hello all, I have an accumulator in a Spark Streaming application which counts the number of events received from Kafka. From the documentation, it seems the Spark UI has support to display it, but I am unable to see it on the UI. I am using Spark 1.3.1. Do I need to call any method (print), or am I missing

PySpark Unknown Opcode Error

2015-05-26 Thread Nikhil Muralidhar
Hello, I am trying to run a Spark job (which runs fine on the master node of the cluster) on an HDFS Hadoop cluster using YARN. When I run the job, which has an rdd.saveAsTextFile() line in it, I get the following error: *SystemError: unknown opcode* The entire stacktrace has been appended to

Re: Accumulators in Spark Streaming on UI

2015-05-26 Thread Justin Pihony
You need to make sure to name the accumulator. On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote nagmote.sne...@gmail.com wrote: Hello all, I have accumulator in spark streaming application which counts number of events received from Kafka. From the documentation , It seems Spark UI has
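A sketch of a named accumulator, which is what the Spark UI displays; ssc (a StreamingContext) and kafkaStream are assumed from the original setup:

    // only *named* accumulators appear on the UI's stage pages
    val eventCount = ssc.sparkContext.accumulator(0L, "Kafka events received")

    kafkaStream.foreachRDD { rdd =>
      eventCount += rdd.count()
    }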

Re: Running Javascript from scala spark

2015-05-26 Thread Ted Yu
Have you looked at https://github.com/spark/sparkjs ? Cheers On Tue, May 26, 2015 at 10:17 AM, marcos rebelo ole...@gmail.com wrote: Hi all, My first message on this mailing list: I need to run JavaScript on Spark. Somehow I would like to use the ScriptEngineManager or any other way that

Re: Running Javascript from scala spark

2015-05-26 Thread Marcelo Vanzin
Is it just me or does that look completely unrelated to Spark-the-Apache-project? On Tue, May 26, 2015 at 10:55 AM, Ted Yu yuzhih...@gmail.com wrote: Have you looked at https://github.com/spark/sparkjs ? Cheers On Tue, May 26, 2015 at 10:17 AM, marcos rebelo ole...@gmail.com wrote: Hi

Re: Recommended Scala version

2015-05-26 Thread Dean Wampler
Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the Spark project has published maven artifacts that are compiled with 2.11 and 2.10, although the downloads at http://spark.apache.org/downloads.html are still all for 2.10. Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Re: Recommended Scala version

2015-05-26 Thread Ted Yu
w.r.t. Kafka library, see https://repository.apache.org/content/repositories/orgapachespark-1104/org/apache/spark/spark-streaming-kafka_2.11/1.4.0-rc2/ FYI On Tue, May 26, 2015 at 8:33 AM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Yes, recommended version is 2.10 as all the

Re: Recommended Scala version

2015-05-26 Thread Koert Kuipers
we are still running into issues with spark-shell not working on 2.11, but we are running on a somewhat older master, so maybe that has been resolved already. On Tue, May 26, 2015 at 11:48 AM, Dean Wampler deanwamp...@gmail.com wrote: Most of the 2.11 issues are being resolved in Spark 1.4. For a

Need some Cassandra integration help

2015-05-26 Thread Yana Kadiyska
Hi folks, for those of you working with Cassandra, I'm wondering if anyone has been successful processing a mix of Cassandra and HDFS data. I have a dataset which is stored partially in HDFS and partially in Cassandra (the schema is the same in both places). I am trying to do the following: val dfHDFS =
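One possible shape of the job, assuming Spark 1.4's DataFrameReader and the spark-cassandra-connector's DataFrame source; paths, keyspace and table names are placeholders:

    val dfHDFS = sqlContext.read.parquet("hdfs:///data/events")
    val dfCass = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "events", "keyspace" -> "ks"))
      .load()

    // identical schemas on both sides, so the two can be combined and queried together
    val combined = dfHDFS.unionAll(dfCass)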

Re: PySpark Unknown Opcode Error

2015-05-26 Thread Davies Liu
This should be the case where you run different versions of Python on the driver and the slaves; Spark 1.4 (which will be released soon) will double-check that. SPARK_PYTHON should be PYSPARK_PYTHON. On Tue, May 26, 2015 at 11:21 AM, Nikhil Muralidhar nmural...@gmail.com wrote: Hello, I am trying to run a

Re: Running Javascript from scala spark

2015-05-26 Thread andy petrella
Yop, why not use, as you said, a JS engine like Rhino? But then I would suggest using mapPartitions instead, so only one engine is created per partition. Probably broadcasting the script is also a good thing to do. I guess it's for ad hoc transformations passed by a remote client, otherwise you could simply
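A sketch of that suggestion using the JDK's javax.script API (Rhino on Java 7, Nashorn on Java 8); records is an assumed RDD[String] and the script is just an example:

    import javax.script.{Invocable, ScriptEngineManager}

    val script = """function transform(x) { return x.toUpperCase(); }"""
    val scriptBc = sc.broadcast(script)  // ship the JS source to the executors once

    val transformed = records.mapPartitions { iter =>
      // one engine per partition, as suggested, rather than one per record
      val engine = new ScriptEngineManager().getEngineByName("JavaScript")
      engine.eval(scriptBc.value)
      val invocable = engine.asInstanceOf[Invocable]
      iter.map(r => invocable.invokeFunction("transform", r).toString)
    }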

Re: Running Javascript from scala spark

2015-05-26 Thread marcos rebelo
Hi all, Let me be clear: I'm speaking of Spark (big data, map/reduce, Hadoop, ... related). I have multiple map/flatMap/groupBy steps and one of the steps needs to be a map passing each item through JavaScript code. 2 Questions: - Is this question related to this list? - Did someone do something

Re: DataFrame. Conditional aggregation

2015-05-26 Thread Masf
Hi, I don't know how it works. For example: val result = joinedData.groupBy("col1", "col2").agg( count(lit(1)).as("counter"), min("col3").as("minimum"), sum(case when endrscp 100 then 1 else 0 end).as("test") ) How can I do it? Thanks Regards. Miguel. On Tue, May 26, 2015 at 12:35 AM, ayan guha

SparkR Jobs Hanging in collectPartitions

2015-05-26 Thread Eskilson,Aleksander
I've been attempting to run a SparkR translation of a similar Scala job that identifies words from a corpus not existing in a newline-delimited dictionary. The R code is: dict <- SparkR:::textFile(sc, src1) corpus <- SparkR:::textFile(sc, src2) words <- distinct(SparkR:::flatMap(corpus,

Recommended Scala version

2015-05-26 Thread Punyashloka Biswal
Dear Spark developers and users, Am I correct in believing that the recommended version of Scala to use with Spark is currently 2.10? Is there any plan to switch to 2.11 in future? Are there any advantages to using 2.11 today? Regards, Punya

RE: spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-26 Thread Wang, Ningjun (LNG-NPV)
It is Hadoop 2.4.0 with Spark 1.3.0. I found that the problem only happens if there are multiple nodes. If the cluster has only one node, it works fine. For example, if the cluster has a spark-master on machine A and a spark-worker on machine B, the problem happens. If both spark-master and

Building scaladoc using build/sbt unidoc failure

2015-05-26 Thread Justin Yip
Hello, I am trying to build the scaladoc from the 1.4 branch, but it failed due to [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode) I followed the instructions on github

Re: DataFrame. Conditional aggregation

2015-05-26 Thread ayan guha
For this, I can give you a SQL solution: joinedData.registerTempTable('j') Res = ssc.sql('select col1, col2, count(1) counter, min(col3) minimum, sum(case when endrscp 100 then 1 else 0 end) test from j') Let me know if this works. On 26 May 2015 23:47, Masf masfwo...@gmail.com wrote: Hi I don't
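The same thing expressed with the DataFrame API, assuming Spark 1.4's when/otherwise functions; the comparison operator on endrscp is a guess, since it was lost from the original message:

    import org.apache.spark.sql.functions._

    val result = joinedData.groupBy("col1", "col2").agg(
      count(lit(1)).as("counter"),
      min("col3").as("minimum"),
      sum(when(col("endrscp") > 100, 1).otherwise(0)).as("test")  // operator assumed
    )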

Re: Running Javascript from scala spark

2015-05-26 Thread ayan guha
Yes, you are in the right mailing list, for sure :) Regarding your question, I am sure you are well versed in how Spark works. Essentially you can run any arbitrary function with a map call and it will run on the remote nodes. Hence you need to install any needed dependency on all nodes. You can also pass

DataFrame.explode produces field with wrong type.

2015-05-26 Thread Eugene Morozov
Hi! I create a DataFrame using the following: JavaRDD<Row> rows = ... StructType structType = ... Then I apply sqlContext.createDataFrame(rows, structType). I have a pretty complex schema: root |-- Id: long (nullable = true) |-- attributes: struct (nullable = true) | |-- FirstName: array

Running Javascript from scala spark

2015-05-26 Thread marcos rebelo
Hi all, My first message on this mailing list: I need to run JavaScript on Spark. Somehow I would like to use the ScriptEngineManager or any other way that makes Rhino do the work for me. Consider that I have a Structure that needs to be changed by a JavaScript. I will have a set of Javascript