Re: Issue with zip and partitions

2014-04-02 Thread Xiangrui Meng
From API docs: Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the
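
For reference, a minimal Scala sketch of that contract (names are illustrative): zip lines up elements pairwise, so it works when one RDD was derived from the other by a map.

    val a = sc.parallelize(1 to 4, 2)  // 2 partitions, 2 elements each
    val b = a.map(_ * 10)              // same partitioning, same counts
    val pairs = a.zip(b)               // RDD[(Int, Int)]
    pairs.collect()                    // Array((1,10), (2,20), (3,30), (4,40))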

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Patrick Wendell
It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package On Tue, Apr 1, 2014 at 5:15 PM, Vipul Pandey vipan...@gmail.com wrote: how do you recommend building that - it says [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly

Re: How to index each map operation????

2014-04-02 Thread Shixiong Zhu
Hi Thierry, Your code does not work if @yh18190 wants a global counter. An RDD may have more than one partition, and cnt will be reset to -1 for each partition. You can try the following code: scala> val rdd = sc.parallelize( (1, 'a') :: (2, 'b') :: (3, 'c') :: (4, 'd') :: Nil) rdd:
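
For reference, a hedged sketch of the two-pass approach this implies (count each partition, then offset), with illustrative names; newer Spark releases expose RDD.zipWithIndex, which does the same thing:

    val rdd = sc.parallelize((1, 'a') :: (2, 'b') :: (3, 'c') :: (4, 'd') :: Nil, 2)
    // Pass 1: per-partition element counts, collected to the driver.
    val counts = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
                    .collect().sortBy(_._1).map(_._2)
    // Cumulative offsets: partition i starts at global index offsets(i).
    val offsets = counts.scanLeft(0L)(_ + _)
    // Pass 2: add the partition offset to each element's local index.
    val indexed = rdd.mapPartitionsWithIndex { (i, it) =>
      it.zipWithIndex.map { case (x, j) => (offsets(i) + j, x) }
    }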

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Debasish Das
I think multiplying by ratings is a heuristic that worked on rating-related problems like the Netflix dataset or other ratings datasets, but the scope of NMF is much broader than that. @Sean please correct me in case you don't agree... Definitely it's good to add all the rating-dataset-related

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-02 Thread Vipul Pandey
I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus generated also has both the shaded and the real versions of the protobuf classes. Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv ./assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar |

Re: How to index each map operation????

2014-04-02 Thread yh18190
Hi Thierry, Thanks for the above responses. I implemented it using RangePartitioner. We need to use one of the custom partitioners in order to perform this task. Normally you can't maintain a counter, because count operations should be performed on each partitioned block of data... -- View this message in

CDH5 Spark on EC2

2014-04-02 Thread Denny Lee
I've been able to get CDH5 up and running on EC2 and, according to Cloudera Manager, Spark is running healthy. But when I try to run spark-shell, I eventually get the error: 14/04/02 07:18:18 INFO client.AppClient$ClientActor: Connecting to master spark://ip-172-xxx-xxx-xxx:7077... 14/04/02

Re: Status of MLI?

2014-04-02 Thread Krakna H
Thanks for the update Evan! In terms of using MLI, I see that the Github code is linked to Spark 0.8; will it not work with 0.9 (which is what I have set up) or higher versions? On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List] ml-node+s1001560n3615...@n3.nabble.com

Re: possible bug in Spark's ALS implementation...

2014-04-02 Thread Sean Owen
It should be kept in mind that different implementations are rarely strictly better, and that what works well on one type of data might not on another. It also bears keeping in mind that several of these differences just amount to different amounts of regularization, which need not be a

ActorNotFound problem for mesos driver

2014-04-02 Thread Leon Zhang
Hi, Spark Devs: I encountered a problem showing the error message akka.actor.ActorNotFound on our Mesos mini-cluster. mesos: 0.17.0 spark: spark-0.9.0-incubating spark-env.sh: #!/usr/bin/env bash export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so export SPARK_EXECUTOR_URI=hdfs://

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
Heya, Yep this is a problem in the Mesos scheduler implementation that was fixed after 0.9.0 (https://spark-project.atlassian.net/browse/SPARK-1052 = MesosSchedulerBackend). So there are several options, like applying the patch or upgrading to 0.9.1 :-/ Cheers, Andy On Wed, Apr 2, 2014 at 5:30 PM,

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread Leon Zhang
Aha, thank you for your kind reply. Upgrading to 0.9.1 is a good choice. :) On Wed, Apr 2, 2014 at 11:35 PM, andy petrella andy.petre...@gmail.com wrote: Heya, Yep this is a problem in the Mesos scheduler implementation that has been fixed after 0.9.0

Re: ActorNotFound problem for mesos driver

2014-04-02 Thread andy petrella
np ;-) On Wed, Apr 2, 2014 at 5:50 PM, Leon Zhang leonca...@gmail.com wrote: Aha, thank you for your kind reply. Upgrading to 0.9.1 is a good choice. :) On Wed, Apr 2, 2014 at 11:35 PM, andy petrella andy.petre...@gmail.com wrote: Heya, Yep this is a problem in the Mesos scheduler

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how an RDD is resilient? If one of the partitions is lost, who is responsible for recreating that partition - is it the driver program?

Print line in JavaNetworkWordCount

2014-04-02 Thread Eduardo Costa Alfaia
Hi Guys, I would like to print the content of line in: JavaDStream<String> lines = ssc.socketTextStream(args[1], Integer.parseInt(args[2])); JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() { @Override public Iterable<String> call(String x) {

Re: Need suggestions

2014-04-02 Thread andy petrella
TL;DR Your classes are missing on the workers; pass the jar containing the class main.scala.Utils to the SparkContext. Longer: I'm missing some information, like how the SparkContext is configured, but my best guess is that you didn't provide the jars (addJars on SparkConf, or use the SC's constructor
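
For reference, a minimal sketch of passing the application jar through SparkConf (master URL and jar path are hypothetical):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("MyApp")
      .setJars(Seq("/home/me/target/scala-2.10/simple-project_2.10-1.0.jar"))
    val sc = new org.apache.spark.SparkContext(conf)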

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file:

    # Python 2.6
    def partitionRDD(rdd, numPartitions):
        counter = {'a': 0}
        def count_up(x):
            counter['a'] += 1
            return counter['a']
        return (rdd.keyBy(count_up)

Re: Need suggestions

2014-04-02 Thread andy petrella
Sorry, I was perhaps not clear; anyway, could you try with the path in the *List* being the absolute one, e.g. List(/home/yh/src/pj/spark-stuffs/target/scala-2.10/simple-project_2.10-1.0.jar)? In order to provide a relative path, you need first to figure out your CWD, so you can do (to be really

Re: Spark output compression on HDFS

2014-04-02 Thread Patrick Wendell
For textFile I believe we overload it and let you set a codec directly: https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59 For saveAsSequenceFile yep, I think Mark is right, you need an option. On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
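
For reference, a sketch of both calls (output paths are hypothetical):

    import org.apache.hadoop.io.compress.GzipCodec
    val pairs = sc.parallelize(1 to 100).map(i => (i, i.toString))
    // saveAsTextFile has an overload that takes the codec class directly.
    pairs.saveAsTextFile("hdfs:///out/text", classOf[GzipCodec])
    // saveAsSequenceFile takes the codec as an Option.
    pairs.saveAsSequenceFile("hdfs:///out/seq", Some(classOf[GzipCodec]))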

Optimal Server Design for Spark

2014-04-02 Thread Stephen Watt
Hi Folks, I'm looking to buy some gear to run Spark. I'm quite well versed in Hadoop server design, but there does not seem to be much Spark-related collateral around infrastructure guidelines (or at least I haven't been able to find it). My current thinking for server design is something

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Is this a Scala-only feature? (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile) On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote: For textFile I believe we overload it and let you set a codec directly:

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
There is a repartition method in pyspark master: https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128 On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Update: I'm now using this ghetto function to partition the RDD I get back when I call

Re: Spark output compression on HDFS

2014-04-02 Thread Nicholas Chammas
Thanks for pointing that out. On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.com wrote: First, you shouldn't be using spark.incubator.apache.org anymore, just spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in the Python API at this point. On Wed,

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you. On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra m...@clearstorydata.com wrote: There is a repartition method in pyspark master:

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
Will be in 1.0.0 On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you. On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Philip Ogren
What I'd like is a way to capture the information provided on the stages page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark code, it doesn't seem like it is possible to directly query for specific facts such as how many tasks have succeeded or how many total tasks there

Measure the Total Network I/O, Cpu and Memory Consumed by Spark Job

2014-04-02 Thread yxzhao
Hi All, I am interested in measuring the total network I/O, CPU, and memory consumed by a Spark job. I tried to find the related information in the logs and Web UI, but there seems to be no sufficient information. Could anyone give me any suggestions? Thanks very much in advance. -- View this

Efficient way to aggregate event data at daily/weekly/monthly level

2014-04-02 Thread K Koh
Hi, I want to aggregate (time-stamped) event data at the daily, weekly and monthly level, stored in a directory in data//mm/dd/dat.gz format. For example: each dat.gz file contains tuples in (datetime, id, value) format. I can perform the aggregation as follows, but this code doesn't seem to be
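
For reference, a hedged Scala sketch of a daily roll-up, assuming comma-separated datetime,id,value lines with an ISO datetime whose first 10 characters are the day (the glob is hypothetical):

    val events = sc.textFile("data/*/*/*/dat.gz")
    val daily = events.map { line =>
      val Array(dt, id, value) = line.split(",")
      ((dt.take(10), id), value.toDouble)  // key by (day, id)
    }.reduceByKey(_ + _)

Weekly and monthly totals can be derived the same way by mapping the day key to a week or month key before reduceByKey.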

Re: Resilient nature of RDD

2014-04-02 Thread Patrick Wendell
The driver stores the meta-data associated with the partition, but the re-computation will occur on an executor. So if several partitions are lost, e.g. due to a few machines failing, the re-computation can be striped across the cluster making it fast. On Wed, Apr 2, 2014 at 11:27 AM, David
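
For reference, the lineage that drives this re-computation can be inspected from the driver (input path is hypothetical):

    val data = sc.textFile("hdfs:///input").map(_.length).filter(_ > 0)
    // toDebugString prints the chain of parent RDDs that would be
    // replayed on an executor if a partition were lost.
    println(data.toDebugString)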

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Patrick Wendell
Hey Philip, Right now there is no mechanism for this. You have to go in through the low-level listener interface. We could consider exposing the JobProgressListener directly - I think it's been factored nicely, so it's fairly decoupled from the UI. The concern is that this is a semi-internal piece of
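
For reference, a hedged sketch of the low-level listener route (a semi-internal API at this point, so signatures may shift between releases):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    class ProgressListener extends SparkListener {
      @volatile var tasksEnded = 0
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
        tasksEnded += 1  // taskEnd carries per-task detail
      }
    }
    val progress = new ProgressListener
    sc.addSparkListener(progress)
    // ... poll progress.tasksEnded while jobs run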

Re: Is there a way to get the current progress of the job?

2014-04-02 Thread Andrew Or
Hi Philip, In the upcoming release of Spark 1.0 there will be a feature that provides for exactly what you describe: capturing the information displayed on the UI in JSON. More details will be provided in the documentation, but for now, anything before 0.9.1 can only go through JobLogger.scala,

Re: Efficient way to aggregate event data at daily/weekly/monthly level

2014-04-02 Thread Nicholas Chammas
Watch out with loading data from gzipped files. Spark cannot parallelize the load of gzipped files, and if you do not explicitly repartition your RDD created from such a file, everything you do on that RDD will run on a single core. On Wed, Apr 2, 2014 at 8:22 PM, K Koh den...@gmail.com wrote:
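
For reference, a minimal sketch of the explicit repartition (path and partition count are illustrative):

    val raw = sc.textFile("data/2014/04/02/dat.gz")  // gzip is not splittable: 1 partition
    val spread = raw.repartition(16)                 // spread later stages over 16 tasks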

Re: Status of MLI?

2014-04-02 Thread Evan R. Sparks
Targeting 0.9.0 should work out of the box (just a change to the build.sbt) - I'll push some changes I've been sitting on to the public repo in the next couple of days. On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote: Thanks for the update Evan! In terms of using MLI, I
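
For reference, a hedged guess at the kind of build.sbt change meant here, assuming the Spark dependency is what needs bumping (these are the published 0.9.0 coordinates):

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"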

Re: Optimal Server Design for Spark

2014-04-02 Thread Mayur Rustagi
I would suggest starting with cloud hosting if you can; depending on your use case, memory requirements may vary a lot. Regards, Mayur On Apr 2, 2014 3:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey Steve, This configuration sounds pretty good. The one thing I would consider is having

Re: Optimal Server Design for Spark

2014-04-02 Thread Debasish Das
Hi Matei, How can I run multiple Spark workers per node? I am running an 8-core, 10-node cluster, but I do have 8 more cores on each node. So having 2 workers per node will definitely help my use case. Thanks. Deb On Wed, Apr 2, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hey
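
For reference, a sketch of the standalone-mode spark-env.sh settings that control this (values illustrative for 16-core nodes):

    export SPARK_WORKER_INSTANCES=2  # run two workers per node
    export SPARK_WORKER_CORES=8      # cores each worker may use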

How to ask questions on Spark usage?

2014-04-02 Thread weida xu
Hi, Shall I send my questions to this email address? Sorry for bothering you, and thanks a lot!

Re: How to ask questions on Spark usage?

2014-04-02 Thread Andrew Or
Yes, please do. :) On Wed, Apr 2, 2014 at 7:36 PM, weida xu xwd0...@gmail.com wrote: Hi, Shall I send my questions to this Email address? Sorry for bothering, and thanks a lot!

Spark RDD to Shark table IN MEMORY conversion

2014-04-02 Thread abhietc31
Hi, We are placing business logic on an incoming data stream using Spark Streaming. Here I want to point a Shark table at the data coming from Spark Streaming. Instead of storing the Spark stream to HDFS or another store, is there a way I can directly point a Shark in-memory table to take data from Spark

Shark Direct insert into table value (?)

2014-04-02 Thread abhietc31
Hi, I'm trying to run this script in Shark (0.8.1): insert into emp (id,name) values (212,Abhi), but it doesn't work. I urgently need direct insert, as it is a show-stopper. I know that we can do insert into emp select * from xyz; the requirement here is direct insert. Has anyone tried it? Or is there

Submitting to yarn cluster

2014-04-02 Thread Ron Gonzalez
Hi, I have a small program, but I cannot seem to make it connect to the right properties of the cluster. I have SPARK_YARN_APP_JAR, SPARK_JAR and SPARK_HOME set properly. If I run this Scala file, I am seeing that it never uses the yarn.resourcemanager.address property that I set

Error when run Spark on mesos

2014-04-02 Thread felix
I deployed Mesos and tested it using the example test-framework script; Mesos seems OK. But when running Spark on the Mesos cluster, the Mesos slave nodes report the following exception. Can anyone help me fix this? Thanks in advance: 14/04/03 11:24:39 INFO Slf4jLogger: Slf4jLogger started 14/04/03

Re: Error when run Spark on mesos

2014-04-02 Thread panfei
any advice? 2014-04-03 11:35 GMT+08:00 felix cnwe...@gmail.com: I deployed Mesos and tested it using the example test-framework script; Mesos seems OK. But when running Spark on the Mesos cluster, the Mesos slave nodes report the following exception. Can anyone help me fix this? Thanks

Re: Error when run Spark on mesos

2014-04-02 Thread Ian Ferreira
I think this is related to a known issue (a regression) in 0.9.0. Try using an explicit IP other than loopback. Sent from a mobile device On Apr 2, 2014, at 8:53 PM, panfei cnwe...@gmail.com wrote: any advice? 2014-04-03 11:35 GMT+08:00 felix cnwe...@gmail.com: I deployed Mesos and tested
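
For reference, one way to force an explicit bind address is via spark-env.sh on each node (the address below is hypothetical):

    export SPARK_LOCAL_IP=172.16.0.10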

Example of creating expressions for SchemaRDD methods

2014-04-02 Thread All In A Days Work
For various SchemaRDD functions like select, where, orderBy, groupBy, etc., I would like to create expression objects and pass these to the methods for execution. Can someone show some examples of how to create expressions for a case class and execute them? E.g., how to create expressions for select,
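
For reference, a hedged sketch against the SchemaRDD DSL in Spark master at the time (names and file are illustrative); orderBy and groupBy accept expressions built the same way:

    case class Person(name: String, age: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._  // implicit conversion from RDD[Person] to SchemaRDD
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    // Scala Symbols ('name, 'age) stand in for column expressions:
    val names = people.where('age > 21).select('name)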