1.0.0 Release Date?

2014-05-14 Thread bhusted
Can anyone comment on the anticipated date, or the worst-case timeframe, for when Spark 1.0.0 will be released?

Re: streaming on HDFS detects all new files, but the sum of all the rdd.count() calls does not equal what was detected

2014-05-14 Thread zzzzzqf12345
Thanks for the reply. I have solved the problem and found the reason: I was using the master node to upload files to HDFS, and this can take up a lot of the master's network resources. When I switched to a machine outside the cluster to upload the files, I got the correct result.

Re: Spark on Yarn - A small issue !

2014-05-14 Thread Tom Graves
You need to look at the log files for YARN. Generally this can be done with yarn logs -applicationId your_app_id; that only works if you have log aggregation enabled, though. You should be able to see at least the application master logs through the YARN ResourceManager web UI. I would try …

Re: How to run shark?

2014-05-14 Thread Sophia
My configuration is as follows; the slave nodes have been configured, but I do not know what is happening with Shark. Can you help me, Sir? shark-env.sh: export SPARK_USER_HOME=/root export SPARK_MEM=2g export SCALA_HOME=/root/scala-2.11.0-RC4 export SHARK_MASTER_MEM=1g export …

EndpointWriter: AssociationError

2014-05-14 Thread Laurent Thoulon
Hi, I've been trying to run my newly created Spark job on my local master instead of just running it with Maven, and I haven't been able to make it work. My main issue seems to be related to this error: 14/05/14 09:34:26 ERROR EndpointWriter: AssociationError …

Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
Hey Brian, We've had a fairly stable 1.0 branch for a while now. I started a vote on the dev list last night; voting can take some time, but it usually wraps up anywhere from a few days to weeks. However, you can get started right now with the release candidates. These are likely to be …

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread Xiangrui Meng
I don't know whether this would fix the problem. In v0.9, you need `yarn-standalone` instead of `yarn-cluster`. See https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08 On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng men...@gmail.com wrote: Does v0.9 support

accessing partition i+1 from mapper of partition i

2014-05-14 Thread Mohit Jaggi
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence, for example (1, 2, 3, 5, 8, 11, ...). I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11). One way to do this is to slide and zip: rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, …
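
A minimal sketch of the slide-and-zip idea, assuming Spark 1.0's RDD.zipWithIndex; the join-based pairing, object name, and local[2] master are illustrative choices, not code from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits (join, sortByKey) on older Spark

    object FillGaps {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("fill-gaps").setMaster("local[2]"))
        val rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11))

        // Pair each element with its successor by joining the indexed RDD
        // against a copy whose index is shifted by one.
        val indexed = rdd1.zipWithIndex().map { case (v, i) => (i, v) }
        val shifted = indexed.map { case (i, v) => (i - 1, v) }

        // Expand every (current, next) pair into the integers it spans.
        val filled = indexed.join(shifted)
          .sortByKey()
          .flatMap { case (_, (cur, next)) => cur until next }

        // The join drops the final element, so append it explicitly.
        val last = rdd1.reduce((a, b) => math.max(a, b))
        println((filled.collect() :+ last).mkString(","))   // 1,2,3,4,5,6,7,8,9,10,11
        sc.stop()
      }
    }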

RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
The console:12: error: not found: type Text issue is resolved by an import statement, but I am still facing issues with the imports for VectorWritable. The Mahout math jar is added to the classpath, which I can verify on the web UI as well as in the shell: scala> System.getenv res1: java.util.Map[String,String] = {TERM=xterm, …

Proper way to create standalone app with custom Spark version

2014-05-14 Thread Andrei
We can create a standalone Spark application by simply adding spark-core_2.x to build.sbt/pom.xml and connecting it to a Spark master. We can also compile a custom version of Spark (e.g. compiled against Hadoop 2.x) from source and deploy it to the cluster manually. But what is the proper way to use a _custom …
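
One possible approach (a sketch, not a verified recipe): install the custom build into a local repository and depend on its version string from the application's build. The version number and scalaVersion below are placeholders.

    // build.sbt
    name := "my-spark-app"

    scalaVersion := "2.10.4"

    // `mvn install` from the custom Spark source tree puts the artifact in ~/.m2,
    // which this resolver exposes; `sbt publish-local` goes to the local ivy repo,
    // which sbt already checks by default.
    resolvers += Resolver.mavenLocal

    // "provided" keeps the custom Spark jars out of the application assembly,
    // since the matching build is already deployed on the cluster.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT" % "provided"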

Re: logging in pyspark

2014-05-14 Thread Diana Carroll
foreach vs. map isn't the issue; both require serializing the called function, so the pickle error would still apply, yes? And at the moment, I'm just testing. I definitely wouldn't want to log something for each element, but I may want to detect something and log for SOME elements. So my question …

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread DB Tsai
Hi Xiangrui, I actually used `yarn-standalone`; sorry for the confusion. I have been debugging over the last couple of days, and everything up to updateDependency in executor.scala works. I also checked the file size and md5sum in the executors, and they are the same as on the driver. I'm going to do more testing …

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread Nicholas Chammas
Would cache() + count() every N iterations work just as well as checkpoint() + count() to get around this issue? We're basically trying to get Spark to avoid working on too lengthy a lineage at once, right? Nick On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng men...@gmail.com wrote: After …
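
For reference, a minimal sketch of the periodic-materialization pattern being discussed; the update step, the interval N, and the checkpoint directory are placeholders, not code from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageTruncation {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local[2]"))
        sc.setCheckpointDir("/tmp/spark-checkpoints")   // placeholder path

        val N = 50
        var data = sc.parallelize(1 to 1000000).map(_.toDouble)

        for (i <- 1 to 500) {
          data = data.map(_ * 1.000001)   // stand-in for the real per-iteration transformation
          if (i % N == 0) {
            data.cache()
            data.checkpoint()             // marks the RDD so its lineage is cut once materialized
            data.count()                  // forces materialization (and the checkpoint) now
          }
        }
        println(data.count())
        sc.stop()
      }
    }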

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread lalit1303
If we do cache() + count() after, say, every 50 iterations, the whole process becomes very slow. I have tried checkpoint(), cache() + count(), and saveAsObjectFiles(); nothing works. Materializing RDDs leads to a drastic decrease in performance, but if we don't materialize, we face the StackOverflowError. On …

Re: How to run shark?

2014-05-14 Thread Mayur Rustagi
Is your Spark working? Can you try running the Spark shell? http://spark.apache.org/docs/0.9.1/quick-start.html If Spark is working, we can move this to the Shark user list (copied here). Also, I am anything but a sir :) Regards, Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-14 Thread wxhsdp
Hi DB, I've added the breeze jars to the workers using sc.addJar(). The breeze jars include: breeze-natives_2.10-0.7.jar, breeze-macros_2.10-0.3.jar, breeze-macros_2.10-0.3.1.jar, breeze_2.10-0.8-SNAPSHOT.jar, breeze_2.10-0.7.jar, almost all the breeze jars I can find, but still …

Re: Packaging a spark job using maven

2014-05-14 Thread Laurent T
Hi, Thanks François, but this didn't change much. I'm not even sure what this reference.conf is; it isn't mentioned anywhere in the Spark documentation. Should I have one in my resources? Thanks Laurent

Re: Spark LIBLINEAR

2014-05-14 Thread Debasish Das
Hi Professor Lin, On our internal datasets, I am getting accuracy on par with glmnet-R for sparse feature selection from liblinear. The default MLlib-based gradient descent was way off. I did not tune the learning rate, but I ran with varying lambda; the feature selection was weak. I used liblinear …

RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
Hi Xiangrui, Thanks for the response. I tried a few ways to include the mahout-math jar while launching the Spark shell, but with no success. Can you please point out what I am doing wrong? 1. mahout-math.jar exported in CLASSPATH and PATH. 2. Tried launching the Spark shell with: MASTER=spark://HOSTNAME:PORT …
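
Once mahout-math is actually visible on both the driver and executor classpaths, a minimal read might look like the sketch below; the HDFS path is a placeholder, and using sequenceFile with explicit key/value classes is just one way to do it:

    import org.apache.hadoop.io.Text
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadMahoutVectors {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("read-mahout-vectors"))

        val vectors = sc
          .sequenceFile("hdfs:///tmp/mahout-vectors", classOf[Text], classOf[VectorWritable])
          // Hadoop reuses Writable instances, so copy the data out before caching.
          .map { case (key, vw) => (key.toString, vw.get().clone()) }

        println(vectors.count())
        sc.stop()
      }
    }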

saveAsTextFile with replication factor in HDFS

2014-05-14 Thread Sai Prasanna
Hi, Can we override the default file replication factor when using saveAsTextFile() to write to HDFS? My default replication factor is 1, but the intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way? Thanks!
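
One approach that may work (a sketch, not a verified answer): saveAsTextFile writes through the Hadoop output format, which reads dfs.replication from the job's Hadoop configuration, so overriding it on the SparkContext should apply to that output. The path and object name below are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    object LowReplicationSave {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("low-repl-save"))

        // Per-job override: files written by this job's Hadoop output format
        // get this replication factor instead of the cluster default.
        sc.hadoopConfiguration.set("dfs.replication", "1")

        sc.parallelize(1 to 100).saveAsTextFile("hdfs:///tmp/intermediate-output")
        sc.stop()
      }
    }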

Worker re-spawn and dynamic node joining

2014-05-14 Thread Han JU
Hi all, Just 2 questions: 1. Is there a way to automatically re-spawn Spark workers? We have situations where an executor OOM causes the worker process to be marked DEAD, and it does not come back automatically. 2. How can we dynamically add (or remove) worker machines to (from) the cluster? We'd like to …

Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster. From SparkContext.scala: // Regular expression for simulating a Spark cluster of [N, cores, memory] locally val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r Can you run your tests …
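
A minimal sketch of pointing a test at that mode; the worker/core/memory numbers are arbitrary, and the assumption here is that a built Spark assembly and SPARK_HOME are available, since local-cluster launches real worker JVMs:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalClusterSmokeTest {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("local-cluster-test")
          .setMaster("local-cluster[2,1,512]")   // 2 workers, 1 core and 512 MB each

        val sc = new SparkContext(conf)
        // Unlike local[*], tasks here run in separate worker JVMs, so
        // serialization problems are more likely to surface.
        assert(sc.parallelize(1 to 100, 4).map(_ * 2).reduce(_ + _) == 10100)
        sc.stop()
      }
    }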

Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster …

little confused about SPARK_JAVA_OPTS alternatives

2014-05-14 Thread Koert Kuipers
I have some settings that I think are relevant for my application. They are spark.akka settings, so I assume they are relevant for both the executors and my driver program. I used to do: SPARK_JAVA_OPTS=-Dspark.akka.frameSize=1 Now this is deprecated. The alternatives mentioned are: * some …
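
One of the alternatives is setting the property programmatically on SparkConf (or in conf/spark-defaults.conf when using spark-submit). A minimal sketch, with the frame size value and object name chosen arbitrarily:

    import org.apache.spark.{SparkConf, SparkContext}

    object ConfBasedSettings {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("conf-based-settings")
          // Properties set on the driver's SparkConf are propagated to the
          // executors when the context starts, so this covers both sides.
          .set("spark.akka.frameSize", "64")

        val sc = new SparkContext(conf)
        // ... job ...
        sc.stop()
      }
    }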

Re: Packaging a spark job using maven

2014-05-14 Thread François Le Lay
I have a similar objective of using Maven as our build tool, and I ran into the same issue. The idea is that your config file is actually not found: your fat-jar assembly does not contain the reference.conf resource. I added the following to the resources section of my pom to make it work: <resource> …

NotSerializableException in Spark Streaming

2014-05-14 Thread Diana Carroll
Hey all, I'm trying to set up a pretty simple streaming app and getting some weird behavior. First, a non-streaming job that works fine: I'm trying to pull out lines of a log file that match a regex, for which I've set up a function: def getRequestDoc(s: String): String = { …
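
For context, a hedged sketch of that kind of job; the regex, input directory, and object layout are placeholders rather than the original code. Keeping the helper in a top-level object, rather than capturing an enclosing non-serializable class, is one common way to avoid NotSerializableException in the streaming version.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RequestDocExtractor {
      // Top-level object: the closure referencing this function does not drag
      // in any enclosing, possibly non-serializable, instance.
      def getRequestDoc(s: String): String =
        """DOC-[0-9]+""".r.findFirstIn(s).getOrElse("")

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("request-doc-extractor").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5))

        val docs = ssc.textFileStream("hdfs:///tmp/weblogs")   // placeholder input directory
          .map(getRequestDoc)
          .filter(_.nonEmpty)

        docs.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }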

No configuration setting found for key 'akka.zeromq'

2014-05-14 Thread Francis . Hu
Hi all, When I run the ZeroMQWordCount example on the cluster, the worker log says: Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.zeromq' Actually, I can see that the reference.conf in spark-examples-assembly-0.9.1.jar contains the following …

Re: How to use spark-submit

2014-05-14 Thread phoenix bai
I used spark-submit to run the MovieLensALS example from the examples module. Here is the command: $spark-submit --master local /home/phoenix/spark/spark-dev/examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.mllib.MovieLensALS u.data Also, …

spark on yarn-standalone throws StackOverflowError and fails sometimes, succeeding the rest of the time

2014-05-14 Thread phoenix bai
Hi all, My Spark code is running on yarn-standalone. The last three lines of the code are as follows: val result = model.predict(prdctpairs) result.map(x => x.user + "," + x.product + "," + x.rating).saveAsTextFile(output) sc.stop() The same code is sometimes able to run successfully and gives …

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-14 Thread Gerard Maas
Hi Jacob, Thanks for the helpful answer on the Docker question. Have you already experimented with the new link feature in Docker? It does not help with the HDFS issue, as the DataNode needs the NameNode and vice versa, but it does facilitate simpler client-server interactions. My issue, described at …

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-14 Thread Siyuan he
Hi Cheney, Which mode are you running, YARN or standalone? I got the same exception when I ran Spark on YARN. On Tue, May 6, 2014 at 10:06 PM, Cheney Sun sun.che...@gmail.com wrote: Hi Nan, In the worker's log, I see the following exception thrown when trying to launch an executor. (The SPARK_HOME …

Re: Unable to load native-hadoop library problem

2014-05-14 Thread Shivani Rao
Hello Sophia, You are only providing the Spark jar here (a Spark jar that does contain Hadoop libraries, but that is not sufficient). Where is your Hadoop installed? (Most probably /usr/lib/hadoop/*.) You need to add that to your classpath (using -cp), I guess. Let me know …

Express VMs - good idea?

2014-05-14 Thread Marco Shaw
Hi, I've wanted to play with Spark and wanted to fast-track things by just using one of the vendors' express VMs. I've tried Cloudera CDH 5.0 and Hortonworks HDP 2.1. I've not written down all of my issues, but for certain, when I try to run spark-shell it doesn't work. Cloudera seems to crash, …

Re: Spark installation

2014-05-14 Thread Madhu
I've forgotten most of my French. You can download a Spark binary or build from source. This is how I build from source: Download and install sbt from http://www.scala-sbt.org/ (I installed it in C:\sbt). Check C:\sbt\conf\sbtconfig.txt and use these options: -Xmx512M -XX:MaxPermSize=256m …