Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Hi Tsai, Could you share more information about the machine you used and the training parameters (runs, k, and iterations)? That will help in diagnosing your issue. Thanks! Best, Xiangrui On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming mailingl...@ltsai.com wrote: Hi, At the reduceByKey stage, it takes

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Hi, This is on a 4-node cluster, each with 32 cores/256GB RAM. Spark (0.9.0) is deployed in standalone mode. Each worker is configured with 192GB. Spark executor memory is also 192GB. This is on the first iteration. K=50. Here's the code I use: http://pastebin.com/2yXL3y8i , which is a

spark executor/driver log files management

2014-03-24 Thread Sourav Chandra
Hi, I have a few questions regarding log file management in Spark: 1. Currently I did not find any way to modify the log file name for executors/drivers. It is hardcoded as stdout and stderr. Also there is no log rotation. In the case of a streaming application this will grow forever and become

GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sai Prasanna
Hi All !! I am getting the following error in the interactive spark-shell [0.8.1]: *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than 0 times; aborting job java.lang.OutOfMemoryError: GC overhead limit exceeded* But I had set the following in spark-env.sh and

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
K = 50 is certainly a large number for k-means. If there is no particular reason to have 50 clusters, could you try to reduce it to, e.g., 100 or 1000? Also, the example code is not for large-scale problems. You should use the KMeans algorithm in mllib clustering for your problem.

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Tsai Li Ming
Thanks, let me try with a smaller K. Does the size of the input data matter for the example? Currently I have 50M rows. What is a reasonable size to demonstrate the capability of Spark? On 24 Mar, 2014, at 3:38 pm, Xiangrui Meng men...@gmail.com wrote: K = 50 is certainly a large

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
To be clear on what your configuration will do: - SPARK_DAEMON_MEMORY=8g will make your standalone master and worker schedulers have a lot of memory. These do not impact the actual amount of useful memory given to executors or your driver, however, so you probably don't need to set this. -

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Sean Owen
PS you have a typo in DEAMON - it's DAEMON. Thanks, Latin. On Mar 24, 2014 7:25 AM, Sai Prasanna ansaiprasa...@gmail.com wrote: Hi All !! I am getting the following error in interactive spark-shell [0.8.1] *org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed more than 0 times;

Re: Java API - Serialization Issue

2014-03-24 Thread santhoma
I am also facing the same problem. I have implemented Serializable for my code, but the exception is thrown from third-party libraries over which I have no control. Exception in thread main org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: (lib

Re: java.io.NotSerializableException Of dependent Java lib.

2014-03-24 Thread santhoma
Can someone answer this question please? Specifically about the Serializable implementation of dependent jars .. ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-Of-dependent-Java-lib-tp1973p3087.html Sent from the Apache

Re: Spark worker threads waiting

2014-03-24 Thread sparrow
Yes, my input data is partitioned in a completely random manner, so each worker that produces shuffle data processes only a part of it. The way I understand it is that before each stage, each worker needs to distribute the correct partitions (based on hash key ranges?) to the other workers. And this is

Re: Spark Java example using external Jars

2014-03-24 Thread dmpour23
Hello, Has anyone got any ideas? I am not quite sure if my problem is an exact fit for Spark, since in reality in this section of my program I am not really doing a reduce job, simply a group-by and partition. Would calling pipe on the partitioned JavaRDD do the trick? Are there any examples

Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Patrick, correct. I have a 16-node cluster. On 14 machines out of 16, the inode usage was about 50%. On two of the slaves, one had inode usage of 96% and on the other it was 100%. When I went into /tmp on these two nodes, there were a bunch of /tmp/spark* subdirectories, which I deleted. This

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Oh, glad to know it's that simple! Patrick, in your last comment did you mean filter in? As in, I start with one year of data and filter it so I have one day left? I'm assuming in that case the empty partitions would be for all the days that got filtered out. Nick On Monday, March 24, 2014, Patrick

Re: java.io.NotSerializableException Of dependent Java lib.

2014-03-24 Thread yaoxin
We now have a method to work around this. For the classes that can't easily implement Serializable, we wrap the class in a Scala object. For example: class A {} // This class is not serializable object AHolder { private val a: A = new A() def get: A = a } This
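
A minimal sketch of the holder-object workaround described above, assuming a non-serializable third-party class (the names Heavy and HeavyHolder are illustrative, not from the thread):

    class Heavy {                            // stands in for a third-party class that is not Serializable
      def transform(s: String): String = s.toUpperCase
    }

    object HeavyHolder {                     // a top-level object is not shipped with the closure;
      private lazy val instance = new Heavy  // it is initialized lazily on each executor JVM
      def get: Heavy = instance
    }

    // usage inside a closure, e.g. in spark-shell with an existing SparkContext `sc`:
    // sc.parallelize(Seq("a", "b")).map(line => HeavyHolder.get.transform(line)).collect()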

mapPartitions use case

2014-03-24 Thread Jaonary Rabarisoa
Dear all, Sorry for asking such a basic question, but can someone explain when one should use mapPartitions instead of map? Thanks, Jaonary

Akka error with largish job (works fine for smaller versions)

2014-03-24 Thread Nathan Kronenfeld
What does this error mean: @hadoop-s2.oculus.local:45186]: Error [Association failed with [akka.tcp://spark@hadoop-s2.oculus.local:45186]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-s2.oculus.local:45186] Caused by:

Re: No space left on device exception

2014-03-24 Thread Ognen Duzlevski
Another thing I have noticed is that out of my master+15 slaves, two slaves always carry a higher inode load. So for example right now I am running an intensive job that takes about an hour to finish and two slaves have been showing an increase in inode consumption (they are about 10% above

Re: mapPartitions use case

2014-03-24 Thread Nathan Kronenfeld
I've seen two cases most commonly: The first is when I need to create some processing object to process each record. If that object creation is expensive, creating one per record becomes prohibitive. So instead, we use mapPartitions, create one per partition, and use it on each record in the
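
A short sketch of that first pattern, with a placeholder ExpensiveParser class and an existing SparkContext `sc` assumed:

    class ExpensiveParser {                // imagine construction here is costly
      def parse(line: String): Int = line.trim.length
    }

    val lines = sc.parallelize(Seq("foo", "bar baz", "qux"))
    val parsed = lines.mapPartitions { iter =>
      val parser = new ExpensiveParser    // built once per partition instead of once per record
      iter.map(parser.parse)              // the same instance handles every record in the partition
    }
    parsed.collect()                      // Array(3, 7, 3)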

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Sanjay Awatramani
Hi All, I found out why this problem exists. Consider the following scenario: - a DStream is created from any source. (I've checked with file and socket) - No actions are applied to this DStream - Sliding Window operation is applied to this DStream and an action is applied to the sliding window.

Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Mark, This appears to be a Scala-only feature. :( Patrick, Are we planning to add this to PySpark? Nick On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.com wrote: It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas

Re: Problem starting worker processes in standalone mode

2014-03-24 Thread Yonathan Perez
Oh, I also forgot to mention: I start the master and workers (call ./sbin/start-all.sh), and then start the shell: MASTER=spark://localhost:7077 ./bin/spark-shell Then I get the exceptions... Thanks -- View this message in context:

remove duplicates

2014-03-24 Thread Adrian Mocanu
I have a DStream like this: ..RDD[a,b],RDD[b,c].. Is there a way to remove duplicates across the entire DStream? I.e., I would like the output to be (by removing one of the b's): ..RDD[a],RDD[b,c].. or ..RDD[a,b],RDD[c].. Thanks -Adrian

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-24 Thread Debasish Das
After a long time (maybe a month) I could do a fresh build for 2.0.0-mr1-cdh4.5.0... I was using the cached files in .ivy2/cache. My case is especially painful since I have to build behind a firewall... @Sean thanks for the fix... I think we should put a test for https/firewall compilation as

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Number of rows doesn't matter much as long as you have enough workers to distribute the work. K-means has complexity O(n * d * k), where n is number of points, d is the dimension, and k is the number of clusters. If you use the KMeans implementation from MLlib, the initialization stage is done on
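
For reference, a hedged sketch of calling the MLlib implementation instead of the example code (the input path and parameters are placeholders; this assumes the 0.9-era API, where points are an RDD[Array[Double]] rather than the later RDD[Vector]):

    import org.apache.spark.mllib.clustering.KMeans

    // parse whitespace-separated doubles into dense points
    val points = sc.textFile("hdfs:///path/to/points.txt")
      .map(_.split(' ').map(_.toDouble))
      .cache()

    val model = KMeans.train(points, 1000, 20)   // k = 1000, 20 iterations
    println(model.clusterCenters.length)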

distinct in data frame in spark

2014-03-24 Thread Chengi Liu
Hi, I have a very simple use case: I have an RDD as follows: d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]] Now, I want to remove all the duplicates from a column and return the remaining frame. For example, if I want to remove the duplicates based on column 1, then basically I would remove either row

Re: GC overhead limit exceeded in Spark-interactive shell

2014-03-24 Thread Aaron Davidson
1. Not sure on this; I don't believe we change the defaults from Java. 2. SPARK_JAVA_OPTS can be used to set the various Java properties (other than memory heap size itself). 3. If you want to have 8 GB executors then, yes, only two can run on each 16 GB node. (In fact, you should also keep a

Comparing GraphX and GraphLab

2014-03-24 Thread Niko Stahl
Hello, I'm interested in extending the comparison between GraphX and GraphLab presented in Xin et al. (2013). The evaluation presented there is rather limited as it only compares the frameworks for one algorithm (PageRank) on a cluster with a fixed number of nodes. Are there any graph algorithms

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Debasish Das
Niko, Comparing some other components will be very useful as well... svd++ from GraphX vs the same algorithm in GraphLab... also mllib.recommendation.als implicit/explicit compared to the collaborative filtering toolkit in GraphLab... To stress test, what's the biggest sparse dataset that you

Re: Transitive dependency incompatibility

2014-03-24 Thread Jaka Jančar
Just found this issue and wanted to link it here, in case somebody finds this thread later: https://spark-project.atlassian.net/browse/SPARK-939 On Thursday, March 20, 2014 at 11:14 AM, Matei Zaharia wrote: Hi Jaka, I’d recommend rebuilding Spark with a new version of the HTTPClient

quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Has anyone successfully followed the instructions on the Quick Start page of the Spark home page to run a standalone Scala application? I can't, and I figure I must be missing something obvious! I'm trying to follow the instructions here as close to word for word as possible:

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana, See my inlined answer -- Nan Zhu On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote: Has anyone successfully followed the instructions on the Quick Start page of the Spark home page to run a standalone Scala application? I can't, and I figure I must be missing

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
I am able to run standalone apps. I think you are making one mistake that throws you off from there onwards. You don't need to put your app under SPARK_HOME. I would create it in its own folder somewhere; it follows the rules of any standalone Scala program (including the layout). In the guide,

Re: Comparing GraphX and GraphLab

2014-03-24 Thread Ankur Dave
Hi Niko, The GraphX team recently wrote a longer paper with more benchmarks and optimizations: http://arxiv.org/abs/1402.2394 Regarding the performance of GraphX vs. GraphLab, I believe GraphX currently outperforms GraphLab only in end-to-end benchmarks of pipelines involving both graph-parallel

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Yana: Thanks. Can you give me a transcript of the actual commands you are running? Thanks! Diana On Mon, Mar 24, 2014 at 3:59 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: I am able to run standalone apps. I think you are making one mistake that throws you off from there onwards. You

[bug?] streaming window unexpected behaviour

2014-03-24 Thread Adrian Mocanu
I have what I would call unexpected behaviour when using window on a stream. I have 2 windowed streams with a 5s batch interval. One windowed stream is (5s,5s)=smallWindow and the other (10s,5s)=bigWindow. What I've noticed is that the 1st RDD produced by bigWindow is incorrect and is of the size

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
Diana, Anywhere on the filesystem you have read/write access (you need not be in your spark home directory): mkdir myproject cd myproject mkdir project mkdir target mkdir -p src/main/scala cp $mypath/$mymysource.scala src/main/scala/ cp $mypath/myproject.sbt . Make sure that myproject.sbt
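
For reference, a minimal myproject.sbt along the lines of the Quick Start of that era (the version numbers below are assumptions; match them to your own Spark and Scala install):

    name := "My Project"

    version := "1.0"

    scalaVersion := "2.10.3"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

    resolvers += "Akka Repository" at "http://repo.akka.io/releases/"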

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks, Nan Zhu. You say that my problems are because "you are in Spark directory, don't need to do that actually, the dependency on Spark is resolved by sbt". I did try it initially in what I thought was a much more typical place, e.g. ~/mywork/sparktest1. But as I said in my email: (Just for

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks, Ognen. Unfortunately I'm not able to follow your instructions either. In particular: sbt compile sbt run <arguments if any> This doesn't work for me because there's no program on my path called sbt. The instructions in the Quick Start guide are specific that I should call

Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman
There is no direct way to get this in pyspark, but you can get it from the underlying java rdd. For example a = sc.parallelize([1,2,3,4], 2) a._jrdd.splits().size() On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Mark, This appears to be a Scala-only

question about partitions

2014-03-24 Thread Walrus theCat
Hi, Quick question about partitions. If my RDD is partitioned into 5 partitions, does that mean that I am constraining it to exist on at most 5 machines? Thanks

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Yeah, that's exactly what I did. Unfortunately it doesn't work: $SPARK_HOME/sbt/sbt package awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for reading (No such file or directory) Attempting to fetch sbt /usr/lib/spark/sbt/sbt: line 33: sbt/sbt-launch-.jar: No such file or

Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah we should just add this directly in pyspark - it's as simple as the code Shivaram just wrote. - Patrick On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman shivaram.venkatara...@gmail.com wrote: There is no direct way to get this in pyspark, but you can get it from the underlying java

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Ognen Duzlevski
Ah crud, I guess you are right, I am using the sbt I installed manually with my Scala installation. Well, here is what you can do: mkdir ~/bin cd ~/bin wget http://repo.typesafe.com/typesafe/ivy-releases/org.scala-sbt/sbt-launch/0.13.1/sbt-launch.jar vi sbt Put the following contents into

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Hi, Diana, You don't need to use the Spark-distributed sbt; just download sbt from its official website and set your PATH to the right place. Best, -- Nan Zhu On Monday, March 24, 2014 at 4:30 PM, Diana Carroll wrote: Yeah, that's exactly what I did. Unfortunately it doesn't work:

Re: question about partitions

2014-03-24 Thread Walrus theCat
For instance, I need to work with an RDD in terms of N parts. Will calling RDD.coalesce(N) possibly cause processing bottlenecks? On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat walrusthe...@gmail.com wrote: Hi, Quick question about partitions. If my RDD is partitioned into 5 partitions,

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Thanks for your help, everyone. Several folks have explained that I can surely solve the problem by installing sbt. But I'm trying to get the instructions working *as written on the Spark website*. The instructions not only don't have you install sbt separately...they actually specifically have

Re: question about partitions

2014-03-24 Thread Syed A. Hashmi
RDD.coalesce should be fine for rebalancing data across all RDD partitions. Coalesce is pretty handy in situations where you have sparse data and want to compact it (e.g. data after applying a strict filter) OR you know the magic number of partitions according to your cluster which will be
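
A quick sketch of the two modes (numbers are arbitrary, spark-shell with an existing `sc` assumed):

    val data   = sc.parallelize(1 to 1000, 100)
    val sparse = data.filter(_ % 97 == 0)                  // leaves most of the 100 partitions nearly empty

    val compacted  = sparse.coalesce(10)                   // narrow dependency, no shuffle
    val rebalanced = sparse.coalesce(10, shuffle = true)   // shuffles, so data is spread more evenly

    println(compacted.partitions.size)                     // 10
    println(rebalanced.partitions.size)                    // 10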

Re: Writing RDDs to HDFS

2014-03-24 Thread Diana Carroll
Ognen: I don't know why your process is hanging, sorry. But I do know that the way saveAsTextFile works is that you give it a path to a directory, not a file. The file is saved in multiple parts, corresponding to the partitions (part-0, part-1, etc.). (Presumably it does this because it
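
A tiny illustration of that behaviour (the HDFS path is a placeholder):

    val rdd = sc.parallelize(1 to 100, 4)
    rdd.saveAsTextFile("hdfs:///user/me/output")
    // the "file" is really a directory: output/part-00000 ... output/part-00003 (plus _SUCCESS)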

Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Diana, thanks. I am not very well acquainted with HDFS. I use hdfs -put to put things as files into the filesystem (and sc.textFile to get stuff out of them in Spark) and I see that they appear to be saved as files that are replicated across 3 out of the 16 nodes in the hdfs cluster (which is

Re: error loading large files in PySpark 0.9.0

2014-03-24 Thread Jeremy Freeman
Thanks Matei, unfortunately doesn't seem to fix it. I tried batchSize = 10, 100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls at the same point in each case. -- Jeremy - jeremy freeman, phd neuroscientist @thefreemanlab On Mar 23, 2014, at 9:56

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Tathagata Das
Hello Sanjay, Yes, your understanding of lazy semantics is correct. But ideally every batch should read based on the batch interval provided in the StreamingContext. Can you open a JIRA on this? On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani sanjay_a...@yahoo.com wrote: Hi All, I found

Re: [bug?] streaming window unexpected behaviour

2014-03-24 Thread Tathagata Das
Yes, I believe that is the current behavior. Essentially, the first few RDDs will be partial windows (assuming window duration > sliding interval). TD On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I have what I would call unexpected behaviour when using window on a
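
A minimal sketch of the setup being discussed, assuming a 5s batch interval and a placeholder socket source; the first RDD emitted by bigWindow can only cover one 5s batch, i.e. a partial window:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    val smallWindow = lines.window(Seconds(5), Seconds(5))
    val bigWindow   = lines.window(Seconds(10), Seconds(5))   // 10s window, 5s slide

    bigWindow.count().print()
    ssc.start()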

Re: question about partitions

2014-03-24 Thread Walrus theCat
Syed, Thanks for the tip. I'm not sure if coalesce is doing what I'm intending to do, which is, in effect, to subdivide the RDD into N parts (by calling coalesce and doing operations on the partitions.) It sounds like, however, this won't bottleneck my processing power. If this sets off any

Re: Writing RDDs to HDFS

2014-03-24 Thread Ognen Duzlevski
Just so I can close this thread (in case anyone else runs into this stuff) - I did sleep through the basics of Spark ;). The answer on why my job is in waiting state (hanging) is here: http://spark.incubator.apache.org/docs/latest/spark-standalone.html#resource-scheduling Ognen On 3/24/14,

Re: combining operations elegantly

2014-03-24 Thread Richard Siebeling
Hi guys, thanks for the information, I'll give it a try with Algebird, thanks again, Richard @Patrick, thanks for the release calendar On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, I think the old thread is here:

Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
Hi, I have a large data set of numbers, i.e. an RDD, and want to perform a computation only on groups of two values at a time. For example, 1,2,3,4,5,6,7... is an RDD. Can I group the RDD into (1,2),(3,4),(5,6)...? And perform the respective computations in an efficient manner? As we don't have a way to index

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Yana Kadiyska
Diana, I think you are correct - I just installed spark-0.9.0-incubating-bin-cdh4 (wget http://mirror.symnds.com/software/Apache/incubator/spark/spark-0.9.0-incubating/spark-0.9.0-incubating-bin-cdh4.tgz) and indeed I see the same error that you see. It looks like in previous versions sbt-launch used to just come down in the

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
I realize I never read the document carefully, and I can't find anywhere that the Spark documentation suggests using the Spark-distributed sbt…… Best, -- Nan Zhu On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote: Thanks for your help, everyone. Several folks have explained that I can

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread yh18190
We need someone who can explain, with a short code snippet on the given example, so that we get a clear-cut idea of RDD indexing. Please help us. -- View this message in context:

coalescing RDD into equally sized partitions

2014-03-24 Thread Walrus theCat
Hi, sc.parallelize(Array.tabulate(100)(i => i)).filter( _ % 20 == 0 ).coalesce(5,true).glom.collect yields Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array()) How do I get something more like: Array(Array(0), Array(20), Array(40), Array(60), Array(80))
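
One way to get that second layout (not something suggested in the thread) is to key the elements and supply a custom Partitioner; StepPartitioner and the step value 20 below are made up for this particular data:

    import org.apache.spark.Partitioner

    class StepPartitioner(parts: Int, step: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = (key.asInstanceOf[Int] / step) % parts
    }

    val data   = sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)
    val evened = data.map(x => (x, x)).partitionBy(new StepPartitioner(5, 20)).map(_._1)
    evened.glom.collect   // Array(Array(0), Array(20), Array(40), Array(60), Array(80))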

Re: N-Fold validation and RDD partitions

2014-03-24 Thread Holden Karau
There is also https://github.com/apache/spark/pull/18 against the current repo which may be easier to apply. On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh a...@adatao.com wrote: Hi Jaonary, You can find the code for k-fold CV in https://github.com/apache/incubator-spark/pull/448. I have

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
It is suggested implicitly by giving you the command ./sbt/sbt. The separately installed sbt isn't in a folder called sbt, whereas Spark's version is. And more relevantly, just a few paragraphs earlier in the tutorial you execute the command sbt/sbt assembly, which definitely refers to the Spark

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
Partition your input into an even number of partitions, then use mapPartitions to operate on the Iterator[Int]. Maybe there are more efficient ways…. Best, -- Nan Zhu On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote: Hi, I have a large data set of numbers, i.e. an RDD, and wanted to perform a

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Nan Zhu
Yes, actually even for Spark I mostly use the sbt I installed…..so I always miss this issue…. If you can reproduce the problem with the Spark-distributed sbt, I suggest proposing a PR to fix the document before 0.9.1 is officially released. Best, -- Nan Zhu On Monday, March 24, 2014

Re: Writing RDDs to HDFS

2014-03-24 Thread Yana Kadiyska
Ognen, can you comment if you were actually able to run two jobs concurrently with just restricting spark.cores.max? I run Shark on the same cluster and was not able to see a standalone job get in (since Shark is a long running job) until I restricted both spark.cores.max _and_

Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Nan Zhu
I didn't group the integers, but process them in groups of two. Partition the data: scala> val a = sc.parallelize(List(1, 2, 3, 4), 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12 Then process each partition, handling the elements within the partition in groups
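
Filling in the rest of that idea as a sketch (the sum is just a stand-in for the real per-pair computation):

    val a = sc.parallelize(List(1, 2, 3, 4, 5, 6), 3)

    val pairResults = a.mapPartitions { iter =>
      iter.grouped(2).map(pair => pair.sum)   // process elements two at a time within each partition
    }
    pairResults.collect()   // Array(3, 7, 11) when every partition holds a whole number of pairs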

Re: RDD usage

2014-03-24 Thread hequn cheng
points.foreach(p => p.y = another_value) will return a new modified RDD. 2014-03-24 18:13 GMT+08:00 Chieh-Yen r01944...@csie.ntu.edu.tw: Dear all, I have a question about the usage of RDD. I implemented a class called AppDataPoint; it looks like: case class AppDataPoint(input_y : Double,

Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, a related question: does that mean the memory usage will be doubled? And furthermore, if the compute function in an RDD is not idempotent, the RDD will change while the job is running, is that right? -Original Message- From: hequn cheng chenghe...@gmail.com Sent: 2014/3/25 9:35 To:

Re: RDD usage

2014-03-24 Thread Mark Hamstra
No, it won't. The return type of RDD#foreach is Unit, so it doesn't return an RDD. The utility of foreach is purely in the side effects it generates, not in its return value, and modifying an RDD in place via foreach is generally not a very good idea. On Mon, Mar 24, 2014 at 6:35 PM, hequn cheng
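
A minimal sketch of the transformation-based alternative implied here: build a new RDD with map/copy rather than mutating in place (the AppDataPoint fields are placeholders, since the original class isn't shown in full):

    case class AppDataPoint(y: Double, label: Double)   // fields are stand-ins for the original class

    val points       = sc.parallelize(Seq(AppDataPoint(2.0, 0.0), AppDataPoint(4.0, 1.0)))
    val anotherValue = 9.9

    val updated = points.map(p => p.copy(y = anotherValue))   // a new RDD; `points` is untouched
    updated.collect()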

Re: Kmeans example reduceByKey slow

2014-03-24 Thread Xiangrui Meng
Sorry, I meant the master branch of https://github.com/apache/spark. -Xiangrui On Mon, Mar 24, 2014 at 6:27 PM, Tsai Li Ming mailingl...@ltsai.com wrote: Thanks again. If you use the KMeans implementation from MLlib, the initialization stage is done on master, The master here is the

Re: RDD usage

2014-03-24 Thread hequn cheng
First question: If you save your modified RDD like this: points.foreach(p => p.y = another_value).collect() or points.foreach(p => p.y = another_value).saveAsTextFile(...), the modified RDD will be materialized and this will not use any worker's memory. If you have more transformations after the map(), the

Re: RDD usage

2014-03-24 Thread 林武康
Hi hequn, I dug into the source of Spark a bit deeper, and I got some ideas. Firstly, immutability is a feature of RDDs but not a solid rule; there are ways to change it, for example an RDD with a non-idempotent compute function, though it is really a bad design to make that function non-idempotent