Hi, I am running the k-means algorithm with the initialization mode set to "random", with various dataset sizes and numbers of clusters, and I have a question regarding the takeSample jobs of the algorithm. More specifically, I notice that in every application there are two sampling jobs. The first one consumes the most time compared to all the others, while the second one is much quicker, which sparked my interest to investigate what is actually happening. To understand it, I checked the source code of the takeSample operation and saw that there is a count action involved, followed by the computation of a PartitionwiseSampledRDD with a PoissonSampler. So my question is whether that count action corresponds to the first takeSample job, and whether the second takeSample job is the one doing the actual sampling.
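To make the question concrete, here is a minimal plain-Python sketch of the two-pass structure I believe I am seeing (this is my own illustrative code, not Spark's; the function name and the oversampling factor are assumptions on my part):

```python
import random

def take_sample(data, num, seed=42, oversample=1.1):
    """Toy two-pass sample-without-replacement, mimicking the structure of
    RDD.takeSample: one pass to count, one pass to sample."""
    # Pass 1: count the dataset. In Spark this is the count action, and it
    # presumably shows up as the first (expensive) job, since it may force
    # computation of the whole lineage.
    total = len(data)
    if num >= total:
        return list(data)
    # Oversample the fraction so the second pass usually returns enough items.
    fraction = min(1.0, oversample * num / total)
    rng = random.Random(seed)
    # Pass 2: Bernoulli/Poisson-style sampling, where each element is kept
    # independently with probability `fraction` -- loosely mirroring the
    # PartitionwiseSampledRDD + PoissonSampler computation. This would be
    # the second (cheap) job.
    sample = [x for x in data if rng.random() < fraction]
    rng.shuffle(sample)
    # Note: real takeSample retries with a larger fraction if the sample
    # came up short; this sketch simply truncates.
    return sample[:num]
```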
I also have a question about the RDDs that are created for k-means. In the middle of the execution, under the Storage tab of the web UI, I can see 3 RDDs with their partitions cached in memory across all nodes, which is very helpful for monitoring purposes. The problem is that after completion I can only see one of them, along with the portion of cache memory it used, and I would like to ask why the web UI doesn't display all the RDDs involved in the computation. Thank you

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
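My working guess is that the Storage tab only lists RDDs that are persisted at that moment, and that the k-means implementation unpersists its internal RDDs once training finishes, leaving only the one cached from user code. A toy sketch of that lifecycle (not Spark code; the RDD names and the idea that exactly these intermediates are cached are assumptions):

```python
class FakeRDD:
    """Toy stand-in for an RDD. The class-level registry plays the role of
    the web UI's Storage tab, which lists only currently persisted RDDs."""
    storage_tab = {}  # id -> name of currently cached RDDs

    def __init__(self, name):
        self.name = name

    def persist(self):
        FakeRDD.storage_tab[id(self)] = self.name
        return self

    def unpersist(self):
        FakeRDD.storage_tab.pop(id(self), None)
        return self

# During training, intermediate RDDs are cached alongside the input
# (hypothetical names), so three entries appear in the "Storage tab":
data = FakeRDD("input data").persist()
norms = FakeRDD("norms").persist()
zipped = FakeRDD("zipped data+norms").persist()
assert len(FakeRDD.storage_tab) == 3

# If the library unpersists its internal RDDs when training finishes,
# only the RDD cached by user code remains listed afterwards:
norms.unpersist()
zipped.unpersist()
assert list(FakeRDD.storage_tab.values()) == ["input data"]
```

If this guess is wrong, I would appreciate a pointer to what actually determines which RDDs stay visible after the job completes.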