Hi, I am running the k-means algorithm with the initialization mode set to "random", with various dataset sizes and numbers of clusters, and I have a question regarding the takeSample jobs of the algorithm. More specifically, I notice that in every application there are two sampling jobs. The first one consumes the most time compared to all the others, while the second one is much quicker, which sparked my interest to investigate what is actually happening. To understand it, I checked the source code of the takeSample operation and saw that there is a count action involved, followed by the computation of a PartitionwiseSampledRDD with a PoissonSampler. So my question is whether that count action corresponds to the first takeSample job, and whether the second takeSample job is the one doing the actual sampling.
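To make the question concrete, here is a minimal plain-Python sketch of the two-pass structure I believe I am seeing (this is my own illustrative code, not Spark's; the function name and the oversampling factor are assumptions on my part):

```python
import random

def take_sample(data, num, seed=42, oversample=1.1):
    """Toy two-pass sample-without-replacement, mimicking the structure of
    RDD.takeSample: one pass to count, one pass to sample."""
    # Pass 1: count the dataset. In Spark this is the count action, and it
    # presumably shows up as the first (expensive) job, since it may force
    # computation of the whole lineage.
    total = len(data)
    if num >= total:
        return list(data)
    # Oversample the fraction so the second pass usually returns enough items.
    fraction = min(1.0, oversample * num / total)
    rng = random.Random(seed)
    # Pass 2: Bernoulli/Poisson-style sampling, where each element is kept
    # independently with probability `fraction` -- loosely mirroring the
    # PartitionwiseSampledRDD + PoissonSampler computation. This would be
    # the second (cheap) job.
    sample = [x for x in data if rng.random() < fraction]
    rng.shuffle(sample)
    # Note: real takeSample retries with a larger fraction if the sample
    # came up short; this sketch simply truncates.
    return sample[:num]
```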
I also have a question about the RDDs that are created for k-means. In the middle of the execution, under the Storage tab of the web UI, I can see 3 RDDs with their partitions cached in memory across all nodes, which is very helpful for monitoring purposes. The problem is that after completion I can only see one of them, along with the portion of cache memory it used, and I would like to ask why the web UI doesn't display all the RDDs involved in the computation. Thank you

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
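My working guess is that the Storage tab only lists RDDs that are persisted at that moment, and that the k-means implementation unpersists its internal RDDs once training finishes, leaving only the one cached from user code. A toy sketch of that lifecycle (not Spark code; the RDD names and the idea that exactly these intermediates are cached are assumptions):

```python
class FakeRDD:
    """Toy stand-in for an RDD. The class-level registry plays the role of
    the web UI's Storage tab, which lists only currently persisted RDDs."""
    storage_tab = {}  # id -> name of currently cached RDDs

    def __init__(self, name):
        self.name = name

    def persist(self):
        FakeRDD.storage_tab[id(self)] = self.name
        return self

    def unpersist(self):
        FakeRDD.storage_tab.pop(id(self), None)
        return self

# During training, intermediate RDDs are cached alongside the input
# (hypothetical names), so three entries appear in the "Storage tab":
data = FakeRDD("input data").persist()
norms = FakeRDD("norms").persist()
zipped = FakeRDD("zipped data+norms").persist()
assert len(FakeRDD.storage_tab) == 3

# If the library unpersists its internal RDDs when training finishes,
# only the RDD cached by user code remains listed afterwards:
norms.unpersist()
zipped.unpersist()
assert list(FakeRDD.storage_tab.values()) == ["input data"]
```

If this guess is wrong, I would appreciate a pointer to what actually determines which RDDs stay visible after the job completes.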