It is nondeterministic because tasks are not always scheduled to the same nodes -- so I don't think you can make this deterministic in general.

If you assume no failures and that tasks take a while to run (i.e. they run slower than the scheduler can schedule them), then I think you can make it deterministic by setting spark.locality.wait to a very high value and coalescing everything into just N partitions, where N = the number of machines.
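A rough sketch of that approach (not tested; rdd and numNodes below are stand-ins for your actual RDD and machine count, and the master would come from spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: keep spark.locality.wait very high so the N coalesced
    // partitions stay on their N nodes, then take the first element of each.
    val conf = new SparkConf()
      .setAppName("one-element-per-node")
      .set("spark.locality.wait", "1000000s")    // effectively "wait forever" for locality
    val sc = new SparkContext(conf)              // master supplied by spark-submit

    val numNodes = 4                             // assumption: known number of machines
    val rdd = sc.parallelize(1 to 1000000, 64)   // stand-in for the real RDD

    val onePerNode = rdd
      .coalesce(numNodes)                        // one partition per machine, ideally
      .mapPartitions { it =>                     // first element of each partition, if any
        if (it.hasNext) Iterator(it.next()) else Iterator.empty
      }
    println(onePerNode.collect().mkString(", ")) // at most numNodes elements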
On Fri, Sep 18, 2015 at 5:53 PM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

> Sounds interesting! Is it possible to make it deterministic by using a
> global long value and getting the element on a partition only if
> someFunction(partitionId, globalLong) == true? Or by using a specific
> partitioner that creates partitionIds that can be decomposed into a
> nodeId and the number of partitions per node?
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Friday, September 18, 2015 4:37 PM
> *To:* Ulanov, Alexander
> *Cc:* Feynman Liang; dev@spark.apache.org
> *Subject:* Re: One element per node
>
> Use a global atomic boolean and return nothing from that partition if the
> boolean is true.
>
> Note that your result won't be deterministic.
>
> On Sep 18, 2015, at 4:11 PM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>
> Thank you! How can I guarantee that I have only one element per executor
> (per worker, or per physical node)?
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Friday, September 18, 2015 4:06 PM
> *To:* Ulanov, Alexander
> *Cc:* dev@spark.apache.org
> *Subject:* Re: One element per node
>
> rdd.mapPartitions(x => Iterator(x.next()))
>
> On Fri, Sep 18, 2015 at 3:57 PM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:
>
> Dear Spark developers,
>
> Is it possible (and how to do it if possible) to pick one element per
> physical node from an RDD? Let’s say the first element of any partition on
> that node. The result would be an RDD[element]; the count of elements would
> equal the number of nodes that have partitions of the initial RDD.
>
> Best regards, Alexander
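And for completeness, a rough sketch of the global atomic boolean idea quoted above (one flag per executor JVM, using the same stand-in rdd; as noted, the result is still nondeterministic):

    import java.util.concurrent.atomic.AtomicBoolean

    // One flag per executor JVM: the object is initialized lazily on each
    // executor the first time a task there references it.
    object PerExecutorFlag {
      val taken = new AtomicBoolean(false)
    }

    // The first partition processed on each executor emits its first element;
    // every later partition on that executor emits nothing. Which partition
    // "wins" depends on scheduling, so the result is not deterministic.
    val onePerExecutor = rdd.mapPartitions { it =>
      if (it.hasNext && PerExecutorFlag.taken.compareAndSet(false, true))
        Iterator(it.next())
      else
        Iterator.empty
    }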