[
https://issues.apache.org/jira/browse/SPARK-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-954.
-----------------------------
Resolution: Won't Fix
From the discussion, and later ones about guarantees of determinism in RDDs,
it sounds like this is working as intended.
> One repeated sampling, and I am not sure if it is correct.
> ----------------------------------------------------------
>
> Key: SPARK-954
> URL: https://issues.apache.org/jira/browse/SPARK-954
> Project: Spark
> Issue Type: Story
> Affects Versions: 0.7.3
> Reporter: caizhua
>
> This piece of code reads the dataset and then performs two operations on
> it. If I consider the RDD as a view definition, I think the result is
> correct. However, since the first iteration runs result_sample.count(), I
> was wondering whether the computation in the
> initialize_doc_topic_word_count(.) function should be repeated when we run
> the second result_sample.map(lambda (block_id, doc_prob): doc_prob).count().
> Since people write Spark code as a program, not as a database view, this is
> sometimes confusing. For example, given that
> initialize_doc_topic_word_count(.) is a statistical function with runtime
> seeds, I am not sure whether this has an impact on the result.
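
To make the "view definition" reading concrete, here is a minimal sketch in plain Python. It is not Spark code: LazyDataset, initialize, and the four-element input are hypothetical stand-ins, and the class only imitates the way an uncached RDD re-runs its lineage on every action. In real PySpark the usual way to ask for reuse is RDD.cache() or RDD.persist(), though that is a request to keep the data rather than a hard guarantee.

```python
import random


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical): it stores a recipe, not results."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces the rows
        self._cached = None        # filled on the first action after cache()
        self._use_cache = False

    def map(self, fn):
        # Build a new recipe on top of this one; nothing runs yet.
        return LazyDataset(lambda: [fn(row) for row in self.collect()])

    def cache(self):
        # Ask for the rows to be kept after the first time they are computed.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()     # uncached: recompute from scratch every time

    def count(self):
        return len(self.collect())


calls = {"n": 0}


def initialize(block_id):
    # Stand-in for a randomized initializer such as the one in the report.
    calls["n"] += 1
    return (block_id, random.random())


source = LazyDataset(lambda: list(range(4)))

# Without cache(): each action re-runs initialize on every row.
result_sample = source.map(initialize)
result_sample.count()
result_sample.map(lambda pair: pair[1]).count()
uncached_calls = calls["n"]    # 8: four rows, computed twice

# With cache(): the second action reuses the rows from the first.
calls["n"] = 0
cached_sample = source.map(initialize).cache()
cached_sample.count()
cached_sample.map(lambda pair: pair[1]).count()
cached_calls = calls["n"]      # 4: four rows, computed once
```

In the uncached case the two actions see different random draws, which is the behavior the report asks about. In the cached case both actions see the same rows.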