[ https://issues.apache.org/jira/browse/SPARK-8565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598154#comment-14598154 ]
Sean Owen commented on SPARK-8565:
----------------------------------
Well, it means that every time you go to evaluate your ES (Elasticsearch) source,
you have more records. So if you count() it, then do some transform and count()
that, you're actually re-evaluating the source twice and getting different numbers
of records from the source each time. To prove/disprove, cache() the source RDD
and then try this.
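The behavior described above can be illustrated without Spark at all. The sketch below (plain Python, not Spark API code; the `GrowingSource` class is a made-up stand-in) models a lazily re-scanned source that gains records between reads, the way a live Elasticsearch index can while a job re-evaluates its source RDD. Materializing the scan once plays the role of cache():

```python
# Hypothetical illustration of lazy re-evaluation, not Spark code.
class GrowingSource:
    """Each scan returns more records, like an index being written to."""
    def __init__(self):
        self._data = list(range(1200))

    def scan(self):
        # Simulate new documents arriving between evaluations.
        start = len(self._data)
        self._data.extend(range(start, start + 9))
        return list(self._data)

source = GrowingSource()

# Without caching: every action re-scans the source, so counts differ.
n1 = len(source.scan())                    # first "count()"
n2 = len([x * 2 for x in source.scan()])   # transform + second "count()"
assert n1 != n2                            # e.g. 1209 vs 1218

# With "caching": materialize once, then both counts agree.
cached = source.scan()
assert len(cached) == len([x * 2 for x in cached])
```

In Spark terms, cache() pins the records produced by the first evaluation, so every subsequent action (count(), zip(), a transform) sees the same snapshot rather than re-reading the mutating source.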
> TF-IDF drops records
> --------------------
>
> Key: SPARK-8565
> URL: https://issues.apache.org/jira/browse/SPARK-8565
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.3.1
> Reporter: PJ Van Aeken
>
> When applying TF-IDF on an RDD[Seq[String]] with 1213 records, I get an
> RDD[Vector] back with only 1204 records. This prevents me from zipping it
> with the original so I can reattach the document ids.