GitHub user MLnick commented on the pull request:

    https://github.com/apache/spark/pull/10152#issuecomment-163587959
  
    Ok, so after some more digging into this, the original Google code, and the mailing list, it turns out I was incorrect in my original read of the Google impl - sorry for the confusion. It does in fact treat newlines (`\n`) as sentence boundaries, and additionally caps each sentence at `MAX_SENTENCE_LENGTH` = 1000 words.
    
    This happens at L81-84, where `\n` is mapped to the special sentence-delimiter token `</s>`, which is kept at position `0` in the vocab (see the comment at L148); encountering that token in turn breaks out of the sentence-construction loop at L393.
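    To make the C behaviour concrete, here is the equivalent sentence-construction logic sketched in Scala (illustrative only, not the actual `word2vec.c` code; the `</s>` index of `0` and the cap of 1000 are per the above):

```scala
// Illustrative sketch of the C training loop's sentence construction:
// a sentence ends either at the `</s>` token (vocab index 0, emitted
// for each `\n` in the input) or after MAX_SENTENCE_LENGTH words.
val MAX_SENTENCE_LENGTH = 1000

def nextSentence(wordIndexes: Iterator[Int]): Array[Int] = {
  val sentence = scala.collection.mutable.ArrayBuffer.empty[Int]
  var done = false
  while (!done && wordIndexes.hasNext && sentence.length < MAX_SENTENCE_LENGTH) {
    val w = wordIndexes.next()
    if (w == 0) done = true // hit `</s>` (a newline): end the sentence
    else sentence += w
  }
  sentence.toArray
}
```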
    
    See also these two Google Groups posts, which make it clearer that sentences are newline-delimited - 
https://groups.google.com/forum/#!searchin/word2vec-toolkit/line$20braks/word2vec-toolkit/2elkT3cOMqo/DL_CsF1p8H8J
 and https://groups.google.com/forum/#!topic/word2vec-toolkit/3LAooMdrCl0
    
    Given this, our current implementation is in fact incorrect: using `words` (the flattened word stream) instead of `dataset` (one element per input sentence) in the `mapPartitions` block discards the caller's sentence boundaries, so it is a bug.
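    A minimal sketch of the corrected sentence construction, assuming `dataset: RDD[Iterable[String]]` and a broadcast vocab hash `bcVocabHash` (the names and the helper itself are illustrative, not necessarily what the patch will look like):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Sketch: build training sentences from `dataset` (one element per
// user-supplied sentence) rather than the flattened `words` stream,
// so sentence boundaries are respected.
def buildSentences(
    dataset: RDD[Iterable[String]],
    bcVocabHash: Broadcast[Map[String, Int]],
    maxSentenceLength: Int = 1000): RDD[Array[Int]] = {
  dataset.mapPartitions { sentenceIter =>
    sentenceIter.flatMap { sentence =>
      // Keep only in-vocab words, mapped to their vocab indexes.
      val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
      // Chunk overly long sentences, mirroring MAX_SENTENCE_LENGTH in C.
      wordIndexes.grouped(maxSentenceLength).map(_.toArray)
    }
  }
}
```

    This keeps each input sentence as a hard boundary while still chunking overly long ones, matching the C implementation's combined newline + `MAX_SENTENCE_LENGTH` behaviour.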

