Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/7760#issuecomment-126202241
  
    About scaling by counts in 
[https://github.com/apache/spark/pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277]:
    
    > I don't think so; the bound is for the entire document (joint over all 
words in document), not per-word. This is needed when doing online alpha 
hyperparameter estimate updates. bound also is not part of the public API.
    
    That bound should count each token (instance of a word), but its current 
implementation would treat a document "blah" and a document "blah blah" as 
identical.
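
    To illustrate with a minimal plain-Scala sketch (not Spark's actual code; the method names and values are hypothetical), weighting each distinct term's contribution by its count is what makes "blah blah" count twice:

    ```scala
    object BoundSketch {
      // termScores(i): log-likelihood-style contribution of distinct term i (hypothetical values)
      // counts(i): number of times term i occurs in the document
      def docBoundIgnoringCounts(termScores: Array[Double]): Double =
        termScores.sum  // treats "blah" and "blah blah" identically -- the reported problem

      def docBoundWeightedByCounts(termScores: Array[Double], counts: Array[Double]): Double =
        termScores.zip(counts).map { case (s, c) => c * s }.sum  // every token is counted
    }
    ```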
    
    > Scaling by count for perplexity is done here
    
    This scaling is meant to make the value per-word, i.e., an average over all tokens.  The 
issue is that the bound computed earlier is not quite per-word.
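
    A sketch of that per-word scaling (assumed shapes, not the actual LDAModel code): the corpus bound is divided by the total token count, i.e., the sum of all term counts rather than the number of distinct terms, so the division only yields a true per-token average if the bound itself accumulated every token:

    ```scala
    object PerplexitySketch {
      // corpusBound: variational lower bound summed over all documents (assumed already computed)
      // docTokenCounts: total token count of each document (sum of its term counts)
      def logPerplexitySketch(corpusBound: Double, docTokenCounts: Seq[Double]): Double = {
        val totalTokens = docTokenCounts.sum  // tokens, not distinct terms
        -corpusBound / totalTokens            // negative average bound per token
      }
    }
    ```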
    
    I think this problem could be caught by modifying the unit test so that some 
terms appear multiple times.  (I'm trying this out currently.)
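
    A hypothetical check in the spirit of that test change, reusing the BoundSketch helpers above (not the actual LDASuite code):

    ```scala
    val scores = Array(-1.2, -0.7)   // hypothetical per-term contributions
    val once   = Array(1.0, 1.0)     // document "a b"
    val twice  = Array(2.0, 1.0)     // document "a a b": term "a" has multiple copies
    // A count-weighted bound distinguishes the two documents...
    assert(BoundSketch.docBoundWeightedByCounts(scores, twice)
      < BoundSketch.docBoundWeightedByCounts(scores, once))
    // ...while a count-ignoring bound returns the same value for both, hiding the issue.
    ```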

