GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/7760#issuecomment-126202241
About scaling by counts in
[https://github.com/apache/spark/pull/7760/files#diff-965c75b823b8cbfb304a6f6774681ccaR277]:
> I don't think so; the bound is for the entire document (joint over all
> words in the document), not per-word. This is needed when doing online
> updates of the alpha hyperparameter estimate. The bound is also not part
> of the public API.

That bound should count each token (each instance of a word), but its
current implementation would treat a document "blah" and a document
"blah blah" as identical.
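For concreteness, here is a minimal sketch of what I mean (not the PR's
actual code; `termCounts` and `termLogScore` are made-up names standing in
for the per-document term counts and whatever per-term contribution the
bound computes):

```scala
// Hypothetical sketch: each term's contribution to the per-document bound
// is weighted by its token count, so "blah blah" contributes twice what
// "blah" does instead of being scored identically.
def docBound(termCounts: Map[Int, Double], termLogScore: Int => Double): Double =
  termCounts.map { case (termId, count) => count * termLogScore(termId) }.sum
```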
> Scaling by count for perplexity is done here

That scaling is what makes the value per-word, i.e., an average over all
tokens. The issue is that the bound computed earlier is not quite per-word
to begin with.
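Roughly, the intended relationship is the usual one (sketch only;
`corpusBound` and `totalTokenCount` are assumed inputs, not names from the
PR):

```scala
// Hypothetical sketch: perplexity is exp of the negative per-token bound,
// i.e., the corpus-level bound averaged over all tokens in the corpus.
def perplexity(corpusBound: Double, totalTokenCount: Double): Double =
  math.exp(-corpusBound / totalTokenCount)
```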
I think this problem would be caught by modifying the unit test so that
some terms appear multiple times within a document. (I'm trying this out
currently.)
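Something along these lines (sketch only, not the actual test change;
`vocabSize` is arbitrary, and I'm using a plain Seq rather than an RDD):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical test data: the same term appears with count 1 in one
// document and count 2 in another, so a bound that ignores counts would
// give identical values for the two documents while a correct one would not.
val vocabSize = 5
val docs = Seq(
  (0L, Vectors.sparse(vocabSize, Seq((0, 1.0)))),  // "blah"
  (1L, Vectors.sparse(vocabSize, Seq((0, 2.0))))   // "blah blah"
)
```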