GitHub user rnowling commented on the pull request:
https://github.com/apache/spark/pull/2494#issuecomment-56449372
@Ishiihara If you look at the original JIRA, this was the functionality
requested by the user. For the case you mention (high TF in a couple of
documents), you would want to handle that separately in the transform()
function where you could consider both the IDF and TF values.
As for space, it could be beneficial to create sparser vectors as a result
of the filtering. However, I chose not to make that change because it may cause
problems for users who expect the resulting TF-IDF vectors to have the same
values as the sparse or dense TF vectors. The way I've implemented the changes
minimizes the overall effect on the user. I believe a separate PR should be
created for space optimizations if they are going to change the API.
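
To illustrate the behavior described above, here is a minimal sketch in plain
Python. It is not the MLlib code from this PR: the function names and the
`min_doc_freq` parameter are hypothetical, and the smoothed formula
log((m + 1) / (df + 1)) is an assumption made for the example. The point is
only that a filtered term keeps its slot in the vector with a weight of 0, so
the TF-IDF vectors stay the same length as the TF vectors.

```python
import math


def fit_idf(tf_vectors, min_doc_freq=0):
    """Return one IDF weight per term (hypothetical helper).

    Terms that appear in fewer than min_doc_freq documents get a weight
    of 0.0 instead of being removed, so the vector length is unchanged.
    """
    num_docs = len(tf_vectors)
    num_terms = len(tf_vectors[0])
    idf = []
    for j in range(num_terms):
        # Document frequency: number of documents containing term j.
        doc_freq = sum(1 for tf in tf_vectors if tf[j] > 0)
        if doc_freq >= min_doc_freq:
            # Assumed smoothed IDF formula for this sketch.
            idf.append(math.log((num_docs + 1.0) / (doc_freq + 1.0)))
        else:
            idf.append(0.0)
    return idf


def transform(idf, tf):
    """Scale each TF value by the IDF weight of the same term."""
    return [t * w for t, w in zip(tf, idf)]


# Four documents over a three-term vocabulary (dense TF vectors).
tf_vectors = [
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 1.0],
    [0.0, 5.0, 0.0],
    [1.0, 0.0, 0.0],
]

# Term 1 appears in only one document, so it is filtered out.
idf = fit_idf(tf_vectors, min_doc_freq=2)
tfidf = [transform(idf, tf) for tf in tf_vectors]
```

In this sketch the third document has a high TF for the filtered term, yet its
TF-IDF vector is all zeros. That is the case raised in the review, and it is
the kind of situation that would need separate handling in transform(). A
space optimization would instead drop the zeroed entries from sparse vectors,
which changes what callers get back.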