[GitHub] spark pull request: [SPARK-3614][MLLIB] Add minimumOccurence filte...

rnowling Mon, 22 Sep 2014 14:53:58 -0700

Github user rnowling commented on the pull request:

    https://github.com/apache/spark/pull/2494#issuecomment-56449372
  
    @Ishiihara If you look at the original JIRA, this is the functionality 
the user requested.  For the case you mention (a high TF in only a couple of 
documents), you would want to handle that separately in the transform() 
function, where you could consider both the IDF and TF values.
    
    As for space, it could be beneficial to create sparser vectors as a result 
of the filtering.  However, I chose not to make that change because it may cause 
problems for users who expect the resulting TF-IDF vectors to have the same 
values as the sparse or dense TF vectors.  The way I've implemented the changes 
minimizes the overall effect on the user.  I believe a separate PR should be 
created for space optimizations if they are going to change the API.
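
    To illustrate the behavior described above, here is a minimal sketch in 
plain Python, not the code from this PR.  It assumes the smoothed formula 
idf = log((m + 1) / (df + 1)) and assumes that a filtered term gets an IDF 
weight of 0 while keeping its position in the vector.  The names fit_idf, 
transform and min_doc_freq are hypothetical, chosen only for this example.

```python
import math


def fit_idf(tf_vectors, min_doc_freq=0):
    """Return one IDF weight per term index, given dense TF vectors.

    Assumption: terms seen in fewer than min_doc_freq documents get weight 0.
    """
    num_docs = len(tf_vectors)
    num_terms = len(tf_vectors[0])
    # Document frequency: in how many documents does each term occur?
    doc_freq = [sum(1 for doc in tf_vectors if doc[j] > 0)
                for j in range(num_terms)]
    idf = []
    for df in doc_freq:
        if df < min_doc_freq:
            # Filtered term: zero weight, but the slot in the vector is kept.
            idf.append(0.0)
        else:
            idf.append(math.log((num_docs + 1.0) / (df + 1.0)))
    return idf


def transform(tf_vector, idf):
    """Scale each TF entry by its IDF weight; the vector length is unchanged."""
    return [tf * weight for tf, weight in zip(tf_vector, idf)]


# Term 2 has a high TF but occurs in a single document.
docs = [
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 5.0],
]
idf = fit_idf(docs, min_doc_freq=2)
tfidf = [transform(doc, idf) for doc in docs]
```

    With min_doc_freq=2, terms 1 and 2 are zeroed out, yet every TF-IDF vector 
still has three entries, so callers that index by term position are not 
affected.  Dropping the zeroed entries to save space would change the shape of 
the output, which is why it is left out of this sketch.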



