[ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142827#comment-14142827
 ] 

Andrew Ash commented on SPARK-3614:
-----------------------------------

Great! I assigned this ticket to you, RJ.  Please try to have a draft commit 
within a couple of weeks for review so that others who might want to work on 
this can see progress being made.  Otherwise, it's best to leave tickets 
unassigned while no one is actively working on them.

Thanks!

> Filter on minimum occurrences of a term in IDF 
> -----------------------------------------------
>
>                 Key: SPARK-3614
>                 URL: https://issues.apache.org/jira/browse/SPARK-3614
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Jatinpreet Singh
>            Assignee: RJ Nowling
>            Priority: Minor
>              Labels: TFIDF
>
> The IDF class in MLlib does not provide a way to specify the minimum number 
> of documents in the corpus that a term must appear in. The idea is to add a 
> cutoff variable that defines this minimum occurrence value; terms with a 
> lower document frequency are ignored.
> Mathematically,
> IDF(t,D) = log( (|D|+1) / (DF(t,D)+1) ), for DF(t,D) >= minimumOccurance
> where
> |D| is the total number of documents in the corpus D
> DF(t,D) is the number of documents that contain the term t
> minimumOccurance is the minimum number of documents in the corpus in which 
> a term must appear
> This would affect accuracy, since terms that appear in fewer than a certain 
> number of documents have little or no importance in TF-IDF vectors.
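
The cutoff proposed in the description can be sketched in plain Python. This 
is only an illustration of the formula, not the MLlib implementation: the 
function name idf_with_min_occurrence and the parameter minimum_occurrence 
are hypothetical, and it assumes that "ignored" means a term below the cutoff 
is dropped from the result (an implementation could equally assign it a 
weight of 0).

```python
import math
from collections import Counter


def idf_with_min_occurrence(docs, minimum_occurrence=1):
    """Return {term: IDF} for terms whose document frequency meets the cutoff.

    docs is a list of tokenized documents (each a list of terms).
    Hypothetical helper for illustration; not part of MLlib.
    """
    num_docs = len(docs)  # |D|
    # DF(t, D): count each term at most once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log((num_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
        if df >= minimum_occurrence  # assumption: terms below the cutoff are dropped
    }


docs = [["spark", "mllib", "idf"], ["spark", "idf"], ["spark", "tfidf"]]
idf = idf_with_min_occurrence(docs, minimum_occurrence=2)
```

With these three toy documents and a cutoff of 2, only "spark" (in 3 
documents) and "idf" (in 2 documents) are kept; "mllib" and "tfidf" each 
appear in a single document and are filtered out.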




