Tim Allison created TIKA-2822:
---------------------------------

             Summary: Update common tokens files
                 Key: TIKA-2822
                 URL: https://issues.apache.org/jira/browse/TIKA-2822
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison
            Assignee: Tim Allison


We initially created the common tokens files (the top 20k tokens by document 
frequency in Wikipedia) with Lucene 6.x.  We should rerun that code with an 
updated Lucene in case there are slight changes in tokenization. 
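
For readers unfamiliar with the term, "top tokens by document frequency" means ranking tokens by how many documents they appear in, not by their total number of occurrences. The following is only a rough sketch of that idea, not the code used to build the Tika files: the function name top_tokens_by_doc_freq is hypothetical, and lowercasing plus whitespace splitting is an assumption standing in for a Lucene analyzer.

```python
from collections import Counter


def top_tokens_by_doc_freq(docs, n=20000):
    """Return the n tokens that occur in the most documents (hypothetical helper)."""
    doc_freq = Counter()
    for text in docs:
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        # set() makes a token count once per document, however often it repeats.
        doc_freq.update(set(text.lower().split()))
    # Sort by descending document frequency, then alphabetically to break ties.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:n]]


docs = ["the cat sat", "the dog sat down", "the the the bird"]
print(top_tokens_by_doc_freq(docs, n=2))  # ['the', 'sat']
```

Because the ranking depends entirely on how text is split into tokens, even a small change in tokenizer behavior between versions could alter which tokens make the cut.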
 

While doing this work, I found a trivial bug in filtering common tokens that we 
should fix as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)