Tim Allison created TIKA-2822:
---------------------------------
Summary: Update common tokens files
Key: TIKA-2822
URL: https://issues.apache.org/jira/browse/TIKA-2822
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Assignee: Tim Allison
We initially created the common tokens files (the top 20k tokens by document
frequency in Wikipedia) with Lucene 6.x. We should rerun that code with an
updated version of Lucene in case tokenization has changed slightly.
While doing this work, I found a trivial bug in filtering common tokens that we
should fix as well.
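
For readers unfamiliar with the term, here is a minimal sketch of what "top tokens by document frequency" means. It is not the Tika or Lucene code: the regex tokenizer stands in for a real Lucene analyzer, and the function name top_tokens_by_doc_freq and the sample documents are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def top_tokens_by_doc_freq(docs, k=20000):
    """Return the k (token, document_frequency) pairs with the highest document frequency."""
    doc_freq = Counter()
    for text in docs:
        # Hypothetical stand-in tokenizer: lowercase, then split on non-word characters.
        # A real run would use a Lucene analyzer, whose output may differ by version.
        tokens = set(re.findall(r"\w+", text.lower()))
        # Using a set means each document counts a given token at most once.
        doc_freq.update(tokens)
    # Rank by descending document frequency; break ties alphabetically for stable output.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Illustrative input only; the actual files were built from Wikipedia.
docs = ["The cat sat", "the dog sat", "The bird flew"]
top = top_tokens_by_doc_freq(docs, k=2)
```

Because the ranking depends entirely on how text is split into tokens, a change in tokenizer behavior between versions can shift which tokens make the cutoff, which is the reason for rerunning with an updated Lucene.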