[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison reopened TIKA-2822:
-------------------------------
I forgot to remove common HTML markup terms. Rerunning now.
See:
https://issues.apache.org/jira/browse/TIKA-2267?focusedCommentId=15872055&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15872055
> Update common tokens files for tika-eval
> ----------------------------------------
>
> Key: TIKA-2822
> URL: https://issues.apache.org/jira/browse/TIKA-2822
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 1.21
>
>
> We initially created the common tokens files (top 20k tokens by document
> frequency) in Wikipedia with Lucene 6.x. We should rerun that code with an
> updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
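
For readers unfamiliar with the idea, here is a minimal sketch of what "top N tokens by document frequency, with markup terms removed" could look like. It is not the tika-eval code: it assumes a simple regex tokenizer in place of a Lucene analyzer, and the MARKUP_TERMS set and the function names are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical markup terms to drop; NOT the actual list used by tika-eval.
MARKUP_TERMS = {"html", "body", "div", "span", "href", "td", "tr"}

# Assumption: a plain word-character tokenizer stands in for a Lucene analyzer.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def top_tokens_by_doc_freq(docs, n=20000, stop_terms=MARKUP_TERMS):
    """Return the n tokens that occur in the most documents.

    Each token is counted at most once per document (document frequency,
    not term frequency). Tokens in stop_terms are excluded before counting.
    """
    doc_freq = Counter()
    for doc in docs:
        # set() collapses repeats so a token counts once per document.
        doc_freq.update(set(tokenize(doc)) - set(stop_terms))
    # Sort by descending document frequency, then alphabetically for ties.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]
```

Calling top_tokens_by_doc_freq(docs, n=20000) on an iterable of document strings returns (token, document_frequency) pairs. A different tokenizer can produce a slightly different list, which is why rerunning after a tokenizer upgrade can change the output.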
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)