[ https://issues.apache.org/jira/browse/TIKA-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2800.
-------------------------------
    Resolution: Fixed
      Assignee: Tim Allison
 Fix Version/s: 1.20
                2.0.0

> Include num of unique common/alphabetic tokens (types) in tika-eval
> -------------------------------------------------------------------
>
>                 Key: TIKA-2800
>                 URL: https://issues.apache.org/jira/browse/TIKA-2800
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0, 1.20
>
>
> We include token and unique token (type) counts in tika-eval. We should
> include type counts for alphabetic and common words. If one tool is
> incorrectly duplicating/triplicating content dramatically, that would
> incorrectly inflate the "common_tokens" sum for that tool.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
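
The inflation effect described in the issue can be illustrated with a small sketch. This is not tika-eval's implementation: the whitespace tokenizer, the `COMMON_WORDS` set, the `token_stats` helper, and the result keys are all hypothetical names chosen for illustration. It only shows why a count of unique common tokens (types) stays stable when an extractor duplicates content, while the plain common-token sum grows.

```python
from collections import Counter

# Hypothetical stand-in for a common-words list; not tika-eval's actual list.
COMMON_WORDS = {"the", "quick", "fox", "over", "dog"}


def token_stats(text, common_words=COMMON_WORDS):
    """Count tokens and types (unique tokens), overall and for common words.

    Assumes naive lowercase whitespace tokenization, for illustration only.
    """
    tokens = text.lower().split()
    common = Counter(t for t in tokens if t in common_words)
    return {
        "num_tokens": len(tokens),
        "num_unique_tokens": len(set(tokens)),
        "num_common_tokens": sum(common.values()),
        "num_unique_common_tokens": len(common),
    }


doc = "the quick brown fox jumps over the lazy dog"

# One extractor returns the text once; another wrongly emits it three times.
clean = token_stats(doc)
tripled = token_stats(" ".join([doc] * 3))

print(clean)
print(tripled)
```

In this sketch the tripled extract reports three times as many common tokens as the clean one, yet both report the same number of unique common tokens, which is the signal the issue asks tika-eval to expose.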