Tim Allison created TIKA-2800:
---------------------------------

             Summary: Include num of unique common/alphabetic tokens (types) in tika-eval
                 Key: TIKA-2800
                 URL: https://issues.apache.org/jira/browse/TIKA-2800
             Project: Tika
          Issue Type: Improvement
            Reporter: Tim Allison

tika-eval already reports token counts and unique token (type) counts. We should also report type counts for alphabetic tokens and for common words. If one tool incorrectly duplicates or triplicates content, the repeated text inflates the "common_tokens" sum for that tool, whereas a count of unique common words would not grow with the repetition.
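
To illustrate the concern, here is a minimal sketch in Python rather than tika-eval's own code. It assumes simple whitespace tokenization and a tiny hypothetical common-words set, and the statistic names are illustrative, not tika-eval's actual column names. It compares the counts for a text extracted once against the same text triplicated.

```python
from collections import Counter

# Hypothetical common-words list for illustration only;
# tika-eval uses its own per-language common-word lists.
COMMON_WORDS = {"the", "quick", "fox", "over", "dog"}


def token_stats(text, common_words=COMMON_WORDS):
    """Return token and type (unique token) counts for whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    # Restrict the frequency table to common words and to alphabetic tokens.
    common = {t: n for t, n in counts.items() if t in common_words}
    alphabetic = {t: n for t, n in counts.items() if t.isalpha()}
    return {
        "num_tokens": len(tokens),
        "num_unique_tokens": len(counts),
        "num_common_tokens": sum(common.values()),
        "num_unique_common_tokens": len(common),
        "num_alphabetic_tokens": sum(alphabetic.values()),
        "num_unique_alphabetic_tokens": len(alphabetic),
    }


once = "the quick brown fox jumps over the lazy dog 42"
tripled = " ".join([once] * 3)  # simulates a tool that triplicates content

stats_once = token_stats(once)
stats_tripled = token_stats(tripled)

for name in stats_once:
    print(f"{name}: once={stats_once[name]} tripled={stats_tripled[name]}")
```

In this sketch the token-based sums triple with the duplicated content (6 common tokens become 18), while the type counts stay the same (5 unique common words in both cases), which is why the type counts are the more robust signal when comparing tools.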