[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756400#comment-16756400 ]
Hudson commented on TIKA-2822:
------------------------------
UNSTABLE: Integrated in Jenkins build tika-2.x-windows #379 (See
[https://builds.apache.org/job/tika-2.x-windows/379/])
TIKA-2822 -- remove common >=4 letter html markup entities (tallison: rev
8c22f054ea94526e6d22a3f4c923e0b8724f2831)
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tools/TopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/nl
* (edit) tika-eval/src/main/resources/common_tokens/el
* (edit) tika-eval/src/main/resources/common_tokens/ru
* (edit) tika-eval/src/main/resources/common_tokens/zh-cn
* (edit) tika-eval/src/main/resources/common_tokens/es
* (edit) tika-eval/src/main/resources/common_tokens/de
* (edit) tika-eval/src/main/resources/common_tokens/en
* (edit) tika-eval/src/main/resources/common_tokens/he
* (edit) tika-eval/src/main/resources/common_tokens/pt
* (edit) tika-eval/src/main/resources/common_tokens/fa
* (edit) tika-eval/src/main/resources/common_tokens/ar
* (edit) tika-eval/src/main/resources/common_tokens/id
* (edit) tika-eval/src/main/resources/common_tokens/ko
* (edit) tika-eval/src/main/resources/common_tokens/bn
* (edit) tika-eval/src/main/resources/common_tokens/hi
* (edit) tika-eval/src/main/resources/common_tokens/it
* (edit) tika-eval/src/main/resources/common_tokens/ur
* (edit) tika-eval/src/main/resources/common_tokens/zh-tw
* (edit) tika-eval/src/main/resources/common_tokens/ja
* (edit) tika-eval/src/main/resources/common_tokens/fr
* (edit) tika-eval/src/main/resources/common_tokens/vi
> Update common tokens files for tika-eval
> ----------------------------------------
>
> Key: TIKA-2822
> URL: https://issues.apache.org/jira/browse/TIKA-2822
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 1.21
>
>
> We initially created the common tokens files (top 20k tokens by document
> frequency) in Wikipedia with Lucene 6.x. We should rerun that code with an
> updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
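
For readers unfamiliar with the idea, here is a minimal sketch of "top N tokens by document frequency" with a filter for HTML entity names. It is not the code in TopCommonTokenCounter.java: the class name, the regex tokenizer (standing in for a Lucene analyzer), and the two-entry SKIP set are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the tika-eval implementation.
class TopTokensByDocFreq {

    // Assumed examples of >=4 letter HTML entity names to drop;
    // the actual list used by the commit is not shown in this message.
    static final Set<String> SKIP = new HashSet<>(Arrays.asList("nbsp", "quot"));

    static List<String> topTokens(List<String> docs, int topN) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Count each token at most once per document.
            Set<String> seen = new HashSet<>();
            // Simple regex split on non-letter/non-digit runs (a stand-in for Lucene).
            for (String tok : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (tok.isEmpty() || SKIP.contains(tok)) {
                    continue;
                }
                if (seen.add(tok)) {
                    docFreq.merge(tok, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        // Highest document frequency first; ties broken alphabetically.
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat &nbsp; the", "the dog", "cat &nbsp; bird");
        System.out.println(topTokens(docs, 2));
    }
}
```

Counting a token once per document, rather than once per occurrence, is what distinguishes document frequency from raw term frequency. The real files use a cutoff of 20k tokens per language.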