[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754206#comment-16754206 ]
Hudson commented on TIKA-2822:
------------------------------
SUCCESS: Integrated in Jenkins build tika-branch-1x #156 (See
[https://builds.apache.org/job/tika-branch-1x/156/])
TIKA-2822 -- update common tokens lists with 7.x Lucene. (tallison:
[https://github.com/apache/tika/commit/ea23d254b7ab1ff75b71cc2f8dcc03d9baa7f1ab])
* (edit) tika-eval/src/main/resources/common_tokens/fr
* (edit) tika-eval/src/main/resources/common_tokens/el
* (edit) tika-eval/src/main/resources/common_tokens/fa
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/SlowCompositeReaderWrapper.java
* (edit) tika-eval/src/main/resources/common_tokens/nl
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/TopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/en
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
* (edit) tika-eval/src/main/resources/common_tokens/ru
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (edit) tika-eval/src/main/resources/common_tokens/hi
* (edit) tika-eval/src/main/resources/common_tokens/ja
* (edit) CHANGES.txt
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/BatchTopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/es
* (edit) tika-eval/src/main/resources/common_tokens/zh-cn
* (edit) tika-eval/src/main/resources/common_tokens/ur
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/vi
* (edit) tika-eval/src/main/resources/common_tokens/it
* (add) tika-eval/src/test/java/org/apache/tika/tools/TopCommonTokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/ar
* (edit) tika-eval/src/main/resources/common_tokens/he
* (edit) tika-eval/src/main/resources/common_tokens/zh-tw
* (edit) tika-eval/src/main/resources/common_tokens/de
* (edit) tika-eval/src/main/resources/common_tokens/id
* (edit) tika-eval/src/main/resources/common_tokens/ko
* (add) tika-eval/src/main/resources/common_tokens/bn
* (edit) tika-eval/src/main/resources/common_tokens/pt
> Update common tokens files for tika-eval
> ----------------------------------------
>
> Key: TIKA-2822
> URL: https://issues.apache.org/jira/browse/TIKA-2822
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 1.21
>
>
> We initially created the common tokens files (the top 20k tokens by document
> frequency in Wikipedia) with Lucene 6.x. We should rerun that code with an
> updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
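
For readers unfamiliar with the term, "top tokens by document frequency" can
be illustrated with a minimal sketch. This is not the code from the commit
and does not use Lucene: the class name TopTokensByDocFreq, the method
topTokens, and the simple regex tokenizer are all assumptions made for
illustration. Each token is counted once per document in which it appears,
and the n tokens found in the most documents are returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not tika-eval's TopCommonTokenCounter.
public class TopTokensByDocFreq {

    // Returns the n tokens that occur in the largest number of documents.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topTokens(List<String> docs, int n) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Collect distinct tokens so each document counts a token once.
            Set<String> seen = new HashSet<>();
            // Assumed tokenizer: lowercase, split on non-letter/non-digit runs.
            for (String tok : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (!tok.isEmpty()) {
                    seen.add(tok);
                }
            }
            for (String tok : seen) {
                docFreq.merge(tok, 1, Integer::sum);
            }
        }
        List<String> tokens = new ArrayList<>(docFreq.keySet());
        tokens.sort((a, b) -> {
            int byFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return new ArrayList<>(tokens.subList(0, Math.min(n, tokens.size())));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat", "the the bird");
        System.out.println(topTokens(docs, 2)); // prints [the, sat]
    }
}
```

Note that "the" appears twice in the third document but still contributes
only one to its document frequency. A different tokenizer can split the same
text differently, which is why the lists might shift slightly between Lucene
versions.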