[ 
https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754206#comment-16754206
 ] 

Hudson commented on TIKA-2822:
------------------------------

SUCCESS: Integrated in Jenkins build tika-branch-1x #156 (See 
[https://builds.apache.org/job/tika-branch-1x/156/])
TIKA-2822 -- update common tokens lists with 7.x Lucene. (tallison: 
[https://github.com/apache/tika/commit/ea23d254b7ab1ff75b71cc2f8dcc03d9baa7f1ab])
* (edit) tika-eval/src/main/resources/common_tokens/fr
* (edit) tika-eval/src/main/resources/common_tokens/el
* (edit) tika-eval/src/main/resources/common_tokens/fa
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/SlowCompositeReaderWrapper.java
* (edit) tika-eval/src/main/resources/common_tokens/nl
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/TopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/en
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
* (edit) tika-eval/src/main/resources/common_tokens/ru
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (edit) tika-eval/src/main/resources/common_tokens/hi
* (edit) tika-eval/src/main/resources/common_tokens/ja
* (edit) CHANGES.txt
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/BatchTopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/es
* (edit) tika-eval/src/main/resources/common_tokens/zh-cn
* (edit) tika-eval/src/main/resources/common_tokens/ur
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/vi
* (edit) tika-eval/src/main/resources/common_tokens/it
* (add) tika-eval/src/test/java/org/apache/tika/tools/TopCommonTokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/ar
* (edit) tika-eval/src/main/resources/common_tokens/he
* (edit) tika-eval/src/main/resources/common_tokens/zh-tw
* (edit) tika-eval/src/main/resources/common_tokens/de
* (edit) tika-eval/src/main/resources/common_tokens/id
* (edit) tika-eval/src/main/resources/common_tokens/ko
* (add) tika-eval/src/main/resources/common_tokens/bn
* (edit) tika-eval/src/main/resources/common_tokens/pt


> Update common tokens files for tika-eval
> ----------------------------------------
>
>                 Key: TIKA-2822
>                 URL: https://issues.apache.org/jira/browse/TIKA-2822
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 1.21
>
>
> We initially created the common tokens files (the top 20k tokens by document
> frequency in Wikipedia) with Lucene 6.x. We should rerun that code with an
> updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
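
For readers unfamiliar with the term, "top tokens by document frequency" ranks each token by the number of documents it appears in, not by its total number of occurrences. The following is a minimal sketch of that idea only. It is not the code in TopCommonTokenCounter: the function name is hypothetical, and plain whitespace splitting stands in for the Lucene analyzer that tika-eval actually uses.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_tokens_by_document_frequency(
    documents: Iterable[Iterable[str]], limit: int = 20_000
) -> List[Tuple[str, int]]:
    """Return up to `limit` (token, document frequency) pairs, most common first."""
    doc_freq: Counter = Counter()
    for tokens in documents:
        # set() makes a token count once per document, however often it repeats.
        doc_freq.update(set(tokens))
    # Sort by descending document frequency; break ties alphabetically
    # so the output is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Assumption: whitespace tokenization as a stand-in for a real analyzer.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]
top = top_tokens_by_document_frequency(docs, limit=3)
```

Because the ranking depends on how text is split into tokens, a change in the tokenizer between Lucene versions could shift which tokens make the cut, which is the reason given above for regenerating the lists.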



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
