[
https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754132#comment-16754132
]
Tim Allison commented on TIKA-2822:
-----------------------------------
Last time I did this, IIRC, there were separate {{zh-tw}} and {{zh-cn}} wiki
dumps. These have since been unified into a single {{zh}} dump, with mapping
code applied at presentation time. The character/term/word mappings are
available here:
https://phab.wmfusercontent.org/file/data/ycg62tzo5qyv5txmiamh/PHID-FILE-66gf4k72tgxhksd5j36x/ZhConversion.php
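To illustrate, a mapping table like ZhConversion.php can be applied with a greedy longest-match pass over the text. This is only a sketch: the table entries below are illustrative examples, not the real ZhConversion data, and the function name is hypothetical.

```python
# Hypothetical sketch of applying ZhConversion-style mappings: per-character
# and multi-character phrase lookups, longest match first. The table entries
# here are illustrative, not the actual ZhConversion.php contents.
ZH2HANT = {
    "万": "萬",      # simplified -> traditional, single character
    "与": "與",
    "软件": "軟體",   # multi-character term mapping
}

def convert(text: str, table: dict) -> str:
    """Greedy longest-match conversion over the mapping table."""
    out = []
    i = 0
    max_len = max(map(len, table)) if table else 1
    while i < len(text):
        for n in range(max_len, 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            out.append(text[i])  # no mapping: pass the character through
            i += 1
    return "".join(out)

# e.g. convert("万与软件", ZH2HANT) yields "萬與軟體"
```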
> Update common tokens files for tika-eval
> ----------------------------------------
>
> Key: TIKA-2822
> URL: https://issues.apache.org/jira/browse/TIKA-2822
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
>
> We initially created the common tokens files (top 20k tokens by document
> frequency) from Wikipedia dumps with Lucene 6.x. We should rerun that code
> with an updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)