[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754160#comment-16754160 ]
Tim Allison commented on TIKA-2822:
-----------------------------------
The code I used for wiki-munging is here:
https://github.com/tballison/hodgepodge/tree/master/wiki-munging/src/main/java
Steps:
1) Download the articles/pages dumps (or a subset for English) from
http://dumps.wikimedia.org/<LANGUAGE_CODE>wiki, e.g.
http://dumps.wikimedia.org/enwiki
2) Run WikiToTable, which relies on Jimmy Lin's {{org.wikiclean:wikiclean}},
to strip out the wiki markup and write a gzipped table, one row per document.
3) Run WikiZhConverter on the {{zh}} dump, once for {{zh-tw}} and once for
{{zh-cn}}.
4) Place all of the table files in a directory and run
{{org.apache.tika.eval.tools.BatchTopCommonTokenCounter}} on that directory.
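The final step above boils down to counting, for each token, how many documents contain it and keeping the top N. As a rough illustration only (not the actual {{BatchTopCommonTokenCounter}} code, which uses a Lucene analyzer for tokenization), a minimal Java sketch with a placeholder whitespace tokenizer:

```java
import java.util.*;
import java.util.stream.*;

public class TopTokensByDocFreq {

    // Return the top-k tokens ranked by document frequency, i.e. the number
    // of documents in which a token appears at least once. Ties are broken
    // alphabetically so the result is deterministic.
    static List<String> topTokens(List<String> docs, int k) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Naive lowercase whitespace split stands in for a real
            // Lucene analyzer; a Set so each token counts once per document.
            Set<String> seen =
                new HashSet<>(Arrays.asList(doc.toLowerCase().split("\\s+")));
            for (String token : seen) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        return docFreq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat",
                "the dog ran",
                "a cat and a dog");
        // "cat", "dog", and "the" each appear in two documents.
        System.out.println(topTokens(docs, 3));
    }
}
```

The real tool reads one document per row from the gzipped tables produced in steps 2–3 and keeps the top 20k tokens per language.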
> Update common tokens files for tika-eval
> ----------------------------------------
>
> Key: TIKA-2822
> URL: https://issues.apache.org/jira/browse/TIKA-2822
> Project: Tika
> Issue Type: Improvement
> Components: tika-eval
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
>
> We initially created the common tokens files (top 20k tokens by document
> frequency) from Wikipedia dumps with Lucene 6.x. We should rerun that code
> with an updated Lucene on the off chance that tokenization has changed
> slightly.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)