[ https://issues.apache.org/jira/browse/TIKA-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2267.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 1.15
                   2.0

I took the top 20k tokens by document frequency from the Wikipedia dumps for each language.
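
To make the "document frequency" step concrete: count the number of documents each token appears in at least once (not total occurrences), then keep the top k. This is an illustrative stdlib sketch, not the actual tika-eval code, and it uses a tiny k for demonstration:

```java
import java.util.*;
import java.util.stream.*;

// Sketch only (not the tika-eval implementation): compute document
// frequency -- the number of documents a token occurs in at least once --
// and keep the top-k tokens. The real run used k = 20,000.
public class TopTokens {
    public static List<String> topByDocFreq(List<List<String>> docs, int k) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // Deduplicate within a document so a token counts once per doc.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df.entrySet().stream()
                 .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                 .limit(k)
                 .map(Map.Entry::getKey)
                 .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("language", "detection", "language"),
            Arrays.asList("language", "tokens"),
            Arrays.asList("tokens", "language"));
        // "language" appears in all 3 docs, so it ranks first.
        System.out.println(topByDocFreq(docs, 1));
    }
}
```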

I ignored Wikipedia pages that conflicted with Optimaize's language id (e.g., if I was processing the ptwiki and Optimaize identified a page as "es", I ignored that page).

I used some heuristics to try to ignore pages that were link/reference articles 
or other non-content articles.

I attempted to randomly sample 500k articles. For English, I pulled only the first 10 bzip2 files; for the other languages, I pulled all of them.
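
One standard way to draw a uniform random sample from a stream of pages whose total count isn't known up front is reservoir sampling. The sketch below is an illustration of that general technique, not the actual sampling code used for this issue:

```java
import java.util.*;

// Illustrative sketch of reservoir sampling: keep a uniform sample of k
// items from a stream of unknown length. Not the actual TIKA-2267 code.
public class Reservoir {
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);              // fill the reservoir first
            } else {
                int j = rng.nextInt(seen);        // uniform in [0, seen)
                if (j < k) reservoir.set(j, item); // replace with prob k/seen
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < 1000; i++) pages.add(i);
        List<Integer> sampled = sample(pages.iterator(), 5, new Random(42));
        System.out.println(sampled.size()); // always exactly k = 5
    }
}
```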

I removed common HTML markup tokens (e.g. body, html, script).  If we allowed those, a failed HTML extraction that leaves raw markup in the text would produce an incorrectly inflated "common tokens" count.

I removed terms fewer than 4 characters long, except for CJK.
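
The length rule amounts to: keep a token only if it is 4-20 characters long, unless it is CJK (which the chain below turns into 2-character bigrams). A rough stdlib approximation of that check -- the real logic lives in CJKBigramAwareLengthFilterFactory, so treat this as a sketch:

```java
// Rough approximation of the length rule: drop tokens shorter than 4 (or
// longer than 20) characters, but exempt CJK ideographic tokens, which
// the analysis chain emits as 2-character bigrams. Sketch only; the real
// filter is org.apache.tika.eval.tokens.CJKBigramAwareLengthFilterFactory.
public class LengthRule {
    static boolean isCjk(String token) {
        return !token.isEmpty()
            && token.codePoints().allMatch(Character::isIdeographic);
    }

    static boolean keep(String token) {
        if (isCjk(token)) return true;           // CJK bigrams are exempt
        int len = token.length();
        return len >= 4 && len <= 20;
    }

    public static void main(String[] args) {
        System.out.println(keep("the"));   // false: too short
        System.out.println(keep("token")); // true
        System.out.println(keep("中文"));   // true: CJK is exempt
    }
}
```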

I added ___url___ and ___email___ so that those would exist for every language 
model.
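
The two PatternReplaceFilter entries in the chain below do this normalization. In plain Java regex terms (a sketch using the same patterns outside any Lucene machinery), the per-token effect is roughly:

```java
import java.util.regex.Pattern;

// Sketch of the ___email___ / ___url___ normalization, using the same
// regexes as the PatternReplaceFilterFactory entries in the chain.
// The email check runs first, matching the filter order in the chain.
public class Placeholders {
    static final Pattern EMAIL =
        Pattern.compile("^[\\w+\\.]{1,30}@(?:\\w+\\.){1,10}\\w+$");
    static final Pattern URL =
        Pattern.compile("^(?:(?:ftp|https?):\\/\\/)?(?:\\w+\\.){1,10}\\w+$");

    static String normalize(String token) {
        if (EMAIL.matcher(token).matches()) return "___email___";
        if (URL.matcher(token).matches()) return "___url___";
        return token;
    }

    public static void main(String[] args) {
        System.out.println(normalize("tim@apache.org"));        // ___email___
        System.out.println(normalize("https://wikipedia.org")); // ___url___
        System.out.println(normalize("language"));              // unchanged
    }
}
```

Since the scheme is optional in the URL pattern, a bare host-like token such as "apache.org" also collapses to ___url___.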

If we change the underlying Lucene analysis chain, we'll have to reprocess the 
wikidumps.

The files are sorted in descending document frequency.  It is clear that the 
wiki markup stripper wasn't perfect (words for links/references show up 
frequently), but this seems like a reasonable start.

For posterity, I used this analysis chain:

{noformat}
      "tokenizer": {
        "factory": "oala.standard.UAX29URLEmailTokenizerFactory",
        "params": {}
      },
      "tokenfilters": [
        {
          "factory": "oala.icu.ICUFoldingFilterFactory",
          "params": {}
        },
        {
          "factory": "org.apache.tika.eval.tokens.AlphaIdeographFilterFactory",
          "params": {}
        },
        {
          "factory": "oala.pattern.PatternReplaceFilterFactory",
          "params": {
            "pattern": "^[\\w+\\.]{1,30}@(?:\\w+\\.){1,10}\\w+$",
            "replacement": "___email___",
            "replace": "all"
          }
        },
        {
          "factory": "oala.pattern.PatternReplaceFilterFactory",
          "params": {
            "pattern": "^(?:(?:ftp|https?):\\/\\/)?(?:\\w+\\.){1,10}\\w+$",
            "replacement": "___url___",
            "replace": "all"
          }
        },
        {
          "factory": "oala.cjk.CJKBigramFilterFactory",
          "params": {
            "outputUnigrams": "false"
          }
        },
        {
          "factory": "org.apache.tika.eval.tokens.CJKBigramAwareLengthFilterFactory",
          "params": {
            "min": 4,
            "max": 20
          }
        }
      ]
    }

{noformat}

Full list of words removed:
{noformat}
span
table
href
head
title
body
html
tagname
lang
style
script
strong
blockquote
form
iframe
section
colspan
{noformat}
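
Filtering those markup tokens out of the counts amounts to a simple stopset check. A stdlib sketch of the idea (not the actual tika-eval code):

```java
import java.util.*;

// Sketch: drop the HTML markup tokens listed above before counting, so a
// failed HTML extraction can't inflate the common-tokens count.
public class MarkupFilter {
    static final Set<String> MARKUP = new HashSet<>(Arrays.asList(
        "span", "table", "href", "head", "title", "body", "html",
        "tagname", "lang", "style", "script", "strong", "blockquote",
        "form", "iframe", "section", "colspan"));

    static List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!MARKUP.contains(t)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        // Only the non-markup token survives.
        System.out.println(filter(Arrays.asList("html", "body", "language")));
    }
}
```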

> Add common tokens files for tika-eval
> -------------------------------------
>
>                 Key: TIKA-2267
>                 URL: https://issues.apache.org/jira/browse/TIKA-2267
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>
> We should add some common tokens files for popular languages for tika-eval so 
> that users don't have to generate their own.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
