[jira] Created: (LUCENE-1103) WikipediaTokenizer

Grant Ingersoll (JIRA) Fri, 28 Dec 2007 09:21:14 -0800

WikipediaTokenizer
------------------

                 Key: LUCENE-1103
                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Analysis
            Reporter: Grant Ingersoll
            Assignee: Grant Ingersoll
            Priority: Minor



I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens 
with certain attributes.  It isn't necessarily complete, but it does a good 
enough job for it to be consumed and improved by others.

It sets the Token.type() value according to the Wikipedia syntax a token 
appears in (links, internal links, bold, italics, etc.), based on my reading of 
http://en.wikipedia.org/wiki/Wikipedia:Tutorial
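
To illustrate the idea of typing tokens by the markup they appear in, here is a 
minimal standalone sketch. It is not the patch itself and does not use the 
Lucene API: the class WikiTypeSketch, the TypedToken holder, and the type labels 
are all hypothetical names chosen for this example, and the regular expression 
covers only [[internal links]], '''bold''' and ''italics''.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiTypeSketch {
    // Hypothetical type labels; not the values used by the actual patch.
    static final String INTERNAL_LINK = "internal_link";
    static final String BOLD = "bold";
    static final String ITALICS = "italics";
    static final String WORD = "word";

    // Simple holder standing in for a token's text and its type value.
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    // Alternation order matters: ''' (bold) must be tried before '' (italics).
    private static final Pattern MARKUP = Pattern.compile(
            "\\[\\[(.+?)\\]\\]|'''(.+?)'''|''(.+?)''|(\\w+)");

    static List<TypedToken> tokenize(String input) {
        List<TypedToken> out = new ArrayList<>();
        Matcher m = MARKUP.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                addWords(out, m.group(1), INTERNAL_LINK);
            } else if (m.group(2) != null) {
                addWords(out, m.group(2), BOLD);
            } else if (m.group(3) != null) {
                addWords(out, m.group(3), ITALICS);
            } else {
                out.add(new TypedToken(m.group(4), WORD));
            }
        }
        return out;
    }

    // Every word inside a marked-up span inherits the type of that span.
    private static void addWords(List<TypedToken> out, String span, String type) {
        for (String w : span.split("\\W+")) {
            if (!w.isEmpty()) {
                out.add(new TypedToken(w, type));
            }
        }
    }

    public static void main(String[] args) {
        String text = "See [[Apache Lucene]] for '''fast''' and ''simple'' search";
        for (TypedToken t : tokenize(text)) {
            System.out.println(t.text + " -> " + t.type);
        }
    }
}
```

In this sketch both "Apache" and "Lucene" come out typed as internal_link, 
"fast" as bold, "simple" as italics, and everything else as word. A consumer 
could then filter or boost on the type, which is the sort of use the type 
values are meant to enable.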

I have only tested it with the Wikipedia content read by the benchmark 
EnwikiDocMaker, where it seems to do a decent job.

Caveats:  I am not sure how best to handle testing.  Since the content is 
licensed under the GNU Free Documentation License, I believe I can't copy and 
paste a whole document into the unit test.  I have hand-coded one document and 
have another test that simply runs over the benchmark Wikipedia download.

One more question is where to put it.  It could go in analysis, but the tests 
at least will have a dependency on Benchmark.  I am thinking of adding a new 
contrib/wikipedia where this could grow to include other Wikipedia-related code 
(perhaps we would move EnwikiDocMaker there?) and reverse the dependency, so 
that Benchmark depends on it rather than the other way around.

I will post a patch over the next few days.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


