[ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556057#action_12556057 ]
Grant Ingersoll commented on LUCENE-1103:
-----------------------------------------

I updated the link handling slightly. The first token of a link (i.e. the URL
or the Wiki link) now gets a position increment and the next token in the link
does not, so the link and the first display token end up at the same position.
The first way I had it put the link token at the same position as the previous
token. I also modified the EXTERNAL link state to recognize https.

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch,
> LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark
> tokens with certain attributes. It isn't necessarily complete, but it does a
> good enough job for it to be consumed and improved by others.
> It sets the Token.type() value depending on the Wikipedia syntax (links,
> internal links, bold, italics, etc.) based on my pass at
> http://en.wikipedia.org/wiki/Wikipedia:Tutorial
> I have only tested it with the benchmarking EnwikiDocMaker wikipedia stuff
> and it seems to do a decent job.
> Caveats: I am not sure how to best handle testing, since the content is
> licensed under GNU Free Doc License, I believe I can't copy and paste a whole
> document into the unit test. I have hand coded one doc and have another one
> that just generally runs over the benchmark Wikipedia download.
> One more question is where to put it. It could go in analysis, but the tests
> at least will have a dependency on Benchmark.
> I am thinking of adding a new contrib/wikipedia where this could grow to have
> other wikipedia things (perhaps we would move EnwikiDocMaker there????) and
> reverse the dependency on Benchmark.
> I will post a patch over the next few days.
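
The position handling described in the comment at the top of this message can
be illustrated with a small sketch. It is not taken from the patch and does not
use any Lucene classes: Tok, linkTokens and positions are hypothetical names
made up for illustration, and the example link text is invented. It only
assumes the usual convention that a token's position is the running sum of
position increments, so an increment of 0 stacks a token on the previous one.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkPositionSketch {

    /** Hypothetical stand-in for a token: term text plus position increment. */
    static final class Tok {
        final String text;
        final int posInc;

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    /**
     * Emits the tokens for one link: the link target first, then the display
     * words. The target advances the position by one; the first display word
     * uses an increment of 0 so that it shares the target's position.
     */
    static List<Tok> linkTokens(String target, String... displayWords) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(target, 1));
        for (int i = 0; i < displayWords.length; i++) {
            out.add(new Tok(displayWords[i], i == 0 ? 0 : 1));
        }
        return out;
    }

    /** Absolute positions as a running sum of increments, first token at 0. */
    static int[] positions(List<Tok> toks) {
        int[] pos = new int[toks.size()];
        int current = -1;
        for (int i = 0; i < toks.size(); i++) {
            current += toks.get(i).posInc;
            pos[i] = current;
        }
        return pos;
    }

    public static void main(String[] args) {
        List<Tok> stream = new ArrayList<>();
        stream.add(new Tok("see", 1)); // ordinary token before the link
        stream.addAll(linkTokens("https://lucene.apache.org", "Lucene", "home"));

        int[] pos = positions(stream);
        for (int i = 0; i < pos.length; i++) {
            System.out.println(pos[i] + "\t" + stream.get(i).text);
        }
    }
}
```

In this sketch the link target and "Lucene" land on the same position, one past
"see". Giving the target an increment of 0 instead would reproduce the earlier
behaviour, where the link token sat on top of the preceding token.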