[ https://issues.apache.org/jira/browse/LUCENE-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555642#action_12555642 ]
Doug Cutting commented on LUCENE-1103:
--------------------------------------

Should the position increment be zero for link URLs, so that phrase searches work correctly with anchors? One might even index URLs in a separate field...

> WikipediaTokenizer
> ------------------
>
>                 Key: LUCENE-1103
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1103
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-1103.patch, LUCENE-1103.patch, LUCENE-1103.patch,
> LUCENE-1103.patch
>
>
> I have extended StandardTokenizer to recognize Wikipedia syntax and mark
> tokens with certain attributes. It isn't necessarily complete, but it does a
> good enough job for it to be consumed and improved by others.
>
> It sets the Token.type() value depending on the Wikipedia syntax (links,
> internal links, bold, italics, etc.) based on my pass at
> http://en.wikipedia.org/wiki/Wikipedia:Tutorial
>
> I have only tested it with the benchmarking EnwikiDocMaker Wikipedia stuff
> and it seems to do a decent job.
>
> Caveats: I am not sure how best to handle testing. Since the content is
> licensed under the GNU Free Documentation License, I believe I can't copy and
> paste a whole document into the unit test. I have hand-coded one doc and have
> another one that just generally runs over the benchmark Wikipedia download.
>
> One more question is where to put it. It could go in analysis, but the tests
> at least will have a dependency on Benchmark. I am thinking of adding a new
> contrib/wikipedia where this could grow to have other Wikipedia things
> (perhaps we would move EnwikiDocMaker there?) and reverse the dependency
> on Benchmark.
>
> I will post a patch over the next few days.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
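
To make the position-increment question in the comment above concrete, here is a minimal sketch in plain Java. It deliberately does not use Lucene's classes: Tok, the type labels "word", "url" and "anchor", the sample text and example.org are all hypothetical stand-ins, not what the patch actually emits. It only models positions as a running sum of increments and checks whether a two-word phrase lands on consecutive positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionIncrementSketch {

    /** Toy stand-in for an analyzed token; NOT Lucene's Token class. */
    public static final class Tok {
        final String term;
        final String type;            // hypothetical type label
        final int positionIncrement;  // distance from the previous token

        Tok(String term, String type, int positionIncrement) {
            this.term = term;
            this.type = type;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Absolute position of each token: running sum of increments, starting at -1. */
    public static List<Integer> positions(List<Tok> tokens) {
        List<Integer> result = new ArrayList<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.positionIncrement;
            result.add(pos);
        }
        return result;
    }

    /** True if 'first' and 'second' occur at consecutive positions. */
    public static boolean phraseMatches(List<Tok> tokens, String first, String second) {
        List<Integer> pos = positions(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).term.equals(first)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(j).term.equals(second) && pos.get(j) == pos.get(i) + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Hypothetical tokens for the wiki text: the [http://example.org lucene] site */
    public static List<Tok> sample(int urlIncrement) {
        return Arrays.asList(
            new Tok("the", "word", 1),
            new Tok("http://example.org", "url", urlIncrement),
            new Tok("lucene", "anchor", 1),
            new Tok("site", "word", 1));
    }

    public static void main(String[] args) {
        // URL takes its own position: "the" and "lucene" are two positions apart.
        System.out.println(phraseMatches(sample(1), "the", "lucene")); // false
        // URL shares the position of "the": the phrase is adjacent again.
        System.out.println(phraseMatches(sample(0), "the", "lucene")); // true
    }
}
```

With an increment of 1 the URL sits between "the" and "lucene" and breaks the phrase; with an increment of 0 it is stacked on the preceding token's position and the visible text stays contiguous. Indexing URLs in a separate field, the other option raised above, would avoid the question by keeping them out of the text field's positions altogether.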