[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828939#comment-15828939 ]
Michael McCandless commented on LUCENE-5012:
--------------------------------------------

[~mattweber], I realized I had more private changes that I never pushed to that old branch, so I recovered them, fixed them to apply to current master, and pushed here: https://github.com/mikemccand/lucene-solr/commits/graph_token_filters

I also removed the controversial {{InsertDeletedPunctuationStage}}.

Some tests are still failing ... I'll try to fix them.

I think the ideas here are very promising. The write-once attributes (LUCENE-2450, folded into this branch) are cleaner than what Lucene has today, and the ease of making new positions without having to re-number previous ones makes graph token streams much easier.

I tried to add the equivalent of {{CharFilter}} here, by using a new {{TextAttribute}} that stages before tokenization can use to read from a {{Reader}} or a {{String}}, and remap; I like that this makes offset correction more local than what {{correctOffset}} exposes today. And it means char filtering is simply another stage, not a separate class.

I also added {{int[] parts}} to {{OffsetAttribute}}; the idea here is to empower token filters (not just tokenizers) to properly correct offsets, so that e.g. WDGF could work "correctly", but I'm not sure it's worth the hassle: I haven't fully implemented it, and doing so is surprisingly tricky.

> Make graph-based TokenFilters easier
> ------------------------------------
>
>                 Key: LUCENE-5012
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5012
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5012.patch, LUCENE-5012.patch
>
>
> SynonymFilter has two limitations today:
>   * It cannot create positions, so e.g. dns -> domain name service
>     creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
>     others).
>   * It cannot consume a graph, so e.g. if you try to apply synonyms
>     after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.
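
To make the first limitation concrete, here is a minimal, self-contained Java sketch of the PosInc/PosLen representation for the "dns -> domain name service" case. It does not use any Lucene classes: {{PosGraphSketch}}, {{Token}}, {{toArcs}}, {{flatStream}} and {{graphStream}} are hypothetical names made up for illustration, and the "flat" layout is only an assumption about how a filter that cannot create positions might stack the extra tokens onto later ones.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, not Lucene API.
class PosGraphSketch {

    /** A token carrying a position increment and a position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Turns a PosInc/PosLen stream into explicit "term:from->to" arcs. */
    static List<String> toArcs(List<Token> tokens) {
        List<String> arcs = new ArrayList<>();
        int from = -1; // so that a first token with posInc=1 starts at position 0
        for (Token t : tokens) {
            from += t.posInc;
            arcs.add(t.term + ":" + from + "->" + (from + t.posLen));
        }
        return arcs;
    }

    /** Assumed layout when no positions can be created: synonym tokens are
     *  stacked on top of "dns" and then on top of the following token. */
    static List<Token> flatStream() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("dns", 1, 1));
        tokens.add(new Token("domain", 0, 1));
        tokens.add(new Token("lookup", 1, 1));
        tokens.add(new Token("name", 0, 1));
        tokens.add(new Token("service", 1, 1));
        return tokens;
    }

    /** Layout with two new positions: "dns" spans three positions and
     *  "lookup" has to move from position 1 to position 3. */
    static List<Token> graphStream() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("dns", 1, 3));
        tokens.add(new Token("domain", 0, 1));
        tokens.add(new Token("name", 1, 1));
        tokens.add(new Token("service", 1, 1));
        tokens.add(new Token("lookup", 1, 1));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println("flat:  " + toArcs(flatStream()));
        System.out.println("graph: " + toArcs(graphStream()));
    }
}
```

In the flat layout "name" and "lookup" share the arc 1->2, so the stream also admits paths such as "domain lookup service" that were never in the text or the synonym rule. The graph layout is accurate, but only because "lookup" was renumbered after the fact, which is the kind of re-numbering the comment above says the new approach avoids.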