[ https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-5012:
---------------------------------------

    Attachment: LUCENE-5012.patch

Initial patch showing the approach. This patch also includes write-once attr bindings (LUCENE-2450).

There are some big changes here:

  * All the changes from LUCENE-2450: no global attribute bindings; instead each stage owns/controls what the next stage can see.

  * Instead of PosInc/LenAtt, there is ArcAttribute, which has a from node and a to node (ints). Tokens are arcs, and are free to have arbitrary to/from.

  * An "adapter" StageAnalyzer that takes the write-once Stage and creates an analyzer, converting the ArcAttribute into PosIncAtt in the end. So, positions are only "assigned" in the final adapter stage (StageToTokenStream).

  * A SynonymFilterStage that fixes the two SynonymFilter limitations described in the issue (and is also quite a bit simpler).

  * A SplitOnDashStage that shows how a decompounder works.

  * Holes are done with a new DeletedAttribute, i.e. the token still runs through the entire chain, but it's marked as deleted so that stages along the way know to ignore it. E.g. this would make it possible for a tokenizer to produce punctuation tokens that are skipped for indexing but prevent a SynonymFilter from matching "over" the punctuation.

There is some added tracking of nodes that are not "done yet", necessary to allow incremental consumption of the graph by all stages. It adds some hair to graph stages but I don't see how to simplify it while keeping incrementality...

One nice side effect of this change is it's no longer possible to create a first token with position=-1, since the mapping of node id -> position is done for you. Also, the graph is intact throughout the chain, until the very end where it is "cast" to a sausage (what the indexer requires), vs today where SynonymFilter does its own sausagizing.
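To make the node id -> position mapping concrete, here is a minimal, self-contained Java sketch. It is not the code from the patch: the names Arc, PosToken and toPositions are hypothetical, and it assumes the whole graph is available at once (the patch consumes it incrementally), that the graph is acyclic, and that a node's position is its longest-path distance from the start node.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, no Lucene classes. All names here are hypothetical.
final class ArcsToPositions {

    /** Stand-in for a token that carries an arc: a term plus from/to node ids. */
    record Arc(String term, int from, int to) {}

    /** What a position-based consumer sees: position increment and position length. */
    record PosToken(String term, int posInc, int posLen) {}

    /** Maps node ids to positions, then rewrites each arc as (posInc, posLen). */
    static List<PosToken> toPositions(List<Arc> arcs, int startNode) {
        // Assumption: position of a node = length of the longest path from startNode.
        Map<Integer, Integer> pos = new HashMap<>();
        pos.put(startNode, 0);

        // Relax arcs until nothing changes; a DAG settles within arcs.size() + 1 passes.
        int passes = 0;
        boolean changed = true;
        while (changed) {
            if (passes++ > arcs.size()) {
                throw new IllegalArgumentException("graph has a cycle");
            }
            changed = false;
            for (Arc a : arcs) {
                Integer from = pos.get(a.from());
                if (from == null) {
                    continue; // from-node not reached yet in this pass
                }
                if (from + 1 > pos.getOrDefault(a.to(), -1)) {
                    pos.put(a.to(), from + 1);
                    changed = true;
                }
            }
        }

        for (Arc a : arcs) {
            if (!pos.containsKey(a.from())) {
                throw new IllegalArgumentException("arc not reachable from start: " + a);
            }
        }

        // Emit tokens ordered by start position; the sort is stable, so ties keep input order.
        List<Arc> sorted = new ArrayList<>(arcs);
        sorted.sort(Comparator.comparingInt((Arc a) -> pos.get(a.from())));

        List<PosToken> out = new ArrayList<>();
        int lastPos = -1; // so the first token always gets posInc >= 1
        for (Arc a : sorted) {
            int start = pos.get(a.from());
            int end = pos.get(a.to());
            out.add(new PosToken(a.term(), start - lastPos, end - start));
            lastPos = start;
        }
        return out;
    }
}
```

For the dns -> domain name service case, feeding in the arcs dns (0 -> 3), domain (0 -> 1), name (1 -> 2) and service (2 -> 3) with start node 0 gives dns posInc=1/posLen=3, domain posInc=0/posLen=1, then name and service with posInc=1/posLen=1 each. Because positions are derived from node ids, the first token starts at position 0 by construction.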

While the patch is just a prototype and there's still tons to do (long, long ways from committing, very much "exploratory"), I think it's far enough along that it shows the promise of both write-once attr bindings and an easier API for graph-based analysis components. Tricky cases that don't work with TokenStream today, e.g. a decompounder followed by a syn filter, do work in the patch.

> Make graph-based TokenFilters easier
> ------------------------------------
>
>                 Key: LUCENE-5012
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5012
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5012.patch
>
>
> SynonymFilter has two limitations today:
>   * It cannot create positions, so e.g. dns -> domain name service
>     creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
>     others).
>   * It cannot consume a graph, so e.g. if you try to apply synonyms
>     after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.