[jira] [Updated] (LUCENE-5012) Make graph-based TokenFilters easier

Michael McCandless (JIRA) Tue, 21 May 2013 13:07:16 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-5012:
---------------------------------------

    Attachment: LUCENE-5012.patch

Initial patch showing the approach.  This patch also includes
write-once attr bindings (LUCENE-2450).  There are some big changes
here:

  * All the changes from LUCENE-2450: no global attribute bindings;
    instead each stage owns/controls what the next stage can see.

  * Instead of PosIncAtt/PosLenAtt, there is ArcAttribute, which has
    from and to nodes (ints).  Tokens are arcs, and are free to have
    arbitrary to/from nodes.

  * An "adapter" StageAnalyzer that takes the write-once Stage and
    creates an analyzer, converting the ArcAttribute into PosIncAtt at
    the end.  So, positions are only "assigned" in the final adapter
    stage (StageToTokenStream).

  * A SynonymFilterStage that fixes the two SynonymFilter limitations
    from the issue description (and is also quite a bit simpler).

  * A SplitOnDashStage that shows how a decompounder works.

  * Holes are done with a new DeletedAttribute, i.e. the token still
    runs through the entire chain, but it's marked as deleted so that
    stages along the way know to ignore it.  E.g. this would make it
    possible for a tokenizer to produce punctuation tokens that are
    skipped for indexing but prevent a SynonymFilter from matching
    "over" the punctuation.

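As a rough illustration of the arc and deleted-token ideas in the list
above (this is not code from the patch): the sketch below models each
token as an arc between two int nodes and keeps a punctuation token in
the graph as a deleted arc.  The names ArcGraphSketch, Arc, matchesFrom
and liveTerms, and the "wi-fi" input, are hypothetical, and the graph is
assumed to be acyclic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; these names are not taken from the LUCENE-5012 patch. */
final class ArcGraphSketch {

  /** A token modeled as an arc between two graph nodes. */
  static final class Arc {
    final String term;
    final int from;
    final int to;
    final boolean deleted; // stand-in for the DeletedAttribute idea

    Arc(String term, int from, int to, boolean deleted) {
      this.term = term;
      this.from = from;
      this.to = to;
      this.deleted = deleted;
    }
  }

  /** True if the terms spell out a path of non-deleted arcs starting at node start. */
  static boolean matchesFrom(List<Arc> arcs, int start, List<String> terms) {
    if (terms.isEmpty()) {
      return true;
    }
    for (Arc arc : arcs) {
      if (!arc.deleted
          && arc.from == start
          && arc.term.equals(terms.get(0))
          && matchesFrom(arcs, arc.to, terms.subList(1, terms.size()))) {
        return true;
      }
    }
    return false;
  }

  /** Terms an indexer would see: everything except deleted arcs. */
  static List<String> liveTerms(List<Arc> arcs) {
    List<String> terms = new ArrayList<>();
    for (Arc arc : arcs) {
      if (!arc.deleted) {
        terms.add(arc.term);
      }
    }
    return terms;
  }

  public static void main(String[] args) {
    // "wi-fi": the dash stays in the graph, marked as deleted.
    List<Arc> arcs = Arrays.asList(
        new Arc("wi", 0, 1, false),
        new Arc("-", 1, 2, true),
        new Arc("fi", 2, 3, false));
    // A two-word rule "wi fi" cannot match over the dash.
    System.out.println(matchesFrom(arcs, 0, Arrays.asList("wi", "fi"))); // false
    System.out.println(liveTerms(arcs)); // [wi, fi]
  }
}
```

Because the deleted arc still occupies the span from node 1 to node 2,
there is no live path from "wi" straight to "fi", while the indexable
output contains only the two word tokens.
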
There is some added tracking of nodes that are not "done yet",
necessary to allow incremental consumption of the graph by all
stages.  It adds some hair to graph stages but I don't see how to
simplify it while keeping incrementality...

One nice side effect of this change is it's no longer possible to
create a first token with position=-1, since the mapping of node id ->
position is done for you.

Also, the graph is intact throughout the chain, until the very end
where it is "cast" to a sausage (what the indexer requires), vs today
where SynonymFilter does its own sausagizing.
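
To make the node id -> position mapping concrete, here is a minimal
sketch of that final step (not the patch's StageToTokenStream).  It
assumes node ids increase along every arc, which the patch does not
require, and it ignores deleted tokens.  SausageSketch, Arc, Positioned
and toPositions are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch; these names are not taken from the LUCENE-5012 patch. */
final class SausageSketch {

  /** A token modeled as an arc between two graph nodes. */
  static final class Arc {
    final String term;
    final int from;
    final int to;

    Arc(String term, int from, int to) {
      this.term = term;
      this.from = from;
      this.to = to;
    }
  }

  /** A token in position-increment / position-length form. */
  static final class Positioned {
    final String term;
    final int posInc;
    final int posLen;

    Positioned(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }

    @Override
    public String toString() {
      return term + " posInc=" + posInc + " posLen=" + posLen;
    }
  }

  static List<Positioned> toPositions(List<Arc> arcs) {
    // Collect every node id in ascending order.
    TreeSet<Integer> nodes = new TreeSet<>();
    for (Arc arc : arcs) {
      nodes.add(arc.from);
      nodes.add(arc.to);
    }
    // Assumption: ids increase along arcs, so sorted order is position order.
    Map<Integer, Integer> positionOf = new HashMap<>();
    for (int node : nodes) {
      positionOf.put(node, positionOf.size());
    }
    // Emit arcs ordered by start position (the sort is stable).
    List<Arc> sorted = new ArrayList<>(arcs);
    sorted.sort(Comparator.comparingInt(arc -> positionOf.get(arc.from)));

    List<Positioned> out = new ArrayList<>();
    int lastPos = -1; // the first token therefore always gets posInc >= 1
    for (Arc arc : sorted) {
      int pos = positionOf.get(arc.from);
      out.add(new Positioned(arc.term, pos - lastPos, positionOf.get(arc.to) - pos));
      lastPos = pos;
    }
    return out;
  }

  public static void main(String[] args) {
    // dns -> domain name service, with arbitrary (but increasing) node ids.
    List<Arc> arcs = Arrays.asList(
        new Arc("dns", 10, 40),
        new Arc("domain", 10, 20),
        new Arc("name", 20, 30),
        new Arc("service", 30, 40));
    for (Positioned token : toPositions(arcs)) {
      System.out.println(token);
    }
    // dns posInc=1 posLen=3
    // domain posInc=0 posLen=1
    // name posInc=1 posLen=1
    // service posInc=1 posLen=1
  }
}
```

Since positions are derived from the nodes rather than supplied by each
stage, the first emitted token always lands on position 0 in this
sketch.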

While the patch is just a prototype and there's still tons to do
(long, long ways from committing, very much "exploratory"), I think
it's far enough along that it shows the promise of both write-once
attr bindings and an easier API for graph-based analysis components.
Tricky cases that don't work with TokenStream today, e.g. a
decompounder followed by a syn filter, do work in the patch.

                
> Make graph-based TokenFilters easier
> ------------------------------------
>
>                 Key: LUCENE-5012
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5012
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-5012.patch
>
>
> SynonymFilter has two limitations today:
>   * It cannot create positions, so eg dns -> domain name service
>     creates blatantly wrong highlights (SOLR-3390, LUCENE-4499 and
>     others).
>   * It cannot consume a graph, so e.g. if you try to apply synonyms
>     after Kuromoji tokenizer I'm not sure what will happen.
> I've thought about how to fix these issues but it's really quite
> difficult with the current PosInc/PosLen graph representation, so I'd
> like to explore an alternative approach.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
