[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492904#comment-16492904
 ] 

David Smiley commented on LUCENE-8332:
--------------------------------------

Oh, I wanted to mention one thing; perhaps just here, though I could put it in 
the docs.

An alternative approach to this tagger might be to use the SynonymGraphFilter 
(with other steps/configuration), which has a lot of similarities with the 
Tagger's algorithm.  I've heard of others that have done this (Dice.com?), and 
before I created the tagger I thought about this approach too.  There are some 
issues/barriers to "just" using the synonym filter:
* if the filter finds multiple overlapping matches, it returns only one, and 
you have no control over its choice.  (Compare to the SolrTextTagger's (STT's) 
"overlaps" param, which offers several choices and is pluggable.)
* the filter doesn't hold any metadata; it's just a set of names.  Though you 
could use synonyms to map to an ID that you then look up in something else 
(e.g. some DB or Solr index).
* the synonym filter must re-construct its FST on startup each time; 
customizations are necessary to load an existing one from disk.
* you have to arrange for any text processing/analysis (e.g. tokenization rules 
or phonetic filters) of the dictionary to create synonym entries.  With the STT 
this is all configurable in a standard way like any text field.
* and of course you'd have to glue it all together somehow.
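
To make the first two points concrete, here is a rough, self-contained sketch 
of what caller-controlled overlap handling could look like.  It is not code 
from the tagger or from SynonymGraphFilter: the class OverlapSketch, the Span 
type, the policy names KEEP_ALL and DROP_NESTED, and the sample IDs are all 
made up for illustration.  Each match carries an ID, standing in for the 
metadata you would look up elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only; not SolrTextTagger or SynonymGraphFilter code. */
final class OverlapSketch {

  /** A dictionary match covering token positions [start, end). */
  static final class Span {
    final int start;
    final int end;
    final String id; // hypothetical ID to look up in a DB or index

    Span(int start, int end, String id) {
      this.start = start;
      this.end = end;
      this.id = id;
    }

    int length() {
      return end - start;
    }

    /** True if the other span lies entirely inside this one. */
    boolean contains(Span other) {
      return start <= other.start && other.end <= end;
    }

    @Override
    public String toString() {
      return "[" + start + "," + end + ")=" + id;
    }
  }

  /** Hypothetical policies; the names are assumptions for this sketch. */
  enum Policy { KEEP_ALL, DROP_NESTED }

  /** Applies the chosen policy and preserves the input order. */
  static List<Span> resolve(List<Span> matches, Policy policy) {
    List<Span> out = new ArrayList<>();
    switch (policy) {
      case KEEP_ALL:
        out.addAll(matches);
        break;
      case DROP_NESTED:
        for (Span m : matches) {
          boolean nested = false;
          for (Span o : matches) {
            // Drop m only if a strictly longer match fully covers it.
            if (o != m && o.contains(m) && o.length() > m.length()) {
              nested = true;
              break;
            }
          }
          if (!nested) {
            out.add(m);
          }
        }
        break;
    }
    return out;
  }

  public static void main(String[] args) {
    // Tokens: new(0) york(1) city(2) hall(3)
    List<Span> matches = Arrays.asList(
        new Span(0, 2, "new-york"),
        new Span(0, 3, "nyc"),
        new Span(2, 4, "city-hall"));
    System.out.println(resolve(matches, Policy.KEEP_ALL));
    System.out.println(resolve(matches, Policy.DROP_NESTED));
  }
}
```

With DROP_NESTED, "new york" is dropped because "new york city" fully covers 
it, while "city hall" stays since it only partially overlaps.  The point is 
that the caller picks the policy, which the plain synonym filter approach 
doesn't offer.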

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8332
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8332
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Let's move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
