[jira] [Comment Edited] (SOLR-7136) Add an AutoPhrasing TokenFilter

Ted Sullivan (JIRA) Fri, 20 Nov 2015 08:14:00 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018207#comment-15018207
 ]


Ted Sullivan edited comment on SOLR-7136 at 11/20/15 4:12 PM:
--------------------------------------------------------------

Thanks for this submission [[email protected]]! I think that this really 
helps to scale the autophrasing solution. Also the improvement in dealing with 
PositionLength is a big plus, as are the improvements in the query parser. 
Great work, thanks.

I have seen some reports of memory leaks against the GitHub version of my 
code. Have you looked into that? I will take your patch and run some A/B 
comparisons to see whether the new FSM implementation (hopefully) removes 
that problem too. More generally, have you done any performance or scaling 
tests on your version of the autophrasing filter? That goes hand in hand with 
the production-readiness that your new implementation makes possible. Thanks 
again for submitting this patch.

As to complementarity with SOLR-4381, I agree; it is nice to hear that the 
two solutions play nicely with each other :) IMO this is an important problem 
that needs a committed solution. The more ways we give Solr users to 
"skin the cat", the better the chance that they will find a solution for their 
own problem set.

As to the acronym 'DC': it is ambiguous because it also stands for 
"District of Columbia". Domain context will certainly clear this up to some 
extent, but not if you have a global search problem like Google or Bing. I'll 
look into this problem too.


> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch, 
> SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter designed to let noun phrases that 
> represent a single entity be tokenized as a single token. Adds 
> support for ManagedResources and Query parser auto-phrasing support given 
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long-standing 
> multi-term synonym problem in Lucene/Solr has been documented online. 
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
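
To illustrate the idea in the description above (a multi-word noun phrase 
emitted as one token), here is a minimal standalone sketch in plain Java. It is 
not the code from the attached patch and does not use the Lucene TokenFilter 
API or the FSM implementation discussed in the comment. The class name 
AutoPhraseSketch, the method autoPhrase, the greedy longest-match strategy, and 
the underscore used to join words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AutoPhraseSketch {

    /**
     * Hypothetical helper (not part of SOLR-7136): scans the token list left to
     * right and, at each position, replaces the longest run of tokens that
     * matches a known phrase with one joined token.
     *
     * @param tokens  already-tokenized, lowercased input
     * @param phrases known phrases, each given as a list of words
     * @param maxLen  length in words of the longest known phrase
     */
    public static List<String> autoPhrase(List<String> tokens,
                                          Set<List<String>> phrases,
                                          int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Try the longest candidate first so "new york city" wins over "new york".
            for (int len = Math.min(maxLen, tokens.size() - i); len >= 2; len--) {
                if (phrases.contains(tokens.subList(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                // Assumption: join the words with '_' to form the single token.
                out.add(String.join("_", tokens.subList(i, i + matched)));
                i += matched;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

With the phrases "new york" and "new york city" registered and maxLen set to 3, 
the input tokens flights, to, new, york, city come out as flights, to, 
new_york_city. A real filter would also have to maintain token offsets and 
position attributes such as PositionLength, which this sketch ignores.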


