[
https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018207#comment-15018207
]
Ted Sullivan edited comment on SOLR-7136 at 11/20/15 4:12 PM:
--------------------------------------------------------------
Thanks for this submission [[email protected]]! I think that this really
helps to scale the autophrasing solution. Also the improvement in dealing with
PositionLength is a big plus, as are the improvements in the query parser.
Great work, thanks.
I have seen some reports of memory leaks in the GitHub version of my code.
Have you looked at that? I will take your patch and try to do some A/B
comparisons on this to see if the new FSM implementation (hopefully) removes
that problem too. But in general, have you done any performance/scaling tests
on your version of the autophrasing filter? Obviously, this goes along with the
production-readiness that your new implementation makes possible. Thanks again
for submitting this patch.
As to complementarity with SOLR-4381 - I would agree - nice to hear that the
two solutions play nicely with each other :) IMO this is an important problem
that needs a committed solution. The more ways we give Solr users to "skin the
cat", the better the chance that they will find a solution for their own
problem set.
As to the acronym 'DC' - this is ambiguous because it also stands for
"District of Columbia". Domain context will certainly clear this up to some
extent, but not if you have a global search problem like Google or Bing. I'll
look into this problem too.
> Add an AutoPhrasing TokenFilter
> -------------------------------
>
> Key: SOLR-7136
> URL: https://issues.apache.org/jira/browse/SOLR-7136
> Project: Solr
> Issue Type: New Feature
> Reporter: Ted Sullivan
> Attachments: AutoPhaseFiniteStateDiagram.pdf, SOLR-7136.patch,
> SOLR-7136.patch, SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases
> that represent a single entity to be tokenized as a single token. Also adds
> support for ManagedResources and for auto-phrasing in the query parser, given
> LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long-standing
> multi-term synonym problem in Lucene/Solr has been documented online.
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/