[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-10-02 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783998#comment-13783998 ]

Michael McCandless commented on LUCENE-5214:


bq. I was curious why you did not implement the load and store methods 
for your AnalyzingInfixSuggester, rather than building the index in the ctor.

Well ... once you .build() the AnalyzingInfixSuggester, it's already stored 
since it's backed by an on-disk index.  So this suggester is somewhat different 
from others (it's not RAM resident ... hmm unless you provide a RAMDir in 
getDirectory).

In the ctor, if there's already a previously built suggester, I just open the 
searcher there.  I suppose we could move that code into load() instead?

bq. Was it because they take an Input/OutputStream?

That is sort of weird; I think we have an issue open to change that to 
Directory or maybe IndexInput/Output or something ...

bq. What are your thoughts on generalizing the interface so that the index can 
be loaded and stored, as is done by all the other suggesters?

+1 to somehow improving the suggester APIs (I think there's yet another issue 
open for that).

Do you mean loaded into a RAMDir?

bq. this may not be the most relevant place to ask, but I will anyway.

That's fine :)  Just send an email to dev@ next time ...

bq.  your AnalyzingInfixSuggester 

It's not mine :)  Anyone can and should go fix it!

 Add new FreeTextSuggester, to handle long tail suggestions
 

 Key: LUCENE-5214
 URL: https://issues.apache.org/jira/browse/LUCENE-5214
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spellchecker
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.0, 4.6

 Attachments: LUCENE-5214.patch, LUCENE-5214.patch


 The current suggesters are all based on a finite space of possible
 suggestions, i.e. the ones they were built on, so they can only
 suggest a full suggestion from that space.
 This means if the current query goes outside of that space then no
 suggestions will be found.
 The goal of FreeTextSuggester is to address this, by giving
 predictions based on an ngram language model, i.e. using the last few
 tokens from the user's query to predict the likely next token.
 I got the idea from this blog post about Google's suggest:
 http://googleblog.blogspot.com/2011/04/more-predictions-in-autocomplete.html
 This is very much still a work in progress, but it seems to be
 working.  I've tested it on the AOL query logs, using an interactive
 tool from luceneutil to show the suggestions, and it seems to work well.
 It's fun to use that tool to explore the word associations...
 I don't think this suggester would be used standalone; rather, I think
 it'd be a fallback for times when the primary suggester fails to find
 anything.  You can see this behavior on google.com, if you type "the
 fast and the ", you see entire queries being suggested, but then if
 the next word you type is "burning" then suddenly you see the
 suggestions are only based on the last word, not the entire query.
 It uses ShingleFilter under the hood to generate the token ngrams
 (once LUCENE-5180 is in, it will be able to properly handle a user query
 that ends with stop-words, e.g. "wizard of "), and then stores the
 ngrams in an FST.
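
To make the ShingleFilter step concrete, here is a minimal sketch (not from the patch) of turning a token stream into word ngrams; the WhitespaceAnalyzer, the "body" field name, and the sample text are placeholders chosen only for illustration:

{code:java}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ShingleDemo {
  public static void main(String[] args) throws IOException {
    // Tokenize a query, then wrap the stream in a ShingleFilter to emit word ngrams.
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_46);
    TokenStream ts = analyzer.tokenStream("body", new StringReader("the fast and the furious"));
    ShingleFilter shingles = new ShingleFilter(ts, 2, 3); // 2-grams and 3-grams; unigrams are also emitted by default
    CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);

    shingles.reset();
    while (shingles.incrementToken()) {
      System.out.println(term.toString()); // e.g. "the fast", "the fast and", "fast and", ...
    }
    shingles.end();
    shingles.close();
  }
}
{code}

These grams, with their counts, are what the suggester aggregates and stores in the FST.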






[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-10-02 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784065#comment-13784065 ]

ASF subversion and git services commented on LUCENE-5214:

Commit 1528517 from [~mikemccand] in branch 'dev/trunk'
[ https://svn.apache.org/r1528517 ]

LUCENE-5214: add FreeTextSuggester




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-10-02 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784071#comment-13784071 ]

ASF subversion and git services commented on LUCENE-5214:

Commit 1528521 from [~mikemccand] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1528521 ]

LUCENE-5214: add FreeTextSuggester




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-10-02 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784221#comment-13784221 ]

ASF subversion and git services commented on LUCENE-5214:

Commit 1528579 from [~mikemccand] in branch 'dev/branches/branch_4x'
[ https://svn.apache.org/r1528579 ]

LUCENE-5214: remove java-7 only @SafeVarargs




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-10-01 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783179#comment-13783179 ]

Robert Muir commented on LUCENE-5214:

+1




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-10-01 Thread Areek Zillur (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783660#comment-13783660 ]

Areek Zillur commented on LUCENE-5214:

Hey Michael, I had a question for you; this may not be the most relevant place to 
ask, but I will anyway.

I was curious why you did not implement the load and store methods for 
your AnalyzingInfixSuggester, rather than building the index in the ctor. Was it 
because they take an Input/OutputStream? What are your thoughts on 
generalizing the interface so that the index can be loaded and stored, as is 
done by all the other suggesters?




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-09-18 Thread Dawid Weiss (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770487#comment-13770487 ]

Dawid Weiss commented on LUCENE-5214:

Pretty cool, thanks Mike.




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-09-17 Thread Michael McCandless (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769459#comment-13769459 ]

Michael McCandless commented on LUCENE-5214:


The build method basically just runs all incoming text through the
indexAnalyzer, appending ShingleFilter on the end to generate the
ngrams.  To aggregate the ngrams it simply writes them to the
offline sorter; this is nice and simple, though somewhat inefficient in
how much transient disk and CPU it needs to sort all the ngrams, but
it works (thanks Rob)!  It may be better to have an in-memory hash
that holds the frequent ngrams, and periodically flushes the long
tail to free up RAM.  But this gets more complex... the current code
is very simple.

After sorting the ngrams, it walks them, counting up how many times
each gram occurred and then adding that to the FST.  Currently, I do
nothing with the surface form, i.e. the suggester only suggests the
analyzed forms, which may be too ... weird?  Though in playing around,
I think the analysis you generally want to do should be very light,
so maybe this is OK.

It can also save the surface form in the FST (I was doing that before;
it's commented out now), but ... how to disambiguate?  Currently it
saves the shortest one.  This also makes the FST even larger.

At lookup time I again just run the query through your analyzer + ShingleFilter,
and then first try to look up 3-grams, failing that 2-grams,
etc.  I need to improve this to do some sort of smoothing like real
ngram language models do; it shouldn't be such a hard backoff.

Anyway, it's great fun playing with the suggester live (using the simplistic
command-line tool in luceneutil, freedb/suggest.py) to explore the
ngram language model.  This is how I discovered LUCENE-5180.
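
For anyone curious, here is a toy sketch of that hard backoff, using a plain in-memory map instead of the FST the patch actually walks; the class and method names are made up for illustration:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy illustration of hard backoff: not the FST-based code in the patch. */
public class BackoffLookupSketch {

  /** model maps a space-joined context (e.g. "wizard of") to candidate next tokens, most frequent first. */
  public static List<String> suggest(Map<String, List<String>> model, List<String> tokens, int maxGram) {
    // Try the longest context first (the last maxGram-1 tokens), then back off to shorter ones.
    for (int n = Math.min(maxGram - 1, tokens.size()); n >= 1; n--) {
      StringBuilder context = new StringBuilder();
      for (int i = tokens.size() - n; i < tokens.size(); i++) {
        if (context.length() > 0) {
          context.append(' ');
        }
        context.append(tokens.get(i));
      }
      List<String> candidates = model.get(context.toString());
      if (candidates != null && !candidates.isEmpty()) {
        return candidates; // first non-empty order wins: no blending across orders
      }
    }
    return Collections.emptyList();
  }
}
{code}

A smoothed model would instead blend scores from the different orders (e.g. discounting a lower-order match) rather than returning only the first order that happens to match.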





[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-09-16 Thread Dawid Weiss (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13768737#comment-13768737 ]

Dawid Weiss commented on LUCENE-5214:

I looked through the patch but I didn't get it -- too late ;) I'll give it 
another shot later.

Anyway, the idea is very interesting -- I wonder how much left-context 
(regardless of this implementation) one needs for the right prediction (it reminds 
me of Markov chains and generative poetry :)




[jira] [Commented] (LUCENE-5214) Add new FreeTextSuggester, to handle long tail suggestions

2013-09-15 Thread Robert Muir (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767820#comment-13767820 ]

Robert Muir commented on LUCENE-5214:

This looks awesome: I think LUCENE-5180 will resolve a lot of the TODOs?

I'm glad these corner cases of trailing stopwords etc. were fixed properly in 
the analysis chain.

And I like the name...
