[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-05-01 Thread Alexey Kudinov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646468#comment-13646468
 ] 

Alexey Kudinov commented on LUCENE-3842:


I tried building the analyzing suggester model from an external file containing 
1mln short phrases taken from Wikipedia titles.
2Gb of memory seems not enough, it runs forever and dies with OOM.
What is the expected dictionary size? What is the benchmark behavior?

Thanks!

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-05-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646511#comment-13646511
 ] 

Michael McCandless commented on LUCENE-3842:


The building process is unfortunately RAM intensive, but there are 
settings/knobs in the FST Builder API to tradeoff RAM required during building 
vs how small the resulting FST is.  Maybe we need to expose control for these 
in AnalyzingSuggester ...

Can you share those 1M short phrases?  What is the total number of characters 
across them?


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-05-01 Thread Alexey Kudinov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646543#comment-13646543
 ] 

Alexey Kudinov commented on LUCENE-3842:


Setting maxGraphExpansions to some value  0 (say, 30) ends with null reference 
exception. paths is null here:
maxAnalyzedPathsForOneInput = Math.max(maxAnalyzedPathsForOneInput, 
paths.size());
Fixing this, the model loads after a while.
With maxGraphExpansions  0 it doesn't load regardless the dictionary size.
I'm using the wordnet synonyms, so I guess this causes a lot of paths, I 
suspect loops.
The total dictionary file size is about 20Mb, but this doesn't really matter as 
I get similar behavior for even smaller one (2Mb).
The dataset is from here: http://wiki.dbpedia.org/Downloads32, Titles in 
english. I took the values only and tried different sizes (10mln-1mln-0.1mln).



 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-05-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646559#comment-13646559
 ] 

Michael McCandless commented on LUCENE-3842:


bq. I'm using the wordnet synonyms, so I guess this causes a lot of paths, I 
suspect loops.

A :)  Yes this will cause lots of expansions / RAM used.

But NPE because paths is null sounds like a real bug.

OK I see why it's happening ... when we try to enumerate all finite strings 
from the expanded graph, if it exceeds the limit (maxGraphExpansions), 
SpecialOperations.getFiniteStrings returns null but the code assumes it will 
return the N finite strings it had found so far.  Can you open a new issue 
for this?  We should separately fix it.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-05-01 Thread Alexey Kudinov (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646574#comment-13646574
 ] 

Alexey Kudinov commented on LUCENE-3842:


I opened an issue for NPE - LUCENE-4971

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-05-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13646610#comment-13646610
 ] 

Michael McCandless commented on LUCENE-3842:


Thank you Alexey!

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-03-22 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13610728#comment-13610728
 ] 

Commit Tag Bot commented on LUCENE-3842:


[branch_4x commit] Michael McCandless
http://svn.apache.org/viewvc?view=revisionrevision=1391704

LUCENE-3842: refactor: don't make spooky State methods public


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-03-22 Thread Commit Tag Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13610730#comment-13610730
 ] 

Commit Tag Bot commented on LUCENE-3842:


[branch_4x commit] Michael McCandless
http://svn.apache.org/viewvc?view=revisionrevision=1391686

LUCENE-3842: add AnalyzingSuggester


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: [jira] [Commented] (LUCENE-3842) Analyzing Suggester

2013-03-22 Thread Uwe Schindler
What is going on here? Those are very old issues and very old commits!

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


 -Original Message-
 From: Commit Tag Bot (JIRA) [mailto:j...@apache.org]
 Sent: Friday, March 22, 2013 5:31 PM
 To: dev@lucene.apache.org
 Subject: [jira] [Commented] (LUCENE-3842) Analyzing Suggester
 
 
 [ https://issues.apache.org/jira/browse/LUCENE-
 3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
 tabpanelfocusedCommentId=13610728#comment-13610728 ]
 
 Commit Tag Bot commented on LUCENE-3842:
 
 
 [branch_4x commit] Michael McCandless
 http://svn.apache.org/viewvc?view=revisionrevision=1391704
 
 LUCENE-3842: refactor: don't make spooky State methods public
 
 
  Analyzing Suggester
  ---
 
  Key: LUCENE-3842
  URL: https://issues.apache.org/jira/browse/LUCENE-3842
  Project: Lucene - Core
   Issue Type: New Feature
   Components: modules/spellchecker
 Affects Versions: 3.6, 4.0-ALPHA
 Reporter: Robert Muir
 Assignee: Michael McCandless
  Fix For: 4.1, 5.0
 
  Attachments: LUCENE-3842.patch, LUCENE-3842.patch,
  LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
  LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
  LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
  LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
  LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch,
  LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch
 
 
  Since we added shortest-path wFSA search in LUCENE-3714, and
  generified the comparator in LUCENE-3801, I think we should look at
 implementing suggesters that have more capabilities than just basic prefix
 matching.
  In particular I think the most flexible approach is to integrate with
  Analyzer at both build and query time, such that we build a wFST with:
  input: analyzed text such as ghost0christmas0past -- byte 0 here is
  an optional token separator
  output: surface form such as the ghost of christmas past
  weight: the weight of the suggestion
  we make an FST with PairOutputsweight,output, but only do the
  shortest path operation on the weight side (like the test in LUCENE-3801),
 at the same time accumulating the output (surface form), which will be the
 actual suggestion.
  This allows a lot of flexibility:
  * Using even standardanalyzer means you can offer suggestions that
 ignore stopwords, e.g. if you type in ghost of chr...,
it will suggest the ghost of christmas past
  * we can add support for synonyms/wdf/etc at both index and query time
  (there are tradeoffs here, and this is not implemented!)
  * this is a basis for more complicated suggesters such as Japanese
 suggesters, where the analyzed form is in fact the reading,
so we would add a TokenFilter that copies ReadingAttribute into term text
 to support that...
  * other general things like offering suggestions that are more fuzzy like
 using a plural stemmer or ignoring accents or whatever.
  According to my benchmarks, suggestions are still very fast with the
  prototype (e.g. ~ 100,000 QPS), and the FST size does not explode (its short
 of twice that of a regular wFST, but this is still far smaller than TST or 
 JaSpell,
 etc).
 
 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA administrators
 For more information on JIRA, see: http://www.atlassian.com/software/jira
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
 commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-11-10 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494793#comment-13494793
 ] 

David Smiley commented on LUCENE-3842:
--

That TokenStreamToAutomaton is cool Mike!  I can put that to use in my FST text 
tagger work.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Fix For: 4.1, 5.0

 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-10-04 Thread Sudarshan Gaikaiwari (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13469608#comment-13469608
 ] 

Sudarshan Gaikaiwari commented on LUCENE-3842:
--

+1. This is awesome. It would be great to get this in trunk.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-10-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13469612#comment-13469612
 ] 

Michael McCandless commented on LUCENE-3842:


Thanks Sudarshan!  It's actually already committed (will be in 4.1) ... I just 
forgot to resolve ...

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465767#comment-13465767
 ] 

Robert Muir commented on LUCENE-3842:
-

in TStoA:
{code}
if (pos == -1  posInc == 0) {
  // TODO: hmm are TS's still allowed to do this...?
  posInc = 1;
}
{code}

NO they are not! :)

As far as the limitations, i feel like if the last token's endOffset != length 
of input
that might be pretty safe in general (e.g. standardtokenizer) because of how 
unicode
works... i have to think about it.

strange the FST size increased so much? If i run the benchmark:
{noformat}
[junit4:junit4]   2 -- RAM consumption
[junit4:junit4]   2 JaspellLookup   size[B]:9,815,152
[junit4:junit4]   2 TSTLookup   size[B]:9,858,792
[junit4:junit4]   2 FSTCompletionLookup size[B]:  466,520
[junit4:junit4]   2 WFSTCompletionLookup size[B]:  507,640
[junit4:junit4]   2 AnalyzingCompletionLookup size[B]:1,832,952
{noformat}

I don't know if we should worry about that, but it seems somewhat large
for just using KeywordTokenizer.

{code}
 * bNOTE/b: Although the {@link TermFreqIterator} API specifies
 * floating point weights
{code}

Thats obselete. See WFSTSuggester in trunk where I fixed this already.



 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465778#comment-13465778
 ] 

Michael McCandless commented on LUCENE-3842:


Thanks Rob, good feedback ... I'll post new patch changing that posInc check to 
an assert, and removing that obsolete NOTE.

{quote}
As far as the limitations, i feel like if the last token's endOffset != length 
of input
that might be pretty safe in general (e.g. standardtokenizer) because of how 
unicode
works... i have to think about it.
{quote}

I think we should try that!  This way the suggester can guess whether the 
input text is still inside the last token.

But this won't help the StopFilter case, ie if user types 'a' then StopFilter 
will still delete it even though the token isn't done (ie maybe user intends 
to type 'apple').

Still it's progress so I think we should try it ...

I'm not sure why FST is so much larger ... the outputs should share very well 
with KeywordTokenizer ... hmm what weights do we use for the benchmark?

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465865#comment-13465865
 ] 

Robert Muir commented on LUCENE-3842:
-

Can we split Analyzer into indexAnalyzer and queryAnalyzer?

Can we also add 1 or 2 sugar ctors that use default values?

I'm thinking:
{code}
ctor(Analyzer analyzer) {
  this(analyzer, analyzer);
}

ctor(Analyzer index, Analyzer query) {
  this(index, query, default, default, default);
}

ctor(Analyzer index, Analyzer query, int option, int option, int option) {
  // this is the full ctor!
}
{code}

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-28 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465934#comment-13465934
 ] 

Robert Muir commented on LUCENE-3842:
-

+1 thanks Mike. lets get it in!

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Assignee: Michael McCandless
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457552#comment-13457552
 ] 

Robert Muir commented on LUCENE-3842:
-

{quote}
I think it would be better/cleaner to append unique (disambiguating) bytes to 
the end of the analyzed bytes (this was Robert's original idea): then each path 
is a single result. The only downside I can think of is we will have to reserve 
a byte (0xFF?), ie we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, 
... but since these input BytesRefs are typically UTF-8 ... this seems not so 
bad? Then can of course in general be arbitrary bytes since they are produced 
by the analysis process...
{quote}

I don't understand why we have to reserve any bytes. We can append arbitrary 
bytes of any sort to the end of the input side, this will have no effect on the 
actual surface form that we suggest.


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457556#comment-13457556
 ] 

Robert Muir commented on LUCENE-3842:
-

and as far as exactFirst: lets just keep it simple and have it as a surface 
form comparison?

This is really what I think most people will expect anyway: in the case of 
duplicates and exactFirst,
nobody really cares nor sees the underlying analyzed form.

So I don't think we should have multiple outputs for the FST here.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457562#comment-13457562
 ] 

Robert Muir commented on LUCENE-3842:
-

Sorry mike, I'm still catching up and sadly just creating a lot of noise and 
thinking out loud. 

I think you actually have a point with reserving a byte when there are 
duplicates :)

But at the same time i still think the surface form is valuable for this option 
too... when we reserve a byte can we also sort the 
duplicate outputs up front, in such a way that we can start traversing the 
output side to look for an exactly-matching surface form?

So its like within that 'subtree' of the FST (the duplicates for the exact 
input), we can binary search?

Otherwise exactFirst would be inefficient in some cases (as we have to do a 
special 'walk' here). Christian Moen
showed me some scary stuff on his Japanese phone as far as readings-kanji 
forms ... i think if we can it might be 
good to be cautious and keep this fast...


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-16 Thread Bill Bell (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456728#comment-13456728
 ] 

Bill Bell commented on LUCENE-3842:
---

A common Suggester use case is to boost the results by closest (auto suggest 
the whole USA but boost the results in the suggester by geodistance). WOuld 
love to get faster response with that. At the Lucene Revolution 2012 in Boston 
a speaker did discuss using WFST to do this, but I have yet to figure out how 
to do it).


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456732#comment-13456732
 ] 

Robert Muir commented on LUCENE-3842:
-

Bill can you start a different thread for that. Its unrelated to this issue.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456443#comment-13456443
 ] 

Michael McCandless commented on LUCENE-3842:


OK I created branch: 
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3842

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456468#comment-13456468
 ] 

Robert Muir commented on LUCENE-3842:
-

Thanks for resurrecting this from the dead! I had forgotten just how fun this 
issue was :)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-09-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456477#comment-13456477
 ] 

Robert Muir commented on LUCENE-3842:
-

+1 for that. lets keep this as simple as possible and leave the responsibility 
to the analyzer as much as possible.

My main concern for the PRESERVE_SEPS was for the japanese use case: we don't 
much care what the actual tokenization
of the japanese words was, only the concatenated reading string. If the 
tokenization is a little off but the concatenation
of all the readings is still correct, then we are ok. So it makes it more 
robust against tokenization differences,
especially considering its partial inputs going into this thing (not whole 
words)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-06-30 Thread Sudarshan Gaikaiwari (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404537#comment-13404537
 ] 

Sudarshan Gaikaiwari commented on LUCENE-3842:
--

bq. maybe we can tie-break instead by the surface form?

The FST construction guarantees that the input paths leading to different nodes 
are unique, while I don't think we have such a guarantee about the surface form.

I have attached a patch that modifies the intersectPrefixPaths method to keep 
track of the input paths as the automaton and FST are intersected. Please let 
me know if that look ok to you?

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-06-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287369#comment-13287369
 ] 

Michael McCandless commented on LUCENE-3842:


Hi Sudarshan, thanks for raising this ... I'll have a look...

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-30 Thread Sudarshan Gaikaiwari (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286249#comment-13286249
 ] 

Sudarshan Gaikaiwari commented on LUCENE-3842:
--

Hi Michael,

Thanks a lot for opening up Util.shortestPaths, now that I can seed the queue 
with the intial nodes using addStartPaths the performance of the 
GeoSpatialSuggest that I presented at Lucene Revolution has been improved by 2x.

While migrating my code to use this patch, I noticed that I would hit the 
following assertion in addIfCompetitive.

{code}

  path.input.length--;
  assert cmp != 0;
  if (cmp  0) {
{code}

This assert fires when it is not possible to differentiate between the path 
that we are trying to add to the queue and the bottom. This happens because the 
different paths that lead to FST nodes during the automata FST intersection are 
not stored. So the inputpath used to differentiate path contains only the 
characters that have been consumed from one of the initial FST nodes.

From your comments for the addStartPaths method I think that you have foreseen 
this problem.

{code}
// nocommit this should also take the starting
// weight...?

/** Adds all leaving arcs, including 'finished' arc, if
 *  the node is final, from this node into the queue.  */
public void addStartPaths(FST.ArcT node, T startOutput, boolean 
allowEmptyString) throws IOException {
{code}

Here is a unit test that causes the assert to be triggered.

{code}
  public void testInputPathRequired() throws Exception {
TermFreq keys[] = new TermFreq[] {
new TermFreq(fast ghost, 50),
new TermFreq(quick gazelle, 50),
new TermFreq(fast ghoul, 50),
new TermFreq(fast gizzard, 50),
};

SynonymMap.Builder b = new SynonymMap.Builder(false);
b.add(new CharsRef(fast), new CharsRef(quick), true);
final SynonymMap map = b.build();

final Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader 
reader) {
Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.SIMPLE, 
true);
TokenStream stream = new SynonymFilter(tokenizer, map, true);
return new TokenStreamComponents(tokenizer, new 
RemoveDuplicatesTokenFilter(stream));
  }
};
AnalyzingCompletionLookup suggester = new 
AnalyzingCompletionLookup(analyzer);
suggester.build(new TermFreqArrayIterator(keys));
ListLookupResult results = suggester.lookup(fast g, false, 2);
  }
{code}

Please let me know if the above analysis looks correct to you and I will start 
trying to fix this by storing paths during the FST automata intersection.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice 

[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-17 Thread Sudarshan Gaikaiwari (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278166#comment-13278166
 ] 

Sudarshan Gaikaiwari commented on LUCENE-3842:
--

I was not able to apply the latest patch cleanly

{quote}
smg@dev21:~/lucene_trunk$ patch -p0  ~/LUCENE-3842.patch 
patching file 
lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java
patching file 
lucene/core/src/java/org/apache/lucene/util/automaton/SpecialOperations.java
patching file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java
Hunk #1 FAILED at 109.
1 out of 1 hunk FAILED -- saving rejects to file 
lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java.rej
patching file lucene/core/src/java/org/apache/lucene/util/fst/Util.java
patching file 
lucene/core/src/java/org/apache/lucene/analysis/TokenStreamToAutomaton.java
patching file 
lucene/core/src/test/org/apache/lucene/analysis/TestGraphTokenizers.java
patching file 
lucene/core/src/test/org/apache/lucene/util/automaton/TestSpecialOperations.java
patching file 
lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionTest.java
patching file 
lucene/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java
patching file 
lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionLookup.java
patching file 
lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/FSTUtil.java
{quote}

I needed to copy RollingBuffer.java to from test-framework to core for the 
patch to apply cleanly.

{quote}
cp lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java 
lucene/core/src/java/org/apache/lucene/util/
{quote}




 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278267#comment-13278267
 ] 

Michael McCandless commented on LUCENE-3842:


Hi Sudarshan, sorry that was my bad: I had svn mv'd RollingBuffer but when I 
created the patch I failed to pass --show-copies-as-adds to svn ... so you have 
to do that mv yourself before applying the patch...

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13275895#comment-13275895
 ] 

Robert Muir commented on LUCENE-3842:
-

Nice catch on the exactFirst dup problem!

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13276326#comment-13276326
 ] 

Robert Muir commented on LUCENE-3842:
-

When running the benchmark (LookupBenchmarkTest) i noticed the FST size has 
increased since the original patch. I wonder why this is? the benchmark uses 
KeywordAnalyzer... it could be (likely even) that the original patch had a bug 
and now its correct, but maybe its worth investigation...

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13274068#comment-13274068
 ] 

Michael McCandless commented on LUCENE-3842:


OK, when I pass false for enablePositionIncrements to MockAnalyzer in 
testStandard then both cases pass... and I added a 3rd case testing for ghost 
chris and it also passes... cool!

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13273419#comment-13273419
 ] 

Robert Muir commented on LUCENE-3842:
-

testStandard is also bogus: it has 2 asserts. the first one should pass, but 
the second one should
really only work if you disable positionincrements in the (mock) stopfilter.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-05-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13272413#comment-13272413
 ] 

Robert Muir commented on LUCENE-3842:
-

i see the problem. it actually happens on the second term (we have ghost/2 
christmas/2).
The problem is it tries to find the last state to connect the new node, but it 
uses
a hashmap based on position for that... so if there are holes this returns null.

I think for this code we would add nodes for holes (text=POS_SEP) to simplify 
the logic?

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-14 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229584#comment-13229584
 ] 

Robert Muir commented on LUCENE-3842:
-

I also don't think we really need this generic getFiniteStrings. its just to 
get it off the ground.
we can just write the possibilities on the fly i think and it will be simpler...

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch, LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221844#comment-13221844
 ] 

Michael McCandless commented on LUCENE-3842:


Once posLength is in, I think a very simple way to handle multiple paths at 
query time is to open up the TopNSearcher class in oal.fst.Util.

Currently the API only allows you to pass in a single starting FST node, but we 
can easily improve this by adding eg a addStartNode(FST.ArcT, int 
startingCost) instead.  This way the app could create a TopNSearcher, add any 
number of start nodes with the initial path cost, then call .search() to get 
the best completions.

The only limitation of this is that all differences must be pre-computed as an 
initial path cost that's consistent with how the path costs are accumulated 
(with the Outputs.add) during searching; I'm not sure if that'd be overly 
restrictive here?

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Dawid Weiss (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221929#comment-13221929
 ] 

Dawid Weiss commented on LUCENE-3842:
-

bq. Patch with a static utility method to translate a TokenStream to a 
byte-by-byte automaton.

I looked at the patch but I don't fully get what it does. Looks like a 
combination of state sequence unions, am I right?

bq. Brics has some code for this (puts all the accepted strings into a set). 

It's probably a naive walk with an acceptor. I've always wanted to see what 
Brics returns from that method for an automaton equivalent to .* :)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221930#comment-13221930
 ] 

Robert Muir commented on LUCENE-3842:
-

{quote}
I've always wanted to see what Brics returns from that method for an automaton 
equivalent to .* 
{quote}

Oh, so it is only for finite automata, so it returns null in that case:

http://www.brics.dk/automaton/doc/dk/brics/automaton/SpecialOperations.html#getFiniteStrings%28dk.brics.automaton.Automaton%29

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221932#comment-13221932
 ] 

Robert Muir commented on LUCENE-3842:
-

{quote}
I looked at the patch but I don't fully get what it does. Looks like a 
combination of state sequence unions, am I right?
{quote}

Well the conversion should ultimately be more useful for the suggester to 
intersect with the FST than a tokenstream, because a tokenstream is like a 
word-level automaton, if dogs is a synonym for dog, then we have:
smelly dog|dogs(positionIncrement=0). 

So for our intersection, we would prefer it to be a deterministic at 
'character' (byte) level instead. So the conversion should produce an automaton 
of: smelly dog(s?) in regex notation... this is easier to work with.

at index time its useful too, because in the FST we only care about all the 
possible byte strings, so this should be easier to enumerate than a tokenstream 
(especially if you consider multiword synonyms, decompounded terms etc where 
some span across many).


 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Dawid Weiss (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221934#comment-13221934
 ] 

Dawid Weiss commented on LUCENE-3842:
-

Ok, get it, thanks. I wonder if it's always possible, but I bet you can write a 
random test to ensure that :)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Dawid Weiss (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221935#comment-13221935
 ] 

Dawid Weiss commented on LUCENE-3842:
-

It also occurred to me that it would be interesting to have a naive 
minimalization technique for Brics Automata which would (for automata with a 
finite language):
- enumarate the language
- sort
- build the minimal automaton

While it may seem like an idea with crazy overhead it may actually be a viable 
alternative to minimization algorithms in Brics for very large automata. 
Interesting.

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221936#comment-13221936
 ] 

Robert Muir commented on LUCENE-3842:
-

Dawid: hmm, well the transitions in brics are ranges in sorted order, so, if 
its finite, couldn't we just enumerate the language in sorted order while 
building the minimal automaton incrementally in parallel?

Or am i missing something... its sunday... :)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-04 Thread Dawid Weiss (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221940#comment-13221940
 ] 

Dawid Weiss commented on LUCENE-3842:
-

Yep, sure you could (I admit I haven't looked at Brics in a long time so I 
don't remember the details, but I do remember the overhead was significant on 
optimization; this was a while ago - maybe it's improved).

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, 
 LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-03 Thread Robert Muir (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221691#comment-13221691
 ] 

Robert Muir commented on LUCENE-3842:
-

also just so i dont forget, we should allow separate specification of 
index-time and query-time analyzers...
because in cases where you are adding synonyms/wdf/etc there is a tradeoff 
(bigger FST, versus slower query-time perf)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3842) Analyzing Suggester

2012-03-03 Thread Michael McCandless (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13221748#comment-13221748
 ] 

Michael McCandless commented on LUCENE-3842:


VERY cool :)

 Analyzing Suggester
 ---

 Key: LUCENE-3842
 URL: https://issues.apache.org/jira/browse/LUCENE-3842
 Project: Lucene - Java
  Issue Type: New Feature
  Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
 Attachments: LUCENE-3842.patch


 Since we added shortest-path wFSA search in LUCENE-3714, and generified the 
 comparator in LUCENE-3801,
 I think we should look at implementing suggesters that have more capabilities 
 than just basic prefix matching.
 In particular I think the most flexible approach is to integrate with 
 Analyzer at both build and query time,
 such that we build a wFST with:
 input: analyzed text such as ghost0christmas0past -- byte 0 here is an 
 optional token separator
 output: surface form such as the ghost of christmas past
 weight: the weight of the suggestion
 we make an FST with PairOutputsweight,output, but only do the shortest path 
 operation on the weight side (like
 the test in LUCENE-3801), at the same time accumulating the output (surface 
 form), which will be the actual suggestion.
 This allows a lot of flexibility:
 * Using even standardanalyzer means you can offer suggestions that ignore 
 stopwords, e.g. if you type in ghost of chr...,
   it will suggest the ghost of christmas past
 * we can add support for synonyms/wdf/etc at both index and query time (there 
 are tradeoffs here, and this is not implemented!)
 * this is a basis for more complicated suggesters such as Japanese 
 suggesters, where the analyzed form is in fact the reading,
   so we would add a TokenFilter that copies ReadingAttribute into term text 
 to support that...
 * other general things like offering suggestions that are more fuzzy like 
 using a plural stemmer or ignoring accents or whatever.
 According to my benchmarks, suggestions are still very fast with the 
 prototype (e.g. ~ 100,000 QPS), and the FST size does not
 explode (its short of twice that of a regular wFST, but this is still far 
 smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org