[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646468#comment-13646468 ] Alexey Kudinov commented on LUCENE-3842: I tried building the analyzing suggester model from an external file containing 1 million short phrases taken from Wikipedia titles. 2 GB of memory seems not to be enough: it runs for a very long time and dies with an OOM. What is the expected dictionary size? What is the benchmark behavior? Thanks!

Analyzing Suggester
---
Key: LUCENE-3842 URL: https://issues.apache.org/jira/browse/LUCENE-3842 Project: Lucene - Core Issue Type: New Feature Components: modules/spellchecker Affects Versions: 3.6, 4.0-ALPHA Reporter: Robert Muir Assignee: Michael McCandless Fix For: 4.1, 5.0 Attachments: LUCENE-3842.patch (18 revisions), LUCENE-3842-TokenStream_to_Automaton.patch

Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching. In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:
input: analyzed text such as ghost0christmas0past -- byte 0 here is an optional token separator
output: surface form such as "the ghost of christmas past"
weight: the weight of the suggestion

We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.
This allows a lot of flexibility:
* Using even StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...", it will suggest "the ghost of christmas past"
* we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
* this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading, so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
* other general things like offering suggestions that are more fuzzy, like using a plural stemmer or ignoring accents or whatever.

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
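The analyzed-key scheme above can be illustrated with a small self-contained toy (this is not the Lucene API; the class, the TreeMap stand-in for the FST, and the hard-coded stopword "analyzer" are all made up for illustration): analysis lowercases and drops stopwords, tokens are joined with a \u0000 separator to form the key, and a lookup prefix-matches on the analyzed key while ranking surface forms by weight.

```java
import java.util.*;

public class ToySuggester {
    private static final Set<String> STOPWORDS = Set.of("the", "of", "a");
    // analyzed key -> (surface form, weight); a sorted map stands in for the FST
    private final TreeMap<String, Map.Entry<String, Long>> fst = new TreeMap<>();

    // Toy "analysis": lowercase, drop stopwords, join tokens with the \u0000 separator
    static String analyze(String text) {
        StringBuilder key = new StringBuilder();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(tok)) continue;
            if (key.length() > 0) key.append('\u0000');
            key.append(tok);
        }
        return key.toString();
    }

    void add(String surface, long weight) {
        fst.put(analyze(surface), Map.entry(surface, weight));
    }

    // Prefix-match the analyzed query; return surface forms, highest weight first
    List<String> lookup(String query) {
        String prefix = analyze(query);
        List<Map.Entry<String, Long>> hits = new ArrayList<>();
        for (Map.Entry<String, Map.Entry<String, Long>> e : fst.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break; // past the matching range
            hits.add(e.getValue());
        }
        hits.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Long> h : hits) out.add(h.getKey());
        return out;
    }

    public static void main(String[] args) {
        ToySuggester s = new ToySuggester();
        s.add("the ghost of christmas past", 50);
        s.add("ghost stories", 10);
        // stopwords are ignored at query time too, so "ghost of chr" matches
        System.out.println(s.lookup("ghost of chr")); // [the ghost of christmas past]
    }
}
```

Because both sides go through the same analysis, "ghost of chr" and "ghost chr" produce the same analyzed prefix, which is exactly the stopword-ignoring behavior described in the issue.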
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646511#comment-13646511 ] Michael McCandless commented on LUCENE-3842: The building process is unfortunately RAM intensive, but there are settings/knobs in the FST Builder API to trade off RAM required during building vs how small the resulting FST is. Maybe we need to expose control for these in AnalyzingSuggester ... Can you share those 1M short phrases? What is the total number of characters across them?
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646543#comment-13646543 ] Alexey Kudinov commented on LUCENE-3842: Setting maxGraphExpansions to some value > 0 (say, 30) ends with a null reference exception; paths is null here: maxAnalyzedPathsForOneInput = Math.max(maxAnalyzedPathsForOneInput, paths.size()); Fixing this, the model loads after a while. With maxGraphExpansions < 0 it doesn't load regardless of the dictionary size. I'm using the wordnet synonyms, so I guess this causes a lot of paths; I suspect loops. The total dictionary file size is about 20 MB, but this doesn't really matter, as I get similar behavior for an even smaller one (2 MB). The dataset is from here: http://wiki.dbpedia.org/Downloads32, Titles in English. I took the values only and tried different sizes (10M / 1M / 0.1M).
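The crash site can be reduced to a standalone sketch (hypothetical code, not the actual AnalyzingSuggester source): when the finite-string enumeration gives back null because the expansion limit was exceeded, the caller dereferences it, so a null guard avoids the NPE.

```java
import java.util.Set;

public class PathGuard {
    static int maxAnalyzedPathsForOneInput = 0;

    // 'paths' may be null when the graph-expansion limit was exceeded;
    // guard before calling paths.size() instead of throwing an NPE.
    static void countPaths(Set<String> paths) {
        if (paths == null) {
            return; // limit exceeded: skip this input
        }
        maxAnalyzedPathsForOneInput = Math.max(maxAnalyzedPathsForOneInput, paths.size());
    }

    public static void main(String[] args) {
        countPaths(Set.of("a", "b"));
        countPaths(null); // previously: NullPointerException
        System.out.println(maxAnalyzedPathsForOneInput); // 2
    }
}
```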
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646559#comment-13646559 ] Michael McCandless commented on LUCENE-3842: bq. I'm using the wordnet synonyms, so I guess this causes a lot of paths, I suspect loops. Ah :) Yes, this will cause lots of expansions / RAM used. But an NPE because paths is null sounds like a real bug. OK, I see why it's happening ... when we try to enumerate all finite strings from the expanded graph, if it exceeds the limit (maxGraphExpansions), SpecialOperations.getFiniteStrings returns null, but the code assumes it will return the N finite strings it had found so far. Can you open a new issue for this? We should fix it separately.
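A standalone sketch of the behavior the caller expected (toy code, not Lucene's SpecialOperations; the graph encoding here is invented for illustration): enumerate the finite strings of a small expansion graph, and when the limit is hit, return the strings collected so far instead of null.

```java
import java.util.*;

public class FiniteStrings {
    // Toy graph: node -> arcs of (label, target); a node with no arcs is final.
    static Set<String> finiteStrings(Map<Integer, List<Map.Entry<String, Integer>>> graph,
                                     int start, int limit) {
        Set<String> out = new LinkedHashSet<>();
        collect(graph, start, "", out, limit);
        return out; // partial if the limit was hit, but never null
    }

    static void collect(Map<Integer, List<Map.Entry<String, Integer>>> graph, int node,
                        String prefix, Set<String> out, int limit) {
        if (out.size() >= limit) return; // stop enumerating, keep what we have
        List<Map.Entry<String, Integer>> arcs = graph.get(node);
        if (arcs == null || arcs.isEmpty()) { // final state: emit the string
            out.add(prefix);
            return;
        }
        for (Map.Entry<String, Integer> arc : arcs) {
            collect(graph, arc.getValue(), prefix + arc.getKey(), out, limit);
        }
    }

    public static void main(String[] args) {
        // A 3-way synonym expansion: {fast|quick|speedy} dog
        Map<Integer, List<Map.Entry<String, Integer>>> g = new HashMap<>();
        g.put(0, List.of(Map.entry("fast ", 1), Map.entry("quick ", 1), Map.entry("speedy ", 1)));
        g.put(1, List.of(Map.entry("dog", 2)));
        System.out.println(finiteStrings(g, 0, 10)); // [fast dog, quick dog, speedy dog]
        System.out.println(finiteStrings(g, 0, 2));  // limit hit: [fast dog, quick dog]
    }
}
```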
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646574#comment-13646574 ] Alexey Kudinov commented on LUCENE-3842: I opened an issue for the NPE: LUCENE-4971
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646610#comment-13646610 ] Michael McCandless commented on LUCENE-3842: Thank you, Alexey!
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610728#comment-13610728 ] Commit Tag Bot commented on LUCENE-3842: [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1391704 LUCENE-3842: refactor: don't make spooky State methods public
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610730#comment-13610730 ] Commit Tag Bot commented on LUCENE-3842: [branch_4x commit] Michael McCandless http://svn.apache.org/viewvc?view=revision&revision=1391686 LUCENE-3842: add AnalyzingSuggester
RE: [jira] [Commented] (LUCENE-3842) Analyzing Suggester
What is going on here? Those are very old issues and very old commits!
- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de

-----Original Message----- From: Commit Tag Bot (JIRA) [mailto:j...@apache.org] Sent: Friday, March 22, 2013 5:31 PM To: dev@lucene.apache.org Subject: [jira] [Commented] (LUCENE-3842) Analyzing Suggester
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494793#comment-13494793 ] David Smiley commented on LUCENE-3842: That TokenStreamToAutomaton is cool, Mike! I can put that to use in my FST text tagger work.
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13469608#comment-13469608 ] Sudarshan Gaikaiwari commented on LUCENE-3842:
--
+1. This is awesome. It would be great to get this in trunk.
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13469612#comment-13469612 ] Michael McCandless commented on LUCENE-3842:
--
Thanks Sudarshan! It's actually already committed (will be in 4.1) ... I just forgot to resolve ...
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465767#comment-13465767 ] Robert Muir commented on LUCENE-3842:
--
In TStoA:
{code}
if (pos == -1 && posInc == 0) {
  // TODO: hmm are TS's still allowed to do this...?
  posInc = 1;
}
{code}
NO they are not! :)

As far as the limitations, I feel like if the last token's endOffset != length of input, that might be pretty safe in general (e.g. StandardTokenizer) because of how Unicode works... I have to think about it.

Strange that the FST size increased so much. If I run the benchmark:
{noformat}
[junit4:junit4] 2 -- RAM consumption
[junit4:junit4] 2 JaspellLookup size[B]: 9,815,152
[junit4:junit4] 2 TSTLookup size[B]: 9,858,792
[junit4:junit4] 2 FSTCompletionLookup size[B]: 466,520
[junit4:junit4] 2 WFSTCompletionLookup size[B]: 507,640
[junit4:junit4] 2 AnalyzingCompletionLookup size[B]: 1,832,952
{noformat}
I don't know if we should worry about that, but it seems somewhat large for just using KeywordTokenizer.
{code}
 * <b>NOTE</b>: Although the {@link TermFreqIterator} API specifies
 * floating point weights
{code}
That's obsolete. See WFSTSuggester in trunk where I fixed this already.
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465778#comment-13465778 ] Michael McCandless commented on LUCENE-3842:
--
Thanks Rob, good feedback ... I'll post a new patch changing that posInc check to an assert, and removing that obsolete NOTE.
{quote}
As far as the limitations, i feel like if the last token's endOffset != length of input that might be pretty safe in general (e.g. standardtokenizer) because of how unicode works... i have to think about it.
{quote}
I think we should try that! This way the suggester can guess whether the input text is still inside the last token. But this won't help the StopFilter case, ie if user types 'a' then StopFilter will still delete it even though the token isn't done (ie maybe user intends to type 'apple'). Still it's progress so I think we should try it ... I'm not sure why FST is so much larger ... the outputs should share very well with KeywordTokenizer ... hmm what weights do we use for the benchmark?
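The endOffset heuristic discussed above (treat the last token as possibly incomplete only when its endOffset reaches the end of the raw input) can be sketched standalone. The toy whitespace "tokenizer" here is illustrative, not Lucene's; the point is only the endOffset-vs-input-length comparison:

```java
// If the last token's endOffset equals the input length, the user may still
// be typing that token, so it should be matched as a prefix. If tokenization
// stopped earlier (e.g. the input ends with a space), the last token is
// complete and should be matched exactly. As noted above, this cannot help
// the StopFilter case, where a partially typed token is deleted outright.
public class LastTokenCheck {
    static boolean lastTokenMayBeIncomplete(String input) {
        // toy whitespace tokenizer: the last token's endOffset is the input
        // length minus any trailing whitespace
        int end = input.length();
        while (end > 0 && Character.isWhitespace(input.charAt(end - 1))) end--;
        return end > 0 && end == input.length();
    }

    public static void main(String[] args) {
        System.out.println(lastTokenMayBeIncomplete("ghost of chr")); // true: still typing "chr..."
        System.out.println(lastTokenMayBeIncomplete("ghost of "));    // false: last token finished
    }
}
```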
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465865#comment-13465865 ] Robert Muir commented on LUCENE-3842:
--
Can we split Analyzer into indexAnalyzer and queryAnalyzer? Can we also add 1 or 2 sugar ctors that use default values? I'm thinking:
{code}
ctor(Analyzer analyzer) {
  this(analyzer, analyzer);
}

ctor(Analyzer index, Analyzer query) {
  this(index, query, default, default, default);
}

ctor(Analyzer index, Analyzer query, int option, int option, int option) {
  // this is the full ctor!
}
{code}
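Robert's sugar-ctor sketch is the standard telescoping-constructor pattern: each convenience constructor delegates to the next, and only the full constructor assigns state. A compilable stand-alone version (the class name, `mode` parameter, and its default are placeholders, not the actual AnalyzingSuggester signature; Strings stand in for Lucene Analyzers):

```java
// Telescoping constructors: sugar ctors fill in defaults and delegate via
// this(...), so all initialization logic lives in one place (the full ctor).
public class CtorSketch {
    static final int DEFAULT_MODE = 0; // placeholder default option

    final String indexAnalyzer;
    final String queryAnalyzer;
    final int mode;

    // sugar: one analyzer for both index and query time
    CtorSketch(String analyzer) {
        this(analyzer, analyzer);
    }

    // sugar: separate analyzers, default options
    CtorSketch(String index, String query) {
        this(index, query, DEFAULT_MODE);
    }

    // the full ctor: everything explicit
    CtorSketch(String index, String query, int mode) {
        this.indexAnalyzer = index;
        this.queryAnalyzer = query;
        this.mode = mode;
    }

    public static void main(String[] args) {
        CtorSketch s = new CtorSketch("standard");
        // both analyzers are the same instance, mode fell back to the default
        System.out.println(s.indexAnalyzer + " " + s.queryAnalyzer + " " + s.mode);
    }
}
```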
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13465934#comment-13465934 ] Robert Muir commented on LUCENE-3842:
--
+1, thanks Mike. Let's get it in!
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457552#comment-13457552 ] Robert Muir commented on LUCENE-3842:
--
{quote}
I think it would be better/cleaner to append unique (disambiguating) bytes to the end of the analyzed bytes (this was Robert's original idea): then each path is a single result. The only downside I can think of is we will have to reserve a byte (0xFF?), ie we'd append 0xFF 0x00, then 0xFF 0x01 to the next duplicate, ... but since these input BytesRefs are typically UTF-8 ... this seems not so bad? They can of course in general be arbitrary bytes since they are produced by the analysis process...
{quote}
I don't understand why we have to reserve any bytes. We can append arbitrary bytes of any sort to the end of the input side; this will have no effect on the actual surface form that we suggest.
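The disambiguation idea quoted above (append a reserved 0xFF byte plus an ordinal so duplicate analyzed forms become distinct FST inputs, while the surface-form output side is untouched) can be sketched with plain byte arrays. This models the scheme only, not the patch; in real UTF-8 keys the byte 0xFF never occurs, which is why reserving it guarantees the suffixed keys cannot collide with genuine analyzed text:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Make duplicate analyzed keys unique by appending 0xFF <ordinal>. Only the
// FST input side changes; the suggestion shown to the user is still the
// surface form carried on the output side.
public class KeyDisambiguator {
    private final Map<String, Integer> seen = new HashMap<>();

    byte[] disambiguate(String analyzedKey) {
        int count = seen.merge(analyzedKey, 1, Integer::sum); // occurrences so far
        byte[] base = analyzedKey.getBytes(StandardCharsets.UTF_8);
        if (count == 1) return base;                 // first occurrence: unchanged
        byte[] out = Arrays.copyOf(base, base.length + 2);
        out[base.length] = (byte) 0xFF;              // reserved separator byte
        out[base.length + 1] = (byte) (count - 2);   // 0x00 for 1st dup, 0x01 for 2nd, ...
        return out;
    }

    public static void main(String[] args) {
        KeyDisambiguator d = new KeyDisambiguator();
        byte[] a = d.disambiguate("ghost");  // plain UTF-8 "ghost" (5 bytes)
        byte[] b = d.disambiguate("ghost");  // "ghost" + 0xFF 0x00 (7 bytes)
        System.out.println(a.length + " " + b.length); // prints: 5 7
    }
}
```

Robert's counterpoint stands independently of the reserved byte: since the suffix lives only on the input side, any scheme that makes the keys distinct leaves the suggested surface forms unchanged.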
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457556#comment-13457556 ] Robert Muir commented on LUCENE-3842:
--
And as far as exactFirst: let's just keep it simple and have it as a surface-form comparison? This is really what I think most people will expect anyway: in the case of duplicates and exactFirst, nobody really cares about nor sees the underlying analyzed form. So I don't think we should have multiple outputs for the FST here.
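The exactFirst behavior Robert argues for (compare surface forms, not analyzed forms) amounts to hoisting an exact surface match to the front of the already-ranked suggestions. A minimal sketch under that assumption (the method name and list-of-strings shape are illustrative, not the suggester's API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// exactFirst as a surface-form comparison: if some suggestion's surface form
// equals the user's input exactly, move it to position 0 and leave the rest
// of the weight-ranked order intact. The analyzed form never enters into it.
public class ExactFirst {
    static List<String> exactFirst(String query, List<String> ranked) {
        List<String> out = new ArrayList<>(ranked);
        for (int i = 0; i < out.size(); i++) {
            if (out.get(i).equals(query)) {
                out.add(0, out.remove(i)); // hoist the exact surface match
                break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ranked = Arrays.asList("ghosts", "ghost", "ghost stories");
        System.out.println(exactFirst("ghost", ranked)); // prints: [ghost, ghosts, ghost stories]
    }
}
```

Robert's later comment about binary-searching the duplicates' subtree is about making this check cheap inside the FST itself rather than as a linear post-pass like this one.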
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13457562#comment-13457562 ] Robert Muir commented on LUCENE-3842:
--
Sorry Mike, I'm still catching up and sadly just creating a lot of noise and thinking out loud. I think you actually have a point with reserving a byte when there are duplicates :) But at the same time I still think the surface form is valuable for this option too... when we reserve a byte, can we also sort the duplicate outputs up front, in such a way that we can start traversing the output side to look for an exactly-matching surface form? So it's like, within that 'subtree' of the FST (the duplicates for the exact input), we can binary search? Otherwise exactFirst would be inefficient in some cases (as we have to do a special 'walk' here). Christian Moen showed me some scary stuff on his Japanese phone as far as readings-kanji forms ... I think if we can, it might be good to be cautious and keep this fast...
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456728#comment-13456728 ] Bill Bell commented on LUCENE-3842: A common suggester use case is to boost the results by closeness (auto-suggest over the whole USA, but boost the results in the suggester by geo-distance). Would love to get faster response with that. At Lucene Revolution 2012 in Boston a speaker did discuss using WFST to do this, but I have yet to figure out how to do it.

Analyzing Suggester
Key: LUCENE-3842
URL: https://issues.apache.org/jira/browse/LUCENE-3842
Project: Lucene - Core
Issue Type: New Feature
Components: modules/spellchecker
Affects Versions: 3.6, 4.0-ALPHA
Reporter: Robert Muir
Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch

Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching. In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:
* input: analyzed text such as ghost0christmas0past (byte 0 here is an optional token separator)
* output: surface form such as "the ghost of christmas past"
* weight: the weight of the suggestion

We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion. This allows a lot of flexibility:
* Using even StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...", it will suggest "the ghost of christmas past"
* we can add support for synonyms/WDF/etc. at both index and query time (there are tradeoffs here, and this is not implemented!)
* this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading, so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
* other general things like offering suggestions that are more fuzzy, like using a plural stemmer or ignoring accents or whatever.

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc.).

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
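The analyzed-key idea in the description can be sketched without any Lucene machinery. The toy class below (the name ToySuggester and all of its internals are invented for illustration; a sorted map stands in for the wFST with PairOutputs, which it does not reproduce) indexes each suggestion under a lowercased, stopword-stripped key whose tokens are joined with byte 0, then answers an analyzed-prefix lookup with surface forms, best weight first:

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration (NOT Lucene's FST-based implementation) of indexing
// suggestions under an analyzed key while returning the surface form.
public class ToySuggester {
    private static final Set<String> STOPWORDS = Set.of("the", "of", "a");

    // analyzed key -> (weight, surface form); a TreeMap stands in for the FST
    private final TreeMap<String, Map.Entry<Long, String>> index = new TreeMap<>();

    static String analyze(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(t -> !STOPWORDS.contains(t))
                .collect(Collectors.joining("\u0000")); // byte-0 token separator
    }

    public void add(String surface, long weight) {
        index.put(analyze(surface), Map.entry(weight, surface));
    }

    public List<String> lookup(String query, int n) {
        String key = analyze(query);
        // every analyzed key that starts with the analyzed query
        return index.subMap(key, key + "\uffff").values().stream()
                .sorted((a, b) -> Long.compare(b.getKey(), a.getKey())) // best weight first
                .limit(n)
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        ToySuggester s = new ToySuggester();
        s.add("the ghost of christmas past", 50);
        s.add("ghostbusters", 10);
        // stopwords "the"/"of" are ignored on both sides of the lookup
        System.out.println(s.lookup("ghost of chr", 1)); // prints [the ghost of christmas past]
    }
}
```

A real FST shares prefixes and suffixes between keys, so memory stays far smaller than this map; the sketch trades that compactness for readability of the build/lookup contract.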
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456732#comment-13456732 ] Robert Muir commented on LUCENE-3842: Bill, can you start a different thread for that? It's unrelated to this issue.
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456443#comment-13456443 ] Michael McCandless commented on LUCENE-3842: OK, I created the branch: https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3842
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456468#comment-13456468 ] Robert Muir commented on LUCENE-3842: Thanks for resurrecting this from the dead! I had forgotten just how fun this issue was :)
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13456477#comment-13456477 ] Robert Muir commented on LUCENE-3842: +1 for that. Let's keep this as simple as possible and leave the responsibility to the analyzer as much as possible. My main concern for PRESERVE_SEPS was the Japanese use case: we don't much care what the actual tokenization of the Japanese words was, only the concatenated reading string. If the tokenization is a little off but the concatenation of all the readings is still correct, then we are OK. So it makes it more robust against tokenization differences, especially considering it's partial inputs going into this thing (not whole words).
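Robert's point about dropping separators can be made concrete in a few lines. In this sketch (invented names; the readings are romanized stand-ins, not Kuromoji output) two hypothetical tokenizations of the same reading string produce different keys when the separator is preserved, but identical keys once it is dropped:

```java
import java.util.*;

// Sketch of why dropping the token separator makes lookups robust to
// tokenization differences: only the concatenation of tokens matters.
public class SeparatorDemo {
    static String key(List<String> tokens, boolean preserveSeps) {
        return String.join(preserveSeps ? "\u0000" : "", tokens);
    }

    public static void main(String[] args) {
        // Two hypothetical tokenizations of the same reading string.
        List<String> tokensA = List.of("toukyou", "dai", "gaku");
        List<String> tokensB = List.of("toukyou", "daigaku");

        // With separators the keys differ; without them, they agree.
        System.out.println(key(tokensA, true).equals(key(tokensB, true)));   // prints false
        System.out.println(key(tokensA, false).equals(key(tokensB, false))); // prints true
    }
}
```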
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404537#comment-13404537 ] Sudarshan Gaikaiwari commented on LUCENE-3842: bq. maybe we can tie-break instead by the surface form? The FST construction guarantees that the input paths leading to different nodes are unique, while I don't think we have such a guarantee about the surface form. I have attached a patch that modifies the intersectPrefixPaths method to keep track of the input paths as the automaton and FST are intersected. Please let me know if that looks OK to you.
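The uniqueness argument can be illustrated with a toy intersection. In the sketch below (class and method names are invented; this is not Lucene's FSTUtil.intersectPrefixPaths), a sorted map stands in for the FST, and each result carries its full input path from the root, so that ties on weight break deterministically on something guaranteed to be unique:

```java
import java.util.*;

// Toy sketch of carrying the input path along during a prefix
// intersection: surface forms need not be unique, but the input path
// from the root is, so it is a safe tie-breaker.
public class PrefixPaths {
    // a sorted map of (analyzed key -> weight) stands in for the FST
    static final TreeMap<String, Long> FST = new TreeMap<>();

    record Path(String input, long weight) {}

    static List<Path> intersectPrefix(String prefix) {
        List<Path> out = new ArrayList<>();
        for (Map.Entry<String, Long> e : FST.subMap(prefix, prefix + "\uffff").entrySet()) {
            out.add(new Path(e.getKey(), e.getValue())); // keep the full input path, not just the suffix
        }
        // equal weights break deterministically on the stored input path
        out.sort(Comparator.comparingLong((Path p) -> p.weight())
                           .thenComparing(p -> p.input()));
        return out;
    }

    public static void main(String[] args) {
        FST.put("fast ghost", 50L);
        FST.put("fast ghoul", 50L);
        FST.put("fast gizzard", 50L);
        for (Path p : intersectPrefix("fast g")) System.out.println(p.input());
    }
}
```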
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287369#comment-13287369 ] Michael McCandless commented on LUCENE-3842: Hi Sudarshan, thanks for raising this ... I'll have a look...
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13286249#comment-13286249 ] Sudarshan Gaikaiwari commented on LUCENE-3842: Hi Michael, thanks a lot for opening up Util.shortestPaths. Now that I can seed the queue with the initial nodes using addStartPaths, the performance of the GeoSpatialSuggest that I presented at Lucene Revolution has improved by 2x. While migrating my code to use this patch, I noticed that I would hit the following assertion in addIfCompetitive:
{code}
path.input.length--;
assert cmp != 0;
if (cmp < 0) {
{code}
This assert fires when it is not possible to differentiate between the path that we are trying to add to the queue and the bottom. This happens because the different paths that lead to FST nodes during the automaton/FST intersection are not stored, so the input path used to differentiate paths contains only the characters that have been consumed from one of the initial FST nodes. From your comments for the addStartPaths method I think that you have foreseen this problem:
{code}
// nocommit this should also take the starting
// weight...?

/** Adds all leaving arcs, including 'finished' arc, if
 *  the node is final, from this node into the queue. */
public void addStartPaths(FST.Arc<T> node, T startOutput, boolean allowEmptyString) throws IOException {
{code}
Here is a unit test that causes the assert to be triggered:
{code}
public void testInputPathRequired() throws Exception {
  TermFreq keys[] = new TermFreq[] {
    new TermFreq("fast ghost", 50),
    new TermFreq("quick gazelle", 50),
    new TermFreq("fast ghoul", 50),
    new TermFreq("fast gizzard", 50),
  };

  SynonymMap.Builder b = new SynonymMap.Builder(false);
  b.add(new CharsRef("fast"), new CharsRef("quick"), true);
  final SynonymMap map = b.build();

  final Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.SIMPLE, true);
      TokenStream stream = new SynonymFilter(tokenizer, map, true);
      return new TokenStreamComponents(tokenizer, new RemoveDuplicatesTokenFilter(stream));
    }
  };

  AnalyzingCompletionLookup suggester = new AnalyzingCompletionLookup(analyzer);
  suggester.build(new TermFreqArrayIterator(keys));
  List<LookupResult> results = suggester.lookup("fast g", false, 2);
}
{code}
Please let me know if the above analysis looks correct to you and I will start trying to fix this by storing paths during the FST/automaton intersection.
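The failure mode the test exercises is that synonym rewriting maps distinct surface forms onto one analyzed key, so the key alone cannot identify a path. A minimal sketch of that collision (invented names, a one-directional fast-to-quick synonym map, and a plain HashMap instead of the FST):

```java
import java.util.*;

// Toy sketch of the collision: synonym expansion can map distinct
// surface forms onto the same analyzed key, so the suggester must keep
// every surface form (and a disambiguating input path) per key.
public class SynonymCollision {
    static final Map<String, String> SYNONYMS = Map.of("fast", "quick");

    static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (sb.length() > 0) sb.append('\u0000'); // byte-0 token separator
            sb.append(SYNONYMS.getOrDefault(tok, tok));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "fast gazelle" and "quick gazelle" collide after analysis:
        System.out.println(analyze("fast gazelle").equals(analyze("quick gazelle"))); // prints true

        // so one analyzed key must fan out to a list of surface forms
        Map<String, List<String>> index = new HashMap<>();
        for (String surface : List.of("fast gazelle", "quick gazelle")) {
            index.computeIfAbsent(analyze(surface), k -> new ArrayList<>()).add(surface);
        }
        System.out.println(index.size());                            // prints 1
        System.out.println(index.values().iterator().next().size()); // prints 2
    }
}
```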
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278166#comment-13278166 ] Sudarshan Gaikaiwari commented on LUCENE-3842: I was not able to apply the latest patch cleanly:
{quote}
smg@dev21:~/lucene_trunk$ patch -p0 < ~/LUCENE-3842.patch
patching file lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java
patching file lucene/core/src/java/org/apache/lucene/util/automaton/SpecialOperations.java
patching file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java
Hunk #1 FAILED at 109.
1 out of 1 hunk FAILED -- saving rejects to file lucene/core/src/java/org/apache/lucene/util/RollingBuffer.java.rej
patching file lucene/core/src/java/org/apache/lucene/util/fst/Util.java
patching file lucene/core/src/java/org/apache/lucene/analysis/TokenStreamToAutomaton.java
patching file lucene/core/src/test/org/apache/lucene/analysis/TestGraphTokenizers.java
patching file lucene/core/src/test/org/apache/lucene/util/automaton/TestSpecialOperations.java
patching file lucene/suggest/src/test/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionTest.java
patching file lucene/suggest/src/test/org/apache/lucene/search/suggest/LookupBenchmarkTest.java
patching file lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/AnalyzingCompletionLookup.java
patching file lucene/suggest/src/java/org/apache/lucene/search/suggest/analyzing/FSTUtil.java
{quote}
I needed to copy RollingBuffer.java from test-framework to core for the patch to apply cleanly:
{quote}
cp lucene/test-framework/src/java/org/apache/lucene/util/RollingBuffer.java lucene/core/src/java/org/apache/lucene/util/
{quote}
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13278267#comment-13278267 ] Michael McCandless commented on LUCENE-3842: Hi Sudarshan, sorry, that was my bad: I had svn mv'd RollingBuffer, but when I created the patch I failed to pass --show-copies-as-adds to svn, so you have to do that mv yourself before applying the patch...
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13275895#comment-13275895 ] Robert Muir commented on LUCENE-3842:

Nice catch on the exactFirst dup problem!

Analyzing Suggester
---
Key: LUCENE-3842
URL: https://issues.apache.org/jira/browse/LUCENE-3842
Project: Lucene - Java
Issue Type: New Feature
Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch

Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching. In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:

input: analyzed text such as ghost0christmas0past -- byte 0 here is an optional token separator
output: surface form such as "the ghost of christmas past"
weight: the weight of the suggestion

We make an FST with PairOutputs<weight,output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion. This allows a lot of flexibility:

* Using even StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...", it will suggest "the ghost of christmas past"
* We can add support for synonyms/WDF/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
* This is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading, so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
* Other general things like offering suggestions that are more fuzzy, like using a plural stemmer or ignoring accents or whatever.

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
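The pair-output design above can be sketched with a toy stand-in (a TreeMap instead of Lucene's actual FST and PairOutputs classes; all names here are hypothetical): each analyzed key maps to a (weight, surface form) pair, lookup walks keys sharing the analyzed prefix, and the minimum-weight path's accumulated surface form is the suggestion.

```java
import java.util.*;

// Toy sketch of the wFST pair-output idea; a TreeMap stands in for the FST.
public class PairSuggester {
    static final char SEP = '\u0000'; // optional token separator (byte 0 in the issue)

    record Entry(long weight, String surface) {}

    final TreeMap<String, Entry> fst = new TreeMap<>(); // stand-in for the wFST

    void add(String analyzed, long weight, String surface) {
        fst.put(analyzed, new Entry(weight, surface));
    }

    // "shortest path on the weight side, accumulating the surface output":
    // among all entries whose analyzed key extends the prefix, take the
    // lowest weight and return its stored surface form.
    String suggest(String analyzedPrefix) {
        Entry best = null;
        for (Map.Entry<String, Entry> e : fst.tailMap(analyzedPrefix).entrySet()) {
            if (!e.getKey().startsWith(analyzedPrefix)) break; // past the prefix range
            if (best == null || e.getValue().weight() < best.weight()) best = e.getValue();
        }
        return best == null ? null : best.surface();
    }

    public static void main(String[] args) {
        PairSuggester s = new PairSuggester();
        // analyzed form has stopwords removed: ghost SEP christmas SEP past
        s.add("ghost" + SEP + "christmas" + SEP + "past", 10, "the ghost of christmas past");
        s.add("ghost" + SEP + "rider", 20, "ghost rider");
        // typing "ghost of chr..." analyzes to ghost SEP chr
        if (!"the ghost of christmas past".equals(s.suggest("ghost" + SEP + "chr")))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

The real implementation does this traversal lazily over FST arcs rather than scanning a sorted map, but the weight/surface split is the same.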
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276326#comment-13276326 ] Robert Muir commented on LUCENE-3842:

When running the benchmark (LookupBenchmarkTest) I noticed the FST size has increased since the original patch. I wonder why this is? The benchmark uses KeywordAnalyzer... it could be (likely even) that the original patch had a bug and now it's correct, but maybe it's worth investigating...
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13274068#comment-13274068 ] Michael McCandless commented on LUCENE-3842:

OK, when I pass false for enablePositionIncrements to MockAnalyzer in testStandard, then both cases pass... and I added a 3rd case testing for "ghost chris" and it also passes... cool!
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273419#comment-13273419 ] Robert Muir commented on LUCENE-3842:

testStandard is also bogus: it has 2 asserts. The first one should pass, but the second one should really only work if you disable position increments in the (mock) stopfilter.
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272413#comment-13272413 ] Robert Muir commented on LUCENE-3842:

I see the problem. It actually happens on the second term (we have ghost/2 christmas/2). The problem is that it tries to find the last state to connect the new node, but it uses a hashmap based on position for that... so if there are holes this returns null. I think for this code we would add nodes for holes (text=POS_SEP) to simplify the logic?
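The hole problem Robert describes can be sketched in miniature (hypothetical, simplified names; no Lucene classes): with stopwords removed, "the ghost of christmas" yields ghost with posInc=2 and christmas with posInc=2, so positions 0 and 2 are holes, and a position-keyed map of last states has no entry where the next token wants to attach. Filling each hole with an explicit POS_SEP node keeps the map dense.

```java
import java.util.*;

// Sketch of filling position "holes" so the position->state lookup never
// returns null. The path string stands in for the automaton being built.
public class HoleDemo {
    static final char POS_SEP = '|';

    static String buildPath(String[] tokens, int[] posIncs) {
        Map<Integer, Integer> endStateAtPos = new HashMap<>();
        StringBuilder path = new StringBuilder();
        int pos = -1, nextState = 0;
        endStateAtPos.put(-1, nextState++); // start state sits before position 0
        for (int i = 0; i < tokens.length; i++) {
            int start = pos + posIncs[i]; // position this token occupies
            // Without this loop, a buggy endStateAtPos.get(start - 1) style
            // lookup would return null whenever a stopword left a hole.
            for (int p = pos + 1; p < start; p++) {
                path.append(POS_SEP);
                endStateAtPos.put(p, nextState++);
            }
            path.append(tokens[i]);
            endStateAtPos.put(start, nextState++);
            pos = start;
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // "the ghost of christmas" after stopword removal: ghost/2 christmas/2
        String path = buildPath(new String[]{"ghost", "christmas"}, new int[]{2, 2});
        if (!path.equals("|ghost|christmas")) throw new AssertionError(path);
        System.out.println(path);
    }
}
```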
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229584#comment-13229584 ] Robert Muir commented on LUCENE-3842:

I also don't think we really need this generic getFiniteStrings. It's just to get it off the ground. We can just write the possibilities on the fly, I think, and it will be simpler...
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221844#comment-13221844 ] Michael McCandless commented on LUCENE-3842:

Once posLength is in, I think a very simple way to handle multiple paths at query time is to open up the TopNSearcher class in oal.fst.Util. Currently the API only allows you to pass in a single starting FST node, but we can easily improve this by adding e.g. an addStartNode(FST.Arc<T>, int startingCost) instead. This way the app could create a TopNSearcher, add any number of start nodes with the initial path cost, then call .search() to get the best completions. The only limitation of this is that all differences must be pre-computed as an initial path cost that's consistent with how the path costs are accumulated (with the Outputs.add) during searching; I'm not sure if that'd be overly restrictive here?
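The multi-start idea can be sketched with a toy weighted graph instead of Lucene's oal.fst.Util.TopNSearcher (all names below are hypothetical): one best-first search seeded from several start nodes, each carrying a pre-computed initial cost. The constraint Mike notes shows up here too: the seed cost must be in the same "currency" as the per-edge costs it gets added to.

```java
import java.util.*;

// Sketch: a single priority-queue search seeded with multiple start nodes,
// each with its own initial path cost.
public class MultiStart {
    record Edge(String to, int cost) {}
    record Path(int cost, String node) {}

    static int bestCostToEnd(Map<String, List<Edge>> g, Map<String, Integer> starts) {
        PriorityQueue<Path> pq = new PriorityQueue<>(Comparator.comparingInt(Path::cost));
        starts.forEach((node, initialCost) -> pq.add(new Path(initialCost, node)));
        Set<String> done = new HashSet<>();
        while (!pq.isEmpty()) {
            Path p = pq.poll();
            if (!done.add(p.node())) continue;           // already settled
            if (p.node().equals("end")) return p.cost(); // cheapest completion
            for (Edge e : g.getOrDefault(p.node(), List.of()))
                pq.add(new Path(p.cost() + e.cost(), e.to()));
        }
        return -1; // unreachable
    }

    public static void main(String[] args) {
        Map<String, List<Edge>> g = Map.of(
            "a1", List.of(new Edge("end", 5)),
            "a2", List.of(new Edge("end", 1)));
        // a2 has the cheaper edge, but its pre-computed start cost is larger,
        // so the path seeded at a1 wins: 0 + 5 < 10 + 1.
        int best = bestCostToEnd(g, Map.of("a1", 0, "a2", 10));
        if (best != 5) throw new AssertionError(best);
        System.out.println(best);
    }
}
```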
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221929#comment-13221929 ] Dawid Weiss commented on LUCENE-3842:

bq. Patch with a static utility method to translate a TokenStream to a byte-by-byte automaton.
I looked at the patch but I don't fully get what it does. Looks like a combination of state sequence unions, am I right?

bq. Brics has some code for this (puts all the accepted strings into a set).
It's probably a naive walk with an acceptor. I've always wanted to see what Brics returns from that method for an automaton equivalent to .* :)
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221930#comment-13221930 ] Robert Muir commented on LUCENE-3842:

{quote} I've always wanted to see what Brics returns from that method for an automaton equivalent to .* {quote}
Oh, so it is only for finite automata, so it returns null in that case: http://www.brics.dk/automaton/doc/dk/brics/automaton/SpecialOperations.html#getFiniteStrings%28dk.brics.automaton.Automaton%29
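The reason getFiniteStrings must bail out on something like .* is that the language is finite only if no cycle is reachable from the start state. That check can be sketched as a DFS cycle detection over a toy transition map (hypothetical, simplified representation; not Brics' actual data structures):

```java
import java.util.*;

// Sketch: a finite-language automaton has no reachable cycle; if one is
// found, an enumeration like getFiniteStrings would never terminate, so
// the Brics method returns null instead.
public class FiniteCheck {
    static boolean hasReachableCycle(Map<Integer, List<Integer>> delta, int start) {
        return dfs(delta, start, new HashSet<>(), new HashSet<>());
    }

    static boolean dfs(Map<Integer, List<Integer>> delta, int s,
                       Set<Integer> onPath, Set<Integer> done) {
        if (onPath.contains(s)) return true; // back-edge: found a cycle
        if (!done.add(s)) return false;      // already fully explored
        onPath.add(s);
        for (int t : delta.getOrDefault(s, List.of()))
            if (dfs(delta, t, onPath, done)) return true;
        onPath.remove(s);
        return false;
    }

    public static void main(String[] args) {
        // ".*"-style automaton: a self-loop on the start state => infinite
        if (!hasReachableCycle(Map.of(0, List.of(0)), 0)) throw new AssertionError();
        // simple chain 0 -> 1 -> 2: finite language
        if (hasReachableCycle(Map.of(0, List.of(1), 1, List.of(2)), 0)) throw new AssertionError();
        System.out.println("ok");
    }
}
```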
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221932#comment-13221932 ] Robert Muir commented on LUCENE-3842:

{quote} I looked at the patch but I don't fully get what it does. Looks like a combination of state sequence unions, am I right? {quote}
Well, the conversion should ultimately be more useful for the suggester to intersect with the FST than a TokenStream, because a TokenStream is like a word-level automaton: if dogs is a synonym for dog, then we have smelly dog|dogs(positionIncrement=0). So for our intersection, we would prefer it to be deterministic at the 'character' (byte) level instead. So the conversion should produce an automaton of smelly dog(s?) in regex notation... this is easier to work with. At index time it's useful too, because in the FST we only care about all the possible byte strings, so this should be easier to enumerate than a TokenStream (especially if you consider multiword synonyms, decompounded terms, etc., where some span across many).
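The flattening Robert describes can be sketched without any Lucene classes (hypothetical, simplified lattice: one list of surface alternatives per position, where posInc=0 synonyms stack on the same position): enumerating the cross product of the positions yields exactly the byte-level strings the FST cares about, i.e. smelly dog(s?) for the example above.

```java
import java.util.*;

// Sketch: enumerate all byte-level strings of a tiny token lattice.
public class LatticeEnum {
    static List<String> enumerate(List<List<String>> positions) {
        List<String> out = new ArrayList<>();
        expand(positions, 0, "", out);
        return out;
    }

    static void expand(List<List<String>> positions, int pos, String prefix, List<String> out) {
        if (pos == positions.size()) { out.add(prefix); return; }
        String sep = pos == 0 ? "" : " "; // token separator
        for (String alt : positions.get(pos))
            expand(positions, pos + 1, prefix + sep + alt, out);
    }

    public static void main(String[] args) {
        // "smelly dog" with dogs as a posInc=0 synonym of dog
        List<String> all = enumerate(List.of(List.of("smelly"), List.of("dog", "dogs")));
        if (!all.equals(List.of("smelly dog", "smelly dogs"))) throw new AssertionError(all);
        System.out.println(all);
    }
}
```

Multiword synonyms and decompounding need posLength to span positions, which is exactly why the real conversion builds an automaton rather than this per-position cross product.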
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221934#comment-13221934 ] Dawid Weiss commented on LUCENE-3842:

Ok, got it, thanks. I wonder if it's always possible, but I bet you can write a random test to ensure that :)
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221935#comment-13221935 ] Dawid Weiss commented on LUCENE-3842:

It also occurred to me that it would be interesting to have a naive minimization technique for Brics automata which would (for automata with a finite language):

* enumerate the language
* sort it
* build the minimal automaton from the sorted input

While it may seem like an idea with crazy overhead, it may actually be a viable alternative to the minimization algorithms in Brics for very large automata. Interesting.

Analyzing Suggester
---
Key: LUCENE-3842
URL: https://issues.apache.org/jira/browse/LUCENE-3842
Project: Lucene - Java
Issue Type: New Feature
Components: modules/spellchecker
Affects Versions: 3.6, 4.0
Reporter: Robert Muir
Attachments: LUCENE-3842-TokenStream_to_Automaton.patch, LUCENE-3842.patch

Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801, I think we should look at implementing suggesters that have more capabilities than just basic prefix matching. In particular, I think the most flexible approach is to integrate with Analyzer at both build and query time, such that we build a wFST with:

input: the analyzed text, such as ghost0christmas0past (byte 0 here is an optional token separator)
output: the surface form, such as "the ghost of christmas past"
weight: the weight of the suggestion

We make an FST with PairOutputs<weight, output>, but only do the shortest-path operation on the weight side (like the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion.

This allows a lot of flexibility:

* Using even StandardAnalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...", it will suggest "the ghost of christmas past"
* We can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!)
* This is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading, so we would add a TokenFilter that copies ReadingAttribute into the term text to support that...
* Other general things like offering fuzzier suggestions, e.g. using a plural stemmer or ignoring accents or whatever.

According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~100,000 QPS), and the FST size does not explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
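The wFST contract in the description above (analyzed input, PairOutputs<weight, output>, a weight-side best-path search that accumulates the surface form) can be mimicked with a toy in-memory stand-in. This is plain Java for illustration only, not the Lucene FST API; all names here are made up:

```java
import java.util.*;

public class ToySuggester {
    static final char SEP = '\u0000'; // the optional byte-0 token separator

    // analyzed form -> {weight, surface}; a stand-in for PairOutputs<weight, output>
    static final TreeMap<String, Object[]> DICT = new TreeMap<>();

    static void add(String analyzed, long weight, String surface) {
        DICT.put(analyzed, new Object[] {weight, surface});
    }

    /** Prefix match on the analyzed side, then order by weight (best first)
     *  and return surface forms -- the analogue of running the best-path
     *  search on the weight side while accumulating the output side. */
    static List<String> lookup(String analyzedPrefix, int num) {
        List<Map.Entry<String, Object[]>> hits = new ArrayList<>();
        for (Map.Entry<String, Object[]> e : DICT.tailMap(analyzedPrefix).entrySet()) {
            if (!e.getKey().startsWith(analyzedPrefix)) break; // past the prefix range
            hits.add(e);
        }
        hits.sort((a, b) -> Long.compare((Long) b.getValue()[0], (Long) a.getValue()[0]));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(num, hits.size()); i++) {
            out.add((String) hits.get(i).getValue()[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // analyzed with a stopword filter, so "the" and "of" are gone:
        add("ghost" + SEP + "christmas" + SEP + "past", 50, "the ghost of christmas past");
        add("ghost" + SEP + "dog", 10, "ghost dog");
        System.out.println(lookup("ghost" + SEP + "chr", 5)); // [the ghost of christmas past]
    }
}
```

Note how the query never sees the stopwords: "ghost of chr" analyzes to `ghost` + separator + `chr`, which prefix-matches the analyzed key while the surface form keeps "the" and "of".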
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221936#comment-13221936 ] Robert Muir commented on LUCENE-3842:

Dawid: hmm, well the transitions in Brics are ranges in sorted order, so, if it's finite, couldn't we just enumerate the language in sorted order while building the minimal automaton incrementally in parallel? Or am I missing something... it's Sunday... :)
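The observation about sorted-range transitions can be made concrete with a toy DFA (plain Java, not the Brics API; the automaton and names are made up): a DFS that always follows transitions in label order emits a finite language in lexicographic order, which is exactly the sorted stream an incremental (Daciuk-style) minimal-automaton builder could consume in parallel.

```java
import java.util.*;

public class SortedLanguageEnum {
    // Toy DFA: state -> sorted outgoing transitions (label -> target state).
    // 0 -c-> 1 -a-> 2; 2 -r-> 3; 2 -t-> 4; 3 -s-> 5.
    // Accepting states 3, 4, 5 give the language {"car", "cars", "cat"}.
    static final Map<Integer, TreeMap<Character, Integer>> TRANS = new HashMap<>();
    static final Set<Integer> ACCEPT = new HashSet<>(Arrays.asList(3, 4, 5));
    static {
        TRANS.put(0, new TreeMap<>(Map.of('c', 1)));
        TRANS.put(1, new TreeMap<>(Map.of('a', 2)));
        TRANS.put(2, new TreeMap<>(Map.of('r', 3, 't', 4)));
        TRANS.put(3, new TreeMap<>(Map.of('s', 5)));
    }

    /** DFS that takes every hop in label order; since each TreeMap iterates
     *  its labels sorted, accepted strings come out lexicographically. */
    static List<String> enumerate(int state, StringBuilder prefix, List<String> out) {
        if (ACCEPT.contains(state)) out.add(prefix.toString());
        for (Map.Entry<Character, Integer> e
                : TRANS.getOrDefault(state, new TreeMap<>()).entrySet()) {
            prefix.append(e.getKey());
            enumerate(e.getValue(), prefix, out);
            prefix.deleteCharAt(prefix.length() - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(enumerate(0, new StringBuilder(), new ArrayList<>()));
        // [car, cars, cat]
    }
}
```

With the language already arriving sorted, the explicit "sort" step from the three-step plan above would be free, leaving only the enumeration and the incremental build.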
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221940#comment-13221940 ] Dawid Weiss commented on LUCENE-3842:

Yep, sure you could (I admit I haven't looked at Brics in a long time so I don't remember the details, but I do remember the overhead was significant on optimization; this was a while ago - maybe it's improved).
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221691#comment-13221691 ] Robert Muir commented on LUCENE-3842:

Also, just so I don't forget: we should allow separate specification of index-time and query-time analyzers, because in cases where you are adding synonyms/wdf/etc there is a tradeoff (bigger FST versus slower query-time performance).
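The tradeoff named above can be sketched with a toy example (plain Java, illustrative names only, not the Lucene analyzer API): expanding synonyms at index time grows the dictionary but keeps each query to a single lookup, while query-time expansion keeps the dictionary small at the cost of one lookup per variant.

```java
import java.util.*;

public class SynonymTradeoff {
    // Symmetric toy synonym map (a real analyzer chain would supply this).
    static final Map<String, List<String>> SYN = new HashMap<>();
    static {
        SYN.put("tv", Arrays.asList("tv", "television"));
        SYN.put("television", Arrays.asList("television", "tv"));
    }

    static List<String> expand(String term) {
        return SYN.getOrDefault(term, Arrays.asList(term));
    }

    // Index-time expansion: every variant becomes its own key, so the
    // dictionary (the "FST") is bigger, but a query is a single lookup.
    static Map<String, String> buildExpanded(Map<String, String> entries) {
        Map<String, String> dict = new HashMap<>();
        for (Map.Entry<String, String> e : entries.entrySet())
            for (String variant : expand(e.getKey()))
                dict.put(variant, e.getValue());
        return dict;
    }

    // Query-time expansion: the dictionary stays small, but each query
    // fans out into one lookup per synonym variant (slower queries).
    static List<String> queryExpanded(Map<String, String> dict, String query) {
        List<String> out = new ArrayList<>();
        for (String variant : expand(query)) {
            String hit = dict.get(variant);
            if (hit != null) out.add(hit);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> entries = new HashMap<>();
        entries.put("tv", "TV shows");
        Map<String, String> big = buildExpanded(entries);         // 2 keys instead of 1
        System.out.println(big.get("television"));                // TV shows
        System.out.println(queryExpanded(entries, "television")); // [TV shows]
    }
}
```

Separate index-time and query-time analyzers would let a user pick either side of this tradeoff, or mix them.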
[jira] [Commented] (LUCENE-3842) Analyzing Suggester
[ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221748#comment-13221748 ] Michael McCandless commented on LUCENE-3842:

VERY cool :)