[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

Here's another patch, just modernizing the last one to apply to current master, renaming {{SausageGraphFilter}} to {{FlattenGraphFilter}} and fixing a few javadocs.

I think it's ready.

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself). I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
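
To make the {{posLength}} point in the description concrete, here is a minimal sketch of why index-time expansion is lossy. It is a toy model and not Lucene code: {{Tok}}, {{matchesGraph}} and {{matchesPositionsOnly}} are hypothetical names, and the example document ("domain name service is up", with "dns" as a synonym spanning the first three tokens) is an assumption chosen for illustration. The only difference between the two matchers is whether the next phrase term is expected at {{pos + posLen}} or at {{pos + 1}}, which is all a positions-only index can do.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model, not Lucene code: shows what is lost when posLength is not indexed. */
final class PosLengthLossDemo {

  /** Hypothetical stand-in for a token: term, absolute position, position length. */
  static final class Tok {
    final String term;
    final int pos;
    final int posLen;

    Tok(String term, int pos, int posLen) {
      this.term = term;
      this.pos = pos;
      this.posLen = posLen;
    }
  }

  /** "domain name service is up", with the synonym "dns" spanning the first three tokens. */
  static List<Tok> example() {
    return Arrays.asList(
        new Tok("dns", 0, 3),
        new Tok("domain", 0, 1),
        new Tok("name", 1, 1),
        new Tok("service", 2, 1),
        new Tok("is", 3, 1),
        new Tok("up", 4, 1));
  }

  /** Phrase match that honours posLength: the next term must start where the previous token ends. */
  static boolean matchesGraph(List<Tok> tokens, String... phrase) {
    return matchFrom(tokens, phrase, 0, -1, true);
  }

  /** Phrase match as a positions-only index sees it: the next term must sit at position + 1. */
  static boolean matchesPositionsOnly(List<Tok> tokens, String... phrase) {
    return matchFrom(tokens, phrase, 0, -1, false);
  }

  private static boolean matchFrom(
      List<Tok> tokens, String[] phrase, int i, int requiredPos, boolean usePosLen) {
    if (i == phrase.length) {
      return true; // every phrase term was placed
    }
    for (Tok t : tokens) {
      if (!t.term.equals(phrase[i])) {
        continue;
      }
      if (requiredPos >= 0 && t.pos != requiredPos) {
        continue; // requiredPos < 0 means the first term may start anywhere
      }
      int next = t.pos + (usePosLen ? t.posLen : 1);
      if (matchFrom(tokens, phrase, i + 1, next, usePosLen)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<Tok> toks = example();
    // "dns is" should match; it does only when posLength is honoured.
    System.out.println("graph \"dns is\": " + matchesGraph(toks, "dns", "is"));
    System.out.println("positions-only \"dns is\": " + matchesPositionsOnly(toks, "dns", "is"));
    // "dns name" should not match; without posLength it does.
    System.out.println("graph \"dns name\": " + matchesGraph(toks, "dns", "name"));
    System.out.println("positions-only \"dns name\": " + matchesPositionsOnly(toks, "dns", "name"));
  }
}
```

In this model the positions-only matcher misses "dns is" and wrongly accepts "dns name", which are the two kinds of phrase-query error the description refers to.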
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Fix Version/s:     (was: 5.3)
                   5.4

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk, 5.4
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, usa.png, usa_flat.png
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Fix Version/s:     (was: 5.4)
                       (was: Trunk)

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, usa.png, usa_flat.png
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

New patch, making the new filters public and experimental again. I also improved the naming.

[~rcmuir] is this OK? Or do you think which attributes to use should block committing this? I can also put this in sandbox?

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> LUCENE-6664.patch, usa.png, usa_flat.png
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

New patch with Rob's idea: I made the new SynonymGraphFilter and SausageFilter package private, and replaced the old SynonymFilter with these two filters.

But TestSynonymMapFilter (the existing unit test) fails, because there are some changes in behavior with the new filter:

  * Syn output order is different: with the new syn filter, the syn comes out before the original token. This is necessary to ensure offsets never go backwards...

  * When there are more output tokens for a syn than input tokens, the new syn filter makes new positions for the extra tokens, but the old one didn't.

  * The new syn filter does more captureState() calls.

I think we need to keep the old behavior available, maybe using a Version constant or a separate class (SynFilterPre53, LegacySynFilter) or something?

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
> usa.png, usa_flat.png
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: usa.png
                usa_flat.png

Example syn graph, and flattened version.

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>         Attachments: LUCENE-6664.patch, usa.png, usa_flat.png
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

New patch, fixing all nocommits, folding in all the nice test cases from LUCENE-6582 (thanks [~ianribas]!), fixing some offsets bugs. I think it's finally ready.

This issue absorbs LUCENE-6638.

I also wrote a fun test method ({{toDot(TokenStream)}}) that converts a {{TokenStream}} to a dot file which you can then render with graphviz. E.g. here's the un-flattened expansion for various syns of usa:

!usa.png!

and the corresponding flattened version:

!usa_flat.png!

(red arcs are the inserted synonym tokens)

With {{SynonymGraphFilter}}, multi token synonyms can finally be correctly represented in the token stream, and using query-time synonyms with either {{TermAutomatonQuery}} or some other approach (e.g. expanding all paths and making OR of PhraseQuery), the correct results should be returned.

Index-time synonyms will still be incorrect (fail to match some phrase queries, and incorrectly match other phrase queries) since we don't index the PosLenAttribute.

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>         Attachments: LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
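
For readers who have not seen a token graph rendered, the idea behind the {{toDot(TokenStream)}} helper mentioned above can be sketched roughly as follows. This is not the method from the patch: it is a standalone approximation that takes a plain list of hypothetical {{Tok}} values (term, position increment, position length) instead of a real {{TokenStream}}, and the three-token example stream for "usa" / "united states" is an assumption, not the content of usa.png. Each token becomes one arc from its start position to start position plus position length.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone sketch: render a token graph as graphviz dot. Not the patch's toDot method. */
final class TokenGraphDot {

  /** Hypothetical stand-in for a token: term, position increment, position length. */
  static final class Tok {
    final String term;
    final int posInc;
    final int posLen;

    Tok(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** Assumed un-flattened stream for input "usa" with the synonym "united states". */
  static List<Tok> example() {
    return Arrays.asList(
        new Tok("united", 1, 1), // node 0 -> 1
        new Tok("usa", 0, 2),    // node 0 -> 2, same start position as "united"
        new Tok("states", 1, 1)); // node 1 -> 2
  }

  /** Emits one dot arc per token, from its start node to start node + posLen. */
  static String toDot(List<Tok> tokens) {
    StringBuilder sb = new StringBuilder("digraph tokens {\n  rankdir=LR;\n");
    int pos = -1; // positions start at -1, so a first posInc of 1 lands on node 0
    for (Tok t : tokens) {
      pos += t.posInc;
      String label = t.term.replace("\\", "\\\\").replace("\"", "\\\"");
      sb.append("  ").append(pos).append(" -> ").append(pos + t.posLen)
          .append(" [label=\"").append(label).append("\"];\n");
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    // Pipe the output into e.g. "dot -Tpng" to get a picture.
    System.out.print(toDot(example()));
  }
}
```

A flattened stream would show up in the same rendering as a graph in which every arc has length one.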
[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter
[ https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6664:
---------------------------------------
    Attachment: LUCENE-6664.patch

Patch, still work in progress. It includes the FlattenGraphFilter from LUCENE-6638.

I put everything in sandbox for now, so I could add a test case that TermAutomatonQuery works correctly for query-time syn expansion. But this added a dep from sandbox on analyzers ... I think I'll move the new filters back to analyzers module and comment on the TAQ test case as an example.

> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>                 Key: LUCENE-6664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6664
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 5.3, Trunk
>         Attachments: LUCENE-6664.patch
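
Besides TermAutomatonQuery, the issue description suggests "enumerate all paths and make union of PhraseQuery" for query-time expansion. A rough sketch of the path enumeration half is below. It is a toy model, not Lucene code: {{Tok}}, {{Arc}} and {{enumeratePaths}} are hypothetical names, the query "usa rocks" with the synonym "united states" is an assumed example, and turning each resulting path into a PhraseQuery and OR-ing them together is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model, not Lucene code: list every path through an un-flattened token graph. */
final class TokenGraphPaths {

  /** Hypothetical stand-in for a token: term, position increment, position length. */
  static final class Tok {
    final String term;
    final int posInc;
    final int posLen;

    Tok(String term, int posInc, int posLen) {
      this.term = term;
      this.posInc = posInc;
      this.posLen = posLen;
    }
  }

  /** One arc of the graph: a term leading from node "from" to node "to". */
  private static final class Arc {
    final int from;
    final int to;
    final String term;

    Arc(int from, int to, String term) {
      this.from = from;
      this.to = to;
      this.term = term;
    }
  }

  /** Assumed query "usa rocks", with "united states" expanded as a synonym of "usa". */
  static List<Tok> example() {
    return Arrays.asList(
        new Tok("united", 1, 1),
        new Tok("usa", 0, 2),
        new Tok("states", 1, 1),
        new Tok("rocks", 1, 1));
  }

  /** Returns each start-to-end path as a space-joined phrase, in depth-first order. */
  static List<String> enumeratePaths(List<Tok> tokens) {
    List<String> out = new ArrayList<>();
    if (tokens.isEmpty()) {
      return out;
    }
    List<Arc> arcs = new ArrayList<>();
    int pos = -1; // a first posInc of 1 puts the first token at node 0
    int end = 0;
    for (Tok t : tokens) {
      pos += t.posInc;
      arcs.add(new Arc(pos, pos + t.posLen, t.term));
      end = Math.max(end, pos + t.posLen);
    }
    walk(arcs, 0, end, new ArrayDeque<String>(), out);
    return out;
  }

  private static void walk(List<Arc> arcs, int node, int end, Deque<String> path, List<String> out) {
    if (node == end) {
      out.add(String.join(" ", path)); // one complete phrase
      return;
    }
    for (Arc a : arcs) {
      if (a.from == node) {
        path.addLast(a.term);
        walk(arcs, a.to, end, path, out);
        path.removeLast();
      }
    }
  }

  public static void main(String[] args) {
    // Each printed line would become one phrase in the OR of phrase queries.
    for (String phrase : enumeratePaths(example())) {
      System.out.println(phrase);
    }
  }
}
```

The number of paths can grow quickly when several multi-token synonyms overlap, which may be one reason to prefer a single automaton-based query over the enumerated union.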