[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated LUCENE-4499:
-------------------------------
    Priority: Major  (was: Minor)

> Multi-word synonym filter (synonym expansion)
> ---------------------------------------------
>
>                 Key: LUCENE-4499
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4499
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>    Affects Versions: 4.1, 6.0
>            Reporter: Roman Chyla
>              Labels: analysis, multi-word, synonyms
>             Fix For: 6.0
>
>         Attachments: LUCENE-4499.patch, LUCENE-4499.patch
>
>
> I apologize for bringing up multi-token synonym expansion again. There is
> an old, unresolved issue at LUCENE-1622 [1].
>
> While solving the problem for our needs [2], I discovered that the current
> SolrSynonym parser (and the wonderful FST) has almost everything needed to
> handle both query-time and index-time synonym expansion satisfactorily. It
> seems that people often need to use the synonym filter *slightly* differently
> at indexing and query time.
>
> In our case, we must do different things during indexing and querying.
>
> Example sentence: Mirrors of the Hubble space telescope pointed at XA5
>
> This is what we need (comma marks position bump):
>
> indexing: mirrors,hubble|hubble space telescope|hst,space,telescope,pointed,xa5|astroobject#5
> querying: +mirrors +(hubble space telescope | hst) +pointed +(xa5|astroobject#5)
>
> This translates into the following needs:
>
> indexing time:
>   single-token synonyms => return only synonyms
>   multi-token synonyms => return original tokens *AND* the synonyms
>
> query time:
>   single-token: return only synonyms (but preserve case)
>   multi-token: return only synonyms
>
> We need the original tokens for proximity queries: if we indexed 'hubble
> space telescope' as one token, we could not search for 'hubble NEAR telescope'.
>
> You may (not) be surprised, but Lucene already supports ALL of these
> requirements. The patch is an attempt to state the problem differently. I am
> not sure it is the best option; however, it works perfectly for our needs
> and it seems it could work for the general public too, especially if the
> SynonymFilterFactory had a preconfigured set of SynonymMapBuilders, so that
> people would just choose the one that fits their situation. Please look at
> the unit test.
>
> links:
> [1] https://issues.apache.org/jira/browse/LUCENE-1622
> [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
> [3] seems to be a similar request: http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
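
The index-time and query-time rules in the issue description can be modeled in a few lines. The following is only a sketch of the requested behavior, not the patch and not Lucene's API: it is plain Python, and every name in it (`STOPWORDS`, `SYNONYMS`, `tokenize`, `expand`, `render`) is hypothetical. It assumes that each rule lists all of its output terms explicitly (so `xa5` maps to both `xa5` and `astroobject#5`), that "of", "the" and "at" are stopwords removed without leaving position gaps, and that everything is lowercased, so the "preserve case" requirement for query time is not modeled.

```python
# Hypothetical, self-contained model of the rules in the issue description.
# It does not use Lucene; all names here are illustrative assumptions.

STOPWORDS = {"of", "the", "at"}

# Each rule maps a tuple of input tokens to the terms it should produce.
SYNONYMS = {
    ("hubble", "space", "telescope"): ["hubble space telescope", "hst"],
    ("xa5",): ["xa5", "astroobject#5"],
}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def expand(tokens, index_time):
    """Return a list of positions; each position is a list of stacked terms."""
    max_len = max(len(key) for key in SYNONYMS)
    positions = []
    i = 0
    while i < len(tokens):
        # Greedy longest match starting at token i.
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                match = key
                break
        if match is None:
            # No rule applies: pass the token through unchanged.
            positions.append([tokens[i]])
            i += 1
        elif len(match) == 1 or not index_time:
            # Single-token rule, or any rule at query time: synonyms only.
            positions.append(list(SYNONYMS[match]))
            i += len(match)
        else:
            # Multi-token rule at index time: keep the original tokens at
            # their own positions (needed for proximity queries) and stack
            # the synonyms on the first original token.
            positions.append([match[0]] + SYNONYMS[match])
            positions.extend([t] for t in match[1:])
            i += len(match)
    return positions


def render(positions):
    """Format positions as in the issue: ',' = position bump, '|' = stacked."""
    return ",".join("|".join(p) for p in positions)


tokens = tokenize("Mirrors of the Hubble space telescope pointed at XA5")
print(render(expand(tokens, index_time=True)))
print(render(expand(tokens, index_time=False)))
```

With these assumptions the first line printed is the "indexing" string from the description, and the second is `mirrors,hubble space telescope|hst,pointed,xa5|astroobject#5`, which corresponds to the four clauses of the "querying" example. Because `hubble`, `space` and `telescope` keep their own positions at index time, a proximity query such as 'hubble NEAR telescope' still has positions to work with.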
[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Chyla updated LUCENE-4499:
-------------------------------
    Attachment: LUCENE-4499.patch

A new patch, as the old version extended the wrong class (which caused the web tests to fail).

> Multi-word synonym filter (synonym expansion)
> ---------------------------------------------
>
>                 Key: LUCENE-4499
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4499
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>    Affects Versions: 4.1, 5.0
>            Reporter: Roman Chyla
>            Priority: Minor
>              Labels: analysis, multi-word, synonyms
>             Fix For: 5.0
>
>         Attachments: LUCENE-4499.patch, LUCENE-4499.patch
[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roman updated LUCENE-4499:
-------------------------
    Attachment: LUCENE-4499.patch

Patch against the latest trunk. I am seeing some unrelated unit tests failing.

> Multi-word synonym filter (synonym expansion)
> ---------------------------------------------
>
>                 Key: LUCENE-4499
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4499
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>    Affects Versions: 4.1, 5.0
>            Reporter: roman
>            Priority: Minor
>              Labels: analysis, multi-word, synonyms
>             Fix For: 5.0
>
>         Attachments: LUCENE-4499.patch
[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roman updated LUCENE-4499:
-------------------------
    Description:
I apologize for bringing up multi-token synonym expansion again. There is an old, unresolved issue at LUCENE-1622 [1].

While solving the problem for our needs [2], I discovered that the current SolrSynonym parser (and the wonderful FST) has almost everything needed to handle both query-time and index-time synonym expansion satisfactorily. It seems that people often need to use the synonym filter *slightly* differently at indexing and query time.

In our case, we must do different things during indexing and querying.

Example sentence: Mirrors of the Hubble space telescope pointed at XA5

This is what we need (comma marks position bump):

indexing: mirrors,hubble|hubble space telescope|hst,space,telescope,pointed,xa5|astroobject#5
querying: +mirrors +(hubble space telescope | hst) +pointed +(xa5|astroobject#5)

This translates into the following needs:

indexing time:
  single-token synonyms => return only synonyms
  multi-token synonyms => return original tokens *AND* the synonyms

query time:
  single-token: return only synonyms (but preserve case)
  multi-token: return only synonyms

We need the original tokens for proximity queries: if we indexed 'hubble space telescope' as one token, we could not search for 'hubble NEAR telescope'.

You may (not) be surprised, but Lucene already supports ALL of these requirements. The patch is an attempt to state the problem differently. I am not sure it is the best option; however, it works perfectly for our needs and it seems it could work for the general public too, especially if the SynonymFilterFactory had a preconfigured set of SynonymMapBuilders, so that people would just choose the one that fits their situation. Please look at the unit test.

links:
[1] https://issues.apache.org/jira/browse/LUCENE-1622
[2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
[3] seems to be a similar request: http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html

  was:
I apologize for bringing up multi-token synonym expansion again. There is an old, unresolved issue at LUCENE-1622 [1].

While solving the problem for our needs [2], I discovered that the current SolrSynonym parser (and the wonderful FST) has almost everything needed to handle both query-time and index-time synonym expansion satisfactorily. It seems that people often need to use the synonym filter *slightly* differently at indexing and query time.

In our case, we must do different things during indexing and querying.

Example sentence: Mirrors of the Hubble space telescope pointed at XA5

This is what we need (comma marks position bump):

indexing: mirrors,hubble|hubble space telescope|hst,space,telescope,pointed,xa5|astroobject#5
querying: +mirrors +(hubble space telescope | hst) +pointed +(xa5|astroobject#5)

This translates into the following needs:

indexing time:
  single-token synonyms => return only synonyms
  multi-token synonyms => return original tokens AND the synonyms

We need the original tokens for proximity queries: if we indexed 'hubble space telescope' as one token, we could not search for 'hubble NEAR telescope'.

query time:
  single-token: return only its synonyms (but preserve case)
  multi-token: return only synonyms

You may (not) be surprised, but Lucene already supports ALL these requirements. The patch is an attempt to state the problem differently. I am not sure it is the best option; however, it works perfectly for our needs and it seems it could work for the general public too, especially if the SynonymFilterFactory had a preconfigured set of SynonymMapBuilders, so that people could just choose the one that fits their situation.

links:
[1] https://issues.apache.org/jira/browse/LUCENE-1622
[2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
[3] seems to be a similar request: http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html

> Multi-word synonym filter (synonym expansion)
> ---------------------------------------------
>
>                 Key: LUCENE-4499
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4499
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>    Affects Versions: 4.1, 5.0
>            Reporter: roman
>            Priority: Minor
>              Labels: analysis, multi-word, synonyms
>             Fix For: 5.0
[jira] [Updated] (LUCENE-4499) Multi-word synonym filter (synonym expansion)
[ https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roman updated LUCENE-4499:
-------------------------
    Summary: Multi-word synonym filter (synonym expansion)  (was: Multi-word synonym filter (synonym expansion at indexing time).)

> Multi-word synonym filter (synonym expansion)
> ---------------------------------------------
>
>                 Key: LUCENE-4499
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4499
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>    Affects Versions: 4.1, 5.0
>            Reporter: roman
>            Priority: Minor
>              Labels: analysis, multi-word, synonyms
>             Fix For: 5.0