[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2016-12-19 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Attachment: LUCENE-6664.patch

Here's another patch, just modernizing the last one to apply to
current master, renaming {{SausageGraphFilter}} to
{{FlattenGraphFilter}} and fixing a few javadocs.  I think it's
ready.


> Replace SynonymFilter with SynonymGraphFilter
> ---------------------------------------------
>
>          Key: LUCENE-6664
>          URL: https://issues.apache.org/jira/browse/LUCENE-6664
>      Project: Lucene - Core
>   Issue Type: New Feature
>     Reporter: Michael McCandless
>     Assignee: Michael McCandless
>  Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch,
>               LUCENE-6664.patch, LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
>
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter) that produces correct graphs (it does no "graph
> flattening" itself).  I think this makes it simpler.
>
> This means you must add the FlattenGraphFilter yourself if you are
> applying synonyms during indexing.
>
> Index-time synonym expansion is a necessarily "lossy" graph
> transformation when multi-token (input or output) synonyms are
> applied, because the index does not store {{posLength}}: there will
> always be phrase queries that should match but do not, and phrase
> queries that should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
>
> However, with this new SynonymGraphFilter, if you instead do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make a union of PhraseQuery",
> you should get 100% correct matches (not sure about "proper" scoring
> though...).
>
> This new syn filter still cannot consume an arbitrary graph.
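
To make the "enumerate all paths" idea from the description concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: {{Tok}}, {{TokenGraphPaths}} and {{allPaths}} are hypothetical names invented for this illustration, and the token values are an assumed example of what a correct graph for the synonym usa / united states could look like. Each returned string stands for one phrase that would be OR'd together at query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenGraphPaths {

    /** Hypothetical stand-in for a token: term, position increment, position length. */
    public static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Returns every path from the first node to the last node, as space-joined terms. */
    public static List<String> allPaths(List<Tok> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < n; i++) {
            Tok t = tokens.get(i);
            pos += t.posInc;          // node this token leaves from
            from[i] = pos;
            to[i] = pos + t.posLen;   // node this token arrives at
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, 0, end, "", out);
        return out;
    }

    /** Depth-first walk over the arcs; a path is emitted when it reaches the end node. */
    private static void walk(List<Tok> tokens, int[] from, int[] to,
                             int node, int end, String prefix, List<String> out) {
        if (node == end) {
            out.add(prefix);
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == node) {
                String term = tokens.get(i).term;
                String next = prefix.isEmpty() ? term : prefix + " " + term;
                walk(tokens, from, to, to[i], end, next, out);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed graph: "usa" spans two positions, "united" and "states" one each.
        List<Tok> graph = Arrays.asList(
            new Tok("visit", 1, 1),
            new Tok("usa", 1, 2),
            new Tok("united", 0, 1),
            new Tok("states", 1, 1),
            new Tok("today", 1, 1));
        for (String phrase : allPaths(graph)) {
            System.out.println(phrase);
        }
        // prints:
        // visit usa today
        // visit united states today
    }
}
```

The number of paths can grow quickly when several multi-token synonyms appear in one query, which is one reason an automaton-based query may be preferable to a plain union of phrases.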



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-08-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Fix Version/s: (was: 5.3)
   5.4

 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: Trunk, 5.4

 Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
 LUCENE-6664.patch, usa.png, usa_flat.png





[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-08-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Fix Version/s: (was: 5.4)
   (was: Trunk)

 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
 LUCENE-6664.patch, usa.png, usa_flat.png





[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-08-02 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Attachment: LUCENE-6664.patch

New patch, making the new filters public and experimental again.

I also improved the naming.

[~rcmuir] is this OK?  Or do you think the question of which attributes to use
should block committing this?  I could also put this in sandbox.

 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.3, Trunk

 Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
 LUCENE-6664.patch, usa.png, usa_flat.png





[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-07-28 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Attachment: LUCENE-6664.patch

New patch with Rob's idea: I made the new SynonymGraphFilter and
SausageFilter package private, and replaced the old SynonymFilter with
these two filters.

But TestSynonymMapFilter (the existing unit test) fails, because there
are some changes in behavior with the new filter:

  * Syn output order is different: with the new syn filter, the syn
    comes out before the original token.  This is necessary to ensure
    offsets never go backwards...

  * When there are more output tokens for a syn than input tokens,
    the new syn filter makes new positions for the extra tokens, but
    the old one didn't.

  * The new syn filter does more captureState() calls.

I think we need to keep the old behavior available, maybe using a
Version constant or a separate class (SynFilterPre53,
LegacySynFilter) or something?


 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.3, Trunk

 Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
 usa.png, usa_flat.png





[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-07-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Attachment: usa.png
usa_flat.png

Example syn graph, and flattened version.

 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.3, Trunk

 Attachments: LUCENE-6664.patch, usa.png, usa_flat.png





[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-07-26 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Attachment: LUCENE-6664.patch

New patch, fixing all nocommits, folding in all the nice test cases from 
LUCENE-6582 (thanks [~ianribas]!), fixing some offsets bugs.

I think it's finally ready.  This issue absorbs LUCENE-6638.

I also wrote a fun test method ({{toDot(TokenStream)}}) that converts a 
{{TokenStream}} to a dot file which you can then render with graphviz.  E.g. 
here's the un-flattened expansion for various syns of usa:

!usa.png!

and the corresponding flattened version:

!usa_flat.png!

(red arcs are the inserted synonym tokens)
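
For readers without the patch at hand, the general idea of turning a token graph into graphviz input can be sketched as follows. This is not the {{toDot(TokenStream)}} method from the patch: it is a standalone illustration in plain Java, where {{Tok}}, {{TokenGraphDot}} and this {{toDot}} are hypothetical names, and each token is assumed to carry a term, a position increment and a position length.

```java
import java.util.Arrays;
import java.util.List;

public class TokenGraphDot {

    /** Hypothetical stand-in for a token: term, position increment, position length. */
    public static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Renders each token as a labeled arc from its start node to its end node. */
    public static String toDot(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder("digraph tokens {\n  rankdir = LR;\n");
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;                  // start node of this token
            int end = pos + t.posLen;         // end node of this token
            String label = t.term.replace("\"", "\\\"");  // escape quotes for dot
            sb.append("  ").append(pos).append(" -> ").append(end)
              .append(" [label=\"").append(label).append("\"];\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed graph for the synonym usa / united states.
        List<Tok> graph = Arrays.asList(
            new Tok("usa", 1, 2),
            new Tok("united", 0, 1),
            new Tok("states", 1, 1));
        System.out.print(toDot(graph));
    }
}
```

The output can be saved to a file and rendered with, for example, {{dot -Tpng}}.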

With {{SynonymGraphFilter}}, multi token synonyms can finally be correctly 
represented in the token stream, and using query-time synonyms with either 
{{TermAutomatonQuery}} or some other approach (e.g. expanding all paths and 
making OR of PhraseQuery), the correct results should be returned.  Index-time 
synonyms will still be incorrect (fail to match some phrase queries, and 
incorrectly match other phrase queries) since we don't index the 
PosLenAttribute.
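
The reason index-time synonyms stay incorrect can be shown with a toy model. The sketch below is plain Java, not Lucene code: {{PositionOnlyIndex}}, {{Tok}}, {{index}} and {{matchesPhrase}} are hypothetical names, the "index" simply records each term's position while discarding the position length, and the phrase matcher only checks for consecutive positions. How {{FlattenGraphFilter}} actually lays out positions may differ; the point is only what is lost once the position length is gone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PositionOnlyIndex {

    /** Hypothetical stand-in for a token: term, position increment, position length. */
    public static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        public Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Maps each term to its positions; posLen is deliberately thrown away. */
    public static Map<String, List<Integer>> index(List<Tok> tokens) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;
            postings.computeIfAbsent(t.term, k -> new ArrayList<>()).add(pos);
        }
        return postings;
    }

    /** Exact phrase match: term i must occur at position start + i. */
    public static boolean matchesPhrase(Map<String, List<Integer>> postings, String... terms) {
        if (terms.length == 0) {
            return false;
        }
        List<Integer> starts = postings.get(terms[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                List<Integer> p = postings.get(terms[i]);
                ok = p != null && p.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed graph for "visit usa today" with usa expanded to united states.
        Map<String, List<Integer>> idx = index(Arrays.asList(
            new Tok("visit", 1, 1),
            new Tok("usa", 1, 2),
            new Tok("united", 0, 1),
            new Tok("states", 1, 1),
            new Tok("today", 1, 1)));
        System.out.println(matchesPhrase(idx, "usa", "today"));              // false, but should match
        System.out.println(matchesPhrase(idx, "usa", "states"));             // true, but should not match
        System.out.println(matchesPhrase(idx, "united", "states", "today")); // true, correctly
    }
}
```

Once the position length is dropped, "usa" looks like a one-position token, so "usa today" is missed and "usa states" is wrongly accepted.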


 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.3, Trunk

 Attachments: LUCENE-6664.patch, LUCENE-6664.patch, usa.png, 
 usa_flat.png





[jira] [Updated] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2015-07-07 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-6664:
---
Attachment: LUCENE-6664.patch

Patch, still work in progress.  It includes the FlattenGraphFilter
from LUCENE-6638.

I put everything in sandbox for now, so I could add a test case that
TermAutomatonQuery works correctly for query-time syn expansion.  But
this added a dep from sandbox on analyzers ... I think I'll move the
new filters back to analyzers module and comment on the TAQ test case
as an example.


 Replace SynonymFilter with SynonymGraphFilter
 -

 Key: LUCENE-6664
 URL: https://issues.apache.org/jira/browse/LUCENE-6664
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.3, Trunk

 Attachments: LUCENE-6664.patch

