[jira] [Updated] (LUCENE-6687) MLT term frequency calculation bug

2019-05-10 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-6687:

Fix Version/s: 8.1

> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2, 8.1, master (9.0)
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In {{org.apache.lucene.queries.mlt.MoreLikeThis}}, there's a method 
> {{retrieveTerms}} that receives a {{Map}} of fields, i.e. a document 
> basically, but it doesn't have to be an existing doc.
> !solr-mlt-tf-doubling-bug.png|height=500!
> There are 2 for loops, one inside the other, which both loop through the same 
> set of fields.
> That effectively doubles the term frequency for all the terms from fields 
> that we provide in the MLT QP {{qf}} parameter.
> It basically goes two times over the list of fields and accumulates the term 
> frequencies from all fields into {{termFreqMap}}.
> The private method {{retrieveTerms}} is only called from one public method, 
> the overload of {{like}} that receives a {{Map}}, so the private class member 
> {{fieldNames}} is always derived from {{retrieveTerms}}'s argument {{fields}}.
>  
> Put more simply: by the time {{retrieveTerms}} gets called, its parameter 
> {{fields}} and the private member {{fieldNames}} always contain the same list 
> of fields.
> Here's the proof:
> These are the final results of the calculation:
> !solr-mlt-tf-doubling-bug-results.png|height=700!
> And this is the actual {{thread_id:TID0009}} document, where those values 
> were derived from (from fields {{title_mlt}} and {{pagetext_mlt}}):
> !terms-glass.png|height=100!
> !terms-angry.png|height=100!
> !terms-how.png|height=100!
> !terms-accumulator.png|height=100!
> Now, let's further test this hypothesis by seeing MLT QP in action from the 
> AdminUI.
> Let's try to find docs that are More Like doc {{TID0009}}. 
> Here's the interesting part, the query:
> {code}
> q={!mlt qf=pagetext_mlt,title_mlt mintf=14 mindf=2 minwl=3 maxwl=15}TID0009
> {code}
> We just saw, in the last image above, that the term {{accumulator}} appears 
> {{7}} times in the {{TID0009}} doc, but its TF was calculated as {{14}}.
> By using {{mintf=14}}, we say that, when calculating similarity, we don't 
> want to consider terms that appear fewer than 14 times (when terms from fields 
> {{title_mlt}} and {{pagetext_mlt}} are merged together) in {{TID0009}}.
> I added the term {{accumulator}} to only one other document ({{TID0004}}), 
> where it appears only once, in the field {{title_mlt}}.
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png|height=500!
> Let's see what happens when we use {{mintf=15}}:
> !solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png|height=500!
> I should probably mention that multiple fields ({{qf}}) work because I 
> applied the patch: 
> [SOLR-7143|https://issues.apache.org/jira/browse/SOLR-7143].
> Bug, no?
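
The double counting reported above can be reproduced outside Lucene. What follows is a minimal sketch under stated assumptions, not the actual {{MoreLikeThis}} source: the class {{TfDoublingSketch}}, its method names and the toy document are hypothetical, tokenization is reduced to splitting on whitespace, and the outer loop is assumed to run over the same two field names that key the document map, as the report describes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the reported double counting; not the Lucene implementation. */
final class TfDoublingSketch {

    /** Adds one count per whitespace-separated token of text into termFreqMap. */
    static void addTermFrequencies(String text, Map<String, Integer> termFreqMap) {
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                termFreqMap.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Reported shape: every field is re-read once per entry in fieldNames. */
    static Map<String, Integer> nestedLoops(Map<String, String> fields, List<String> fieldNames) {
        Map<String, Integer> termFreqMap = new LinkedHashMap<>();
        for (String fieldName : fieldNames) {        // outer pass: one per field name
            for (String field : fields.keySet()) {   // inner pass: the whole document again
                addTermFrequencies(fields.get(field), termFreqMap);
            }
        }
        return termFreqMap;
    }

    /** Single pass over the document, so each occurrence is counted once. */
    static Map<String, Integer> singleLoop(Map<String, String> fields) {
        Map<String, Integer> termFreqMap = new LinkedHashMap<>();
        for (String field : fields.keySet()) {
            addTermFrequencies(fields.get(field), termFreqMap);
        }
        return termFreqMap;
    }

    /** Models mintf: a term is kept only if its frequency is at least minTf. */
    static boolean passesMinTf(Map<String, Integer> termFreqMap, String term, int minTf) {
        return termFreqMap.getOrDefault(term, 0) >= minTf;
    }

    public static void main(String[] args) {
        // Toy document (made up): "accumulator" occurs 7 times across the two fields.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title_mlt", "accumulator");
        doc.put("pagetext_mlt",
            "accumulator accumulator accumulator accumulator accumulator accumulator glass");
        List<String> fieldNames = new ArrayList<>(doc.keySet());

        Map<String, Integer> nested = nestedLoops(doc, fieldNames);
        Map<String, Integer> single = singleLoop(doc);
        System.out.println("nested loops: " + nested.get("accumulator")); // 14
        System.out.println("single loop:  " + single.get("accumulator")); // 7
        System.out.println("nested passes mintf=14: " + passesMinTf(nested, "accumulator", 14));
        System.out.println("nested passes mintf=15: " + passesMinTf(nested, "accumulator", 15));
    }
}
```

In this model the nested version counts each field once per field name, so with two fields every frequency doubles. That is consistent with {{accumulator}} clearing {{mintf=14}} but not {{mintf=15}} even though it occurs only 7 times.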



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug

2019-05-02 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831521#comment-16831521
 ] 

Tommaso Teofili commented on LUCENE-6687:
-

bq. Why do we show Fix Version 5.2.2?  Was it really backported to 5.2.x branch?

I think the 5.2.2 fix version was set well before the patch was actually 
merged into master, and AFAIK the fix was never backported to any branch other 
than master.
I didn't backport it because it's a subtle change, and I thought it would be 
good to let it sit in a non-releasing branch for a while before including it 
in a branch we release from.

However, perhaps it's time to include it in 8x.



> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2, master (9.0)
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h



[jira] [Updated] (LUCENE-6687) MLT term frequency calculation bug

2019-03-26 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-6687:

Fix Version/s: master (9.0)

> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2, master (9.0)
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h



[jira] [Commented] (LUCENE-6687) MLT term frequency calculation bug

2019-03-26 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801565#comment-16801565
 ] 

Tommaso Teofili commented on LUCENE-6687:
-

thanks [~mbonaci] and [~alessandro.benedetti], I've committed and pushed your 
patch.

> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h



[jira] [Assigned] (LUCENE-6687) MLT term frequency calculation bug

2019-03-26 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-6687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-6687:
---

Assignee: Tommaso Teofili

> MLT term frequency calculation bug
> --
>
> Key: LUCENE-6687
> URL: https://issues.apache.org/jira/browse/LUCENE-6687
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring, core/queryparser
>Affects Versions: 5.2.1, 6.0
> Environment: OS X v10.10.4; Solr 5.2.1
>Reporter: Marko Bonaci
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 5.2.2
>
> Attachments: LUCENE-6687.patch, LUCENE-6687.patch, LUCENE-6687.patch, 
> LUCENE-6687.patch, buggy-method-usage.png, 
> solr-mlt-tf-doubling-bug-results.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf14.png, 
> solr-mlt-tf-doubling-bug-verify-accumulator-mintf15.png, 
> solr-mlt-tf-doubling-bug.png, terms-accumulator.png, terms-angry.png, 
> terms-glass.png, terms-how.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h



[jira] [Updated] (LUCENE-8659) Upgrade OpenNLP to 1.9.1

2019-01-28 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8659:

Fix Version/s: 8.0

> Upgrade OpenNLP to 1.9.1
> 
>
> Key: LUCENE-8659
> URL: https://issues.apache.org/jira/browse/LUCENE-8659
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: 8.0, master (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Since Apache OpenNLP 1.9.1 has been released it would be nice to upgrade 
> Lucene/Solr to use that.






[jira] [Commented] (LUCENE-8659) Upgrade OpenNLP to 1.9.1

2019-01-26 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753205#comment-16753205
 ] 

Tommaso Teofili commented on LUCENE-8659:
-

thanks [~steve_rowe], I've adjusted the checksums as per your previous comments.
Yes, I'll backport this to 7x and 8x branches.

> Upgrade OpenNLP to 1.9.1
> 
>
> Key: LUCENE-8659
> URL: https://issues.apache.org/jira/browse/LUCENE-8659
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h



[jira] [Reopened] (LUCENE-8659) Upgrade OpenNLP to 1.9.1

2019-01-26 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reopened LUCENE-8659:
-

> Upgrade OpenNLP to 1.9.1
> 
>
> Key: LUCENE-8659
> URL: https://issues.apache.org/jira/browse/LUCENE-8659
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h



[jira] [Resolved] (LUCENE-8659) Upgrade OpenNLP to 1.9.1

2019-01-26 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-8659.
-
Resolution: Fixed
  Assignee: Tommaso Teofili

> Upgrade OpenNLP to 1.9.1
> 
>
> Key: LUCENE-8659
> URL: https://issues.apache.org/jira/browse/LUCENE-8659
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
>  Time Spent: 20m
>  Remaining Estimate: 0h



[jira] [Commented] (LUCENE-8659) Upgrade OpenNLP to 1.9.1

2019-01-26 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753010#comment-16753010
 ] 

Tommaso Teofili commented on LUCENE-8659:
-

created [PR|https://github.com/apache/lucene-solr/pull/548].

> Upgrade OpenNLP to 1.9.1
> 
>
> Key: LUCENE-8659
> URL: https://issues.apache.org/jira/browse/LUCENE-8659
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
>  Time Spent: 10m
>  Remaining Estimate: 0h



[jira] [Created] (LUCENE-8659) Upgrade OpenNLP to 1.9.1

2019-01-26 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-8659:
---

 Summary: Upgrade OpenNLP to 1.9.1
 Key: LUCENE-8659
 URL: https://issues.apache.org/jira/browse/LUCENE-8659
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Reporter: Tommaso Teofili
 Fix For: trunk


Since Apache OpenNLP 1.9.1 has been released it would be nice to upgrade 
Lucene/Solr to use that.






[jira] [Resolved] (LUCENE-5698) Evaluate Lucene classification on publicly available datasets

2019-01-24 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-5698.
-
   Resolution: Fixed
Fix Version/s: trunk

> Evaluate Lucene classification on publicly available datasets
> -
>
> Key: LUCENE-5698
> URL: https://issues.apache.org/jira/browse/LUCENE-5698
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/classification
>Reporter: Gergő Törcsvári
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
> Attachments: 0803-test.patch, 0810-test.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Lucene classification module needs some publicly available datasets to 
> keep track of its development.
> It would be nice to have some generated, fast test sets, and some bigger 
> real-world datasets too.






[jira] [Commented] (LUCENE-5698) Evaluate Lucene classification on publicly available datasets

2019-01-18 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-5698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16746633#comment-16746633
 ] 

Tommaso Teofili commented on LUCENE-5698:
-

created PR at https://github.com/apache/lucene-solr/pull/544

> Evaluate Lucene classification on publicly available datasets
> -
>
> Key: LUCENE-5698
> URL: https://issues.apache.org/jira/browse/LUCENE-5698
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/classification
>Reporter: Gergő Törcsvári
>Assignee: Tommaso Teofili
>Priority: Major
> Attachments: 0803-test.patch, 0810-test.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h



[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-11-06 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676322#comment-16676322
 ] 

Tommaso Teofili commented on SOLR-12879:


[~andyhind] I think a separate issue is not needed.
The above doc looks good to me, for the _MinHashFilter_.
Would you also be able to provide some documentation for this query parser?
I think it would be good to document end-to-end usage of the query parser in 
combination with the filter, if possible.


> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.
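
For readers unfamiliar with the measure mentioned above, here is a minimal sketch of exact Jaccard similarity next to a toy MinHash estimate of it. Everything in it is an assumption made for illustration: the class {{JaccardSketch}}, the seeded integer hash and the choice of 128 hash functions are hypothetical and are not the hashing scheme used by the _MinHashFilter_ or by this query parser, and banded queries are not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a toy MinHash, not the scheme implemented in Lucene/Solr. */
final class JaccardSketch {

    /** Exact Jaccard similarity |A ∩ B| / |A ∪ B|; taken as 1.0 for two empty sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Mixes the token hash with a seed so each seed behaves like a separate hash function. */
    static int seededHash(String token, int seed) {
        int h = token.hashCode() ^ (seed * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    /** Signature: for each hash function, the minimum hash over all tokens (set must be non-empty). */
    static int[] signature(Set<String> tokens, int numHashes) {
        int[] mins = new int[numHashes];
        Arrays.fill(mins, Integer.MAX_VALUE);
        for (String token : tokens) {
            for (int i = 0; i < numHashes; i++) {
                mins[i] = Math.min(mins[i], seededHash(token, i));
            }
        }
        return mins;
    }

    /** Fraction of signature positions that agree; this approximates the Jaccard similarity. */
    static double estimate(int[] sigA, int[] sigB) {
        int matches = 0;
        for (int i = 0; i < sigA.length; i++) {
            if (sigA[i] == sigB[i]) {
                matches++;
            }
        }
        return (double) matches / sigA.length;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> b = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println("exact    = " + jaccard(a, b)); // 2 shared of 4 distinct: 0.5
        System.out.println("estimate = " + estimate(signature(a, 128), signature(b, 128)));
    }
}
```

The estimate is only approximate and depends on the number of hash functions, which is one reason the normalisation question raised above matters for how scores are interpreted.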






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-25 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663405#comment-16663405
 ] 

Tommaso Teofili commented on SOLR-12879:


it should be back to green now, thanks [~steve_rowe] for the heads up.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.filter.adoc.fragment, minhash.patch
>



[jira] [Comment Edited] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-23 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660186#comment-16660186
 ] 

Tommaso Teofili edited comment on SOLR-12879 at 10/23/18 7:12 AM:
--

+1 for backporting to 7.x branch.

bq. the parser could potentially be given a default name of (say) minhash and 
included in the standard plugins i.e. 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.5.0/solr/core/src/java/org/apache/solr/search/QParserPlugin.java#L46

good point, +1 (I think _min_hash_ would be slightly better naming as it aligns 
with other patterns, e.g. _PayloadCheckQParserPlugin_ registered as 
_payload_check_ )

bq. The solr/CHANGES.txt entry lacks the customary attribution, just an 
oversight I'm sure and easily fixed.

yes, sorry! I'll fix it right away.


was (Author: teofili):
+1 for backporting to 7.x branch.

bq. the parser could potentially be given a default name of (say) minhash and 
included in the standard plugins i.e. 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.5.0/solr/core/src/java/org/apache/solr/search/QParserPlugin.java#L46

good point, +1 (I think _min_hash_ would be slightly better naming as it align 
with other patterns, e.g. _PayloadCheckQParserPlugin_ registered as 
_payload_check_ )

bq. The solr/CHANGES.txt entry lacks the customary attribution, just an 
oversight I'm sure and easily fixed.

yes, sorry! I'll fix it right away.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.






[jira] [Comment Edited] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-23 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660186#comment-16660186
 ] 

Tommaso Teofili edited comment on SOLR-12879 at 10/23/18 7:12 AM:
--

+1 for backporting to 7.x branch.

bq. the parser could potentially be given a default name of (say) minhash and 
included in the standard plugins i.e. 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.5.0/solr/core/src/java/org/apache/solr/search/QParserPlugin.java#L46

good point, +1 (I think _min_hash_ would be slightly better naming as it align 
with other patterns, e.g. _PayloadCheckQParserPlugin_ registered as 
_payload_check_ )

bq. The solr/CHANGES.txt entry lacks the customary attribution, just an 
oversight I'm sure and easily fixed.

yes, sorry! I'll fix it right away.


was (Author: teofili):
+1 for backporting to 7.x branch.

bq. the parser could potentially be given a default name of (say) minhash and 
included in the standard plugins i.e. 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.5.0/solr/core/src/java/org/apache/solr/search/QParserPlugin.java#L46

good point, +1

bq. The solr/CHANGES.txt entry lacks the customary attribution, just an 
oversight I'm sure and easily fixed.

yes, sorry! I'll fix it right away.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-23 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660186#comment-16660186
 ] 

Tommaso Teofili commented on SOLR-12879:


+1 for backporting to 7.x branch.

bq. the parser could potentially be given a default name of (say) minhash and 
included in the standard plugins i.e. 
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.5.0/solr/core/src/java/org/apache/solr/search/QParserPlugin.java#L46

good point, +1

bq. The solr/CHANGES.txt entry lacks the customary attribution, just an 
oversight I'm sure and easily fixed.

yes, sorry! I'll fix it right away.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.






[jira] [Resolved] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-20 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved SOLR-12879.

Resolution: Fixed

thanks [~andyhind] for your patch, it's now committed on master.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.






[jira] [Assigned] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-20 Thread Tommaso Teofili (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned SOLR-12879:
--

Assignee: Tommaso Teofili

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: minhash.patch
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.






[jira] [Commented] (SOLR-12879) Query Parser for MinHash/LSH

2018-10-17 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653938#comment-16653938
 ] 

Tommaso Teofili commented on SOLR-12879:


bq. Should the score from the overall query be normalised?

I think it depends: in some edge cases, non-normalized scores may introduce 
unexpected bias. But all in all, I don't think it should be normalized.

> Query Parser for MinHash/LSH
> 
>
> Key: SOLR-12879
> URL: https://issues.apache.org/jira/browse/SOLR-12879
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: master (8.0)
>Reporter: Andy Hind
>Priority: Major
> Fix For: master (8.0)
>
>
> Following on from https://issues.apache.org/jira/browse/LUCENE-6968, provide 
> a query parser that builds queries that provide a measure of Jaccard 
> similarity. The initial patch includes banded queries that were also proposed 
> on the original issue.
>  
> I have one outstanding question:
>  * Should the score from the overall query be normalised?
> Note that the band count is currently approximate and may be one less than 
> in practice.






[jira] [Commented] (LUCENE-8331) MergePolicy simulator utility

2018-05-29 Thread Tommaso Teofili (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493290#comment-16493290
 ] 

Tommaso Teofili commented on LUCENE-8331:
-

bq. I think it should support deletes and should not use IW then I ok with it
 
+1

> MergePolicy simulator utility
> -
>
> Key: LUCENE-8331
> URL: https://issues.apache.org/jira/browse/LUCENE-8331
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8331.patch
>
>
> This issue introduces a MergePolicy simulator utility to help evaluate the 
> effectiveness of a MergePolicy.  The simulator does not result in the actual 
> indexing and merging of segments; instead it provides some dummy constructs 
> to MergePolicy to evaluate its decisions.  Therefore you can do simulation 
> runs in little time.
> I'm not sure where it would live.  Perhaps dev-tools, or in tests, or in 
> benchmark?
> I mentioned this recently here:
> https://issues.apache.org/jira/browse/LUCENE-7976?focusedCommentId=16446985=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16446985
>  






[jira] [Comment Edited] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-05-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492469#comment-16492469
 ] 

Tommaso Teofili edited comment on LUCENE-8162 at 5/28/18 10:05 AM:
---

{quote}but many users index at full speed for a long time and suppressing 
merges in that case is dangerous
{quote}
yes, that might make search performance degrade. To mitigate that, the proposed 
MP has a maximum number of segments allowed for throttling. So for example, if 
the throttling algorithm makes the number of segments go beyond a configurable 
threshold (e.g. 20), the throttling algorithm doesn't kick in for the next merge, 
nor until the number of segments gets back below the threshold (by using the 
standard TMP merge algorithm).

I have been trying to use [https://github.com/mikemccand/luceneutil] to run 
some benchmarks. However, it seems the tool only creates one index per 
benchmark; if anyone has suggestions about how to benchmark both indexing (time 
and space) and querying performance, that'd be great.
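
To make the guard described above concrete, here is a minimal sketch of the decision it implies. It is not the code in LUCENE-8162.0.patch: the class and method names are hypothetical, the numbers passed in are the examples quoted in this thread, and treating either rate (docs/sec or MB/sec) on its own as "high" is an assumption of this sketch.

```java
/** Illustrative only; hypothetical names, not taken from the attached patch. */
final class ThrottleGuardSketch {
    private final int maxSegmentsForThrottling; // e.g. 20
    private final double maxDocsPerSec;         // e.g. 1000
    private final double maxMbPerSec;           // e.g. 5

    ThrottleGuardSketch(int maxSegmentsForThrottling, double maxDocsPerSec, double maxMbPerSec) {
        this.maxSegmentsForThrottling = maxSegmentsForThrottling;
        this.maxDocsPerSec = maxDocsPerSec;
        this.maxMbPerSec = maxMbPerSec;
    }

    /** Returns true when merging should be skipped for now (throttled). */
    boolean shouldSkipMerge(int segmentCount, double docsPerSec, double mbPerSec) {
        if (segmentCount > maxSegmentsForThrottling) {
            // Too many segments already: never throttle, let the regular merge selection run.
            return false;
        }
        // Assumption of this sketch: exceeding either rate counts as a high indexing rate.
        return docsPerSec > maxDocsPerSec || mbPerSec > maxMbPerSec;
    }
}
```

With a cap of 20 segments, a call such as shouldSkipMerge(25, 2000, 10) returns false regardless of the rates, which is the safety valve against unbounded segment growth when indexing at full speed for a long time.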


was (Author: teofili):
{quote}but many users index at full speed for a long time and suppressing 
merges in that case is dangerous
{quote}
yes, that might make search performance degrade. To mitigate that, the proposed 
MP has a maximum number of segments allowed for throttling. So for example, if 
the throttling algorithm makes the number of segments go beyond a configurable 
threshold (e.g. 20), the throttling algorithm doesn't kick in for the next merge, 
nor until the number of segments gets back below the threshold.

I have been trying to use [https://github.com/mikemccand/luceneutil] to run 
some benchmarks. However, it seems the tool only creates one index per 
benchmark. 

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
> Attachments: LUCENE-8162.0.patch
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]
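
The single exponential smoothing mentioned in the description can be sketched as follows. This is a generic illustration of the technique, s_t = alpha * x_t + (1 - alpha) * s_(t-1), and not the Oak implementation; the class name, the alpha parameter, and seeding the average with the first observation are assumptions of this sketch.

```java
/** Illustrative only; hypothetical names, not taken from the Oak merge policy. */
final class SmoothedRateSketch {
    private final double alpha;   // smoothing factor in (0, 1]
    private double smoothed;      // current smoothed value
    private boolean initialized;  // false until the first observation arrives

    SmoothedRateSketch(double alpha) {
        if (alpha <= 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Feeds one observed rate (e.g. docs/sec) and returns the updated smoothed average. */
    double update(double observation) {
        if (!initialized) {
            smoothed = observation; // seed with the first observation
            initialized = true;
        } else {
            smoothed = alpha * observation + (1.0 - alpha) * smoothed;
        }
        return smoothed;
    }
}
```

A larger alpha makes the average follow recent rates more closely, while a smaller one makes a rate threshold derived from it change more slowly.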






[jira] [Comment Edited] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-05-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492469#comment-16492469
 ] 

Tommaso Teofili edited comment on LUCENE-8162 at 5/28/18 10:04 AM:
---

{quote}but many users index at full speed for a long time and suppressing 
merges in that case is dangerous
{quote}
yes, that might make search performance degrade. To mitigate that, the proposed 
MP has a maximum number of segments allowed for throttling. So for example, if 
the throttling algorithm makes the number of segments go beyond a configurable 
threshold (e.g. 20), the throttling algorithm doesn't kick in for the next merge, 
nor until the number of segments gets back below the threshold.

I have been trying to use [https://github.com/mikemccand/luceneutil] to run 
some benchmarks. However, it seems the tool only creates one index per 
benchmark. 


was (Author: teofili):
{quote}but many users index at full speed for a long time and suppressing 
merges in that case is dangerous
{quote}
yes, that might make search degrade. To mitigate that, the proposed MP has a 
maximum number of segments allowed for throttling. So for example, if the 
throttling algorithm makes the number of segments go beyond a configurable 
threshold (e.g. 20), the throttling algorithm doesn't kick in for the next merge, 
nor until the number of segments gets back below the threshold.

I have been trying to use [https://github.com/mikemccand/luceneutil] to run 
some benchmarks. However, it seems the tool only creates one index per 
benchmark. 

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
> Attachments: LUCENE-8162.0.patch
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Commented] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-05-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492469#comment-16492469
 ] 

Tommaso Teofili commented on LUCENE-8162:
-

{quote}but many users index at full speed for a long time and suppressing 
merges in that case is dangerous
{quote}
yes, that might make search degrade. To mitigate that, the proposed MP has a 
maximum number of segments allowed for throttling. So for example, if the 
throttling algorithm makes the number of segments go beyond a configurable 
threshold (e.g. 20), the throttling algorithm doesn't kick in for the next merge, 
nor until the number of segments gets back below the threshold.

I have been trying to use [https://github.com/mikemccand/luceneutil] to run 
some benchmarks. However, it seems the tool only creates one index per 
benchmark. 

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
> Attachments: LUCENE-8162.0.patch
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Updated] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-05-28 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8162:

Attachment: LUCENE-8162.0.patch

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
> Attachments: LUCENE-8162.0.patch
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Commented] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-05-09 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468743#comment-16468743
 ] 

Tommaso Teofili commented on LUCENE-8162:
-

[~mikemccand] any suggestions on how to make "reliable" tests with different 
merge policies? Even though this merge policy was designed for a specific use 
case, I would still be curious to run some experiments on how it behaves in a 
more common case (e.g. benchmarking indexing / queries on Wikipedia).

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Comment Edited] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-05-09 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355374#comment-16355374
 ] 

Tommaso Teofili edited comment on LUCENE-8162 at 5/9/18 11:49 AM:
--

the class in Oak is a fork of TMP, but the one in Lucene would extend TMP (see 
[https://gist.github.com/tteofili/f60bd633557b93be106dc8e806d2b8fa]).

the logic uses doc/sec and mb/sec so you're right that the no. of _commits_ is 
not measured.
{quote}So if I index at a high rate but don't commit, the throttling logic can 
still kick in?
{quote}
yes


was (Author: teofili):
the class in Oak is a fork of TMP, but the one in Lucene would extend TMP (see 
[https://gist.github.com/tteofili/f60bd633557b93be106dc8e806d2b8fa).]

the logic uses doc/sec and mb/sec so you're right that the no. of _commits_ is 
not measured.
{quote}So if I index at a high rate but don't commit, the throttling logic can 
still kick in?
{quote}
yes

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Priority: Major
> Fix For: trunk
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Resolved] (LUCENE-8223) CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines

2018-03-27 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-8223.
-
Resolution: Fixed

> CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines
> 
>
> Key: LUCENE-8223
> URL: https://issues.apache.org/jira/browse/LUCENE-8223
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Alan Woodward
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: trunk, 7.4
>
>
> The 7.3 Jenkins smoke tester has failed a couple of times due to 
> CachingNaiveBayesClassifierTest.testPerformance() (see 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-7.3/9/] for example).
> I don't think performance tests like this are very useful as part of the 
> standard test suite, because they depend too much on what else is happening 
> on the machine they're being run on.






[jira] [Updated] (LUCENE-8223) CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines

2018-03-27 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8223:

Fix Version/s: 7.4
   trunk

> CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines
> 
>
> Key: LUCENE-8223
> URL: https://issues.apache.org/jira/browse/LUCENE-8223
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Alan Woodward
>Assignee: Tommaso Teofili
>Priority: Major
> Fix For: trunk, 7.4
>
>
> The 7.3 Jenkins smoke tester has failed a couple of times due to 
> CachingNaiveBayesClassifierTest.testPerformance() (see 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-7.3/9/] for example).
> I don't think performance tests like this are very useful as part of the 
> standard test suite, because they depend too much on what else is happening 
> on the machine they're being run on.






[jira] [Updated] (LUCENE-8223) CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines

2018-03-27 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8223:

Component/s: modules/classification

> CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines
> 
>
> Key: LUCENE-8223
> URL: https://issues.apache.org/jira/browse/LUCENE-8223
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Alan Woodward
>Assignee: Tommaso Teofili
>Priority: Major
>
> The 7.3 Jenkins smoke tester has failed a couple of times due to 
> CachingNaiveBayesClassifierTest.testPerformance() (see 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-7.3/9/] for example).
> I don't think performance tests like this are very useful as part of the 
> standard test suite, because they depend too much on what else is happening 
> on the machine they're being run on.






[jira] [Assigned] (LUCENE-8223) CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines

2018-03-26 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-8223:
---

Assignee: Tommaso Teofili

> CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines
> 
>
> Key: LUCENE-8223
> URL: https://issues.apache.org/jira/browse/LUCENE-8223
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Assignee: Tommaso Teofili
> Priority: Major
>
> The 7.3 Jenkins smoke tester has failed a couple of times due to 
> CachingNaiveBayesClassifierTest.testPerformance() (see 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-7.3/9/] for example).
> I don't think performance tests like this are very useful as part of the 
> standard test suite, because they depend too much on what else is happening 
> on the machine they're being run on.






[jira] [Commented] (LUCENE-8223) CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines

2018-03-26 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413630#comment-16413630
 ] 

Tommaso Teofili commented on LUCENE-8223:
-

Agreed [~romseygeek], I will remove such time-based tests.

> CachingNaiveBayesClassifierTest.testPerformance() fails on slow machines
> 
>
> Key: LUCENE-8223
> URL: https://issues.apache.org/jira/browse/LUCENE-8223
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Priority: Major
>
> The 7.3 Jenkins smoke tester has failed a couple of times due to 
> CachingNaiveBayesClassifierTest.testPerformance() (see 
> [https://builds.apache.org/job/Lucene-Solr-SmokeRelease-7.3/9/] for example).
> I don't think performance tests like this are very useful as part of the 
> standard test suite, because they depend too much on what else is happening 
> on the machine they're being run on.






[jira] [Comment Edited] (SOLR-5351) More Like This Handler uses only first field in mlt.fl when using stream.body

2018-03-01 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382048#comment-16382048
 ] 

Tommaso Teofili edited comment on SOLR-5351 at 3/1/18 2:19 PM:
---

+1 thanks [~dweiss] , the patch looks good to me !


was (Author: teofili):
+1 thanks [~dweiss] , the patch looks good to me, thanks !

> More Like This Handler uses only first field in mlt.fl when using stream.body
> -
>
> Key: SOLR-5351
> URL: https://issues.apache.org/jira/browse/SOLR-5351
> Project: Solr
> Issue Type: Bug
> Components: MoreLikeThis
> Affects Versions: 4.4
> Environment: Linux, Windows
> Reporter: Zygmunt Wiercioch
> Assignee: Tommaso Teofili
> Priority: Minor
> Attachments: SOLR-5351.patch, SOLR-5351.patch
>
>
> The documentation at: http://wiki.apache.org/solr/MoreLikeThisHandler 
> indicates that one can use multiple fields for similarity in mlt.fl:
> http://localhost:8983/solr/mlt?stream.body=electronics%20memory&mlt.fl=manu,cat&mlt.interestingTerms=list&mlt.mintf=0
> In trying this, only one field is used. 
> Looking at the code, it only looks at the first field:
>  public DocListAndSet getMoreLikeThis( Reader reader, int start, int rows, 
> List filters, List terms, int flags ) throws 
> IOException
> {
>   // analyzing with the first field: previous (stupid) behavior
>   rawMLTQuery = mlt.like(reader, mlt.getFieldNames()[0]); 
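
The report implies that every field named in mlt.fl should be consulted, not only `getFieldNames()[0]`. The following plain-Java sketch deliberately avoids the Lucene and Solr APIs and is not the attached patch; `MultiFieldTerms`, `termFrequencies` and the per-field analyzer functions are hypothetical names. It only illustrates the idea of analyzing the same input text once per configured field and keeping the resulting terms apart by field.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration, independent of Lucene/Solr classes.
final class MultiFieldTerms {
  // Tokenizes `text` once per field with that field's analyzer and returns
  // term frequencies keyed by "field:term", so no field is silently skipped.
  static Map<String, Integer> termFrequencies(
      String text, Map<String, Function<String, List<String>>> analyzersByField) {
    Map<String, Integer> freqs = new LinkedHashMap<>();
    for (Map.Entry<String, Function<String, List<String>>> e : analyzersByField.entrySet()) {
      String field = e.getKey();
      for (String term : e.getValue().apply(text)) {
        // Count each occurrence of the term within this field.
        freqs.merge(field + ":" + term, 1, Integer::sum);
      }
    }
    return freqs;
  }
}
```

Keying by field and term keeps the contributions of `manu` and `cat` distinguishable, which matters when the fields use different analyzers.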






[jira] [Commented] (SOLR-5351) More Like This Handler uses only first field in mlt.fl when using stream.body

2018-03-01 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382048#comment-16382048
 ] 

Tommaso Teofili commented on SOLR-5351:
---

+1 thanks [~dweiss] , the patch looks good to me, thanks !

> More Like This Handler uses only first field in mlt.fl when using stream.body
> -
>
> Key: SOLR-5351
> URL: https://issues.apache.org/jira/browse/SOLR-5351
> Project: Solr
> Issue Type: Bug
> Components: MoreLikeThis
> Affects Versions: 4.4
> Environment: Linux, Windows
> Reporter: Zygmunt Wiercioch
> Assignee: Tommaso Teofili
> Priority: Minor
> Attachments: SOLR-5351.patch, SOLR-5351.patch
>
>
> The documentation at: http://wiki.apache.org/solr/MoreLikeThisHandler 
> indicates that one can use multiple fields for similarity in mlt.fl:
> http://localhost:8983/solr/mlt?stream.body=electronics%20memory&mlt.fl=manu,cat&mlt.interestingTerms=list&mlt.mintf=0
> In trying this, only one field is used. 
> Looking at the code, it only looks at the first field:
>  public DocListAndSet getMoreLikeThis( Reader reader, int start, int rows, 
> List filters, List terms, int flags ) throws 
> IOException
> {
>   // analyzing with the first field: previous (stupid) behavior
>   rawMLTQuery = mlt.like(reader, mlt.getFieldNames()[0]); 






[jira] [Commented] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-02-07 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355374#comment-16355374
 ] 

Tommaso Teofili commented on LUCENE-8162:
-

the class in Oak is a fork of TMP, but the one in Lucene would extend TMP (see 
[https://gist.github.com/tteofili/f60bd633557b93be106dc8e806d2b8fa]).

the logic uses docs/sec and MB/sec, so you're right that the number of 
_commits_ is not measured.
{quote}So if I index at a high rate but don't commit, the throttling logic can 
still kick in?
{quote}
yes

> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Tommaso Teofili
> Priority: Major
> Fix For: trunk
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]
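
The decision rule and the adaptive thresholds described above can be sketched in a few lines. This is neither the Oak class nor a Lucene `MergePolicy`; `CommitRateThrottle`, its constructor parameters and the smoothing factor `alpha` are hypothetical. Two assumptions are made because the description does not spell them out: the rate counts as high when either docs/sec or MB/sec exceeds its threshold, and each threshold is updated as s_t = alpha * x_t + (1 - alpha) * s_{t-1}, where x_t is the rate just observed.

```java
// Hypothetical sketch, not the Oak or Lucene implementation.
final class CommitRateThrottle {
  private final int maxSegmentsForMitigation; // e.g. 30
  private final double alpha;                 // smoothing factor in (0, 1], assumed
  private double docsPerSecThreshold;         // e.g. 1000
  private double mbPerSecThreshold;           // e.g. 5

  CommitRateThrottle(int maxSegmentsForMitigation, double docsPerSecThreshold,
                     double mbPerSecThreshold, double alpha) {
    this.maxSegmentsForMitigation = maxSegmentsForMitigation;
    this.docsPerSecThreshold = docsPerSecThreshold;
    this.mbPerSecThreshold = mbPerSecThreshold;
    this.alpha = alpha;
  }

  // Single exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
  static double smooth(double previous, double observed, double alpha) {
    return alpha * observed + (1 - alpha) * previous;
  }

  // Returns true when merging should be skipped for the current commit:
  // few segments and a commit rate above either threshold (assumption).
  boolean shouldSkipMerge(int segmentCount, double docsPerSec, double mbPerSec) {
    boolean rateHigh = docsPerSec > docsPerSecThreshold || mbPerSec > mbPerSecThreshold;
    boolean skip = segmentCount < maxSegmentsForMitigation && rateHigh;
    // Adapt both thresholds towards the rates just observed.
    docsPerSecThreshold = smooth(docsPerSecThreshold, docsPerSec, alpha);
    mbPerSecThreshold = smooth(mbPerSecThreshold, mbPerSec, alpha);
    return skip;
  }

  double docsPerSecThreshold() {
    return docsPerSecThreshold;
  }

  double mbPerSecThreshold() {
    return mbPerSecThreshold;
  }
}
```

With a sustained high rate the thresholds drift upwards, so the throttle stops suppressing merges once that rate has become the norm; a smaller `alpha` makes the thresholds react more slowly.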






[jira] [Updated] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-02-06 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8162:

Description: 
As discussed in a recent mailing list thread [1] and observed in a project 
using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [2].

That MP doesn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : [http://markmail.org/message/re3ifmq2664bqfjk]

[2] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]

  was:
As discussed in a recent mailing list thread [1] and observed in a project 
using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [2].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : http://markmail.org/message/re3ifmq2664bqfjk

[2] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]


> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Tommaso Teofili
> Priority: Major
> Fix For: trunk
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP doesn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : [http://markmail.org/message/re3ifmq2664bqfjk]
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Updated] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-02-06 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8162:

Description: 
As discussed in a recent mailing list thread [1] and observed in a project 
using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [2].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : http://markmail.org/message/re3ifmq2664bqfjk

[2] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]

  was:
As discussed in a [recent mailing list 
thread|[http://markmail.org/message/re3ifmq2664bqfjk]] and observed in a 
project using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to 
throttle the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [1].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]


> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Tommaso Teofili
> Priority: Major
> Fix For: trunk
>
>
> As discussed in a recent mailing list thread [1] and observed in a project 
> using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to throttle 
> the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [2].
> That MP didn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : http://markmail.org/message/re3ifmq2664bqfjk
> [2] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Updated] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-02-06 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8162:

Description: 
As discussed in a [recent mailing list 
thread|[http://markmail.org/message/re3ifmq2664bqfjk]] and observed in a 
project using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to 
throttle the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [1].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]

  was:
As discussed in a [recent mailing list 
thread|[http://markmail.org/message/re3ifmq2664bqfjk],]] and observed in a 
project using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to 
throttle the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [1].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]


> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Tommaso Teofili
> Priority: Major
> Fix For: trunk
>
>
> As discussed in a [recent mailing list 
> thread|[http://markmail.org/message/re3ifmq2664bqfjk]] and observed in a 
> project using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to 
> throttle the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [1].
> That MP didn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Updated] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-02-06 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-8162:

Description: 
As discussed in a [recent mailing list 
thread|[http://markmail.org/message/re3ifmq2664bqfjk],]] and observed in a 
project using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to 
throttle the aggressiveness of (Tiered)MergePolicy when commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [1].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec).

In such impl, the commit rate thresholds adapt to average commit rate by means 
of single exponential smoothing.

The results in that specific case looked encouraging as it brought a 5% perf 
improvement in querying and ~10% reduced IO. However Oak has some specifics 
which might not fit in other scenarios. Anyway it could be interesting to see 
how this behaves in plain Lucene scenario.

[1] : 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]

  was:
As discussed in a [recent mailing list 
thread|[http://markmail.org/message/re3ifmq2664bqfjk|http://markmail.org/message/re3ifmq2664bqfjk],]]
 and observed in a project using Lucene (see OAK-5192 and OAK-6710), it is 
sometimes helpful to throttle the aggressiveness of (Tiered)MergePolicy when 
commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [1].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec). The results in that specific case looked 
encouraging as it brought a 5% perf improvement in querying and ~10% reduced 
IO. However Oak has some specifics which might not fit in other scenarios. 
Anyway it could be interesting to see how this behaves in plain Lucene scenario.

[1] : 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java


> Make it possible to throttle (Tiered)MergePolicy when commit rate is high
> -
>
> Key: LUCENE-8162
> URL: https://issues.apache.org/jira/browse/LUCENE-8162
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Tommaso Teofili
> Priority: Major
> Fix For: trunk
>
>
> As discussed in a [recent mailing list 
> thread|[http://markmail.org/message/re3ifmq2664bqfjk],]] and observed in a 
> project using Lucene (see OAK-5192 and OAK-6710), it is sometimes helpful to 
> throttle the aggressiveness of (Tiered)MergePolicy when commit rate is high.
> In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
> implemented [1].
> That MP didn't merge in case the number of segments is below a certain 
> threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high 
> (e.g. above 1000 doc / sec , 5MB / sec).
> In such impl, the commit rate thresholds adapt to average commit rate by 
> means of single exponential smoothing.
> The results in that specific case looked encouraging as it brought a 5% perf 
> improvement in querying and ~10% reduced IO. However Oak has some specifics 
> which might not fit in other scenarios. Anyway it could be interesting to see 
> how this behaves in plain Lucene scenario.
> [1] : 
> [https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java]






[jira] [Created] (LUCENE-8162) Make it possible to throttle (Tiered)MergePolicy when commit rate is high

2018-02-06 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-8162:
---

 Summary: Make it possible to throttle (Tiered)MergePolicy when 
commit rate is high
 Key: LUCENE-8162
 URL: https://issues.apache.org/jira/browse/LUCENE-8162
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Tommaso Teofili
 Fix For: trunk


As discussed in a [recent mailing list 
thread|[http://markmail.org/message/re3ifmq2664bqfjk|http://markmail.org/message/re3ifmq2664bqfjk],]]
 and observed in a project using Lucene (see OAK-5192 and OAK-6710), it is 
sometimes helpful to throttle the aggressiveness of (Tiered)MergePolicy when 
commit rate is high.

In the case of Apache Jackrabbit Oak a dedicated {{MergePolicy}} was 
implemented [1].

That MP didn't merge in case the number of segments is below a certain 
threshold (e.g. 30) and commit rate (docs per sec and MB per sec) is high (e.g. 
above 1000 doc / sec , 5MB / sec). The results in that specific case looked 
encouraging as it brought a 5% perf improvement in querying and ~10% reduced 
IO. However Oak has some specifics which might not fit in other scenarios. 
Anyway it could be interesting to see how this behaves in plain Lucene scenario.

[1] : 
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/writer/CommitMitigatingTieredMergePolicy.java






[jira] [Commented] (LUCENE-2899) Add OpenNLP Analysis capabilities as a module

2017-12-13 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289246#comment-16289246
 ] 

Tommaso Teofili commented on LUCENE-2899:
-

looks good to me, thanks Steve!

> Add OpenNLP Analysis capabilities as a module
> -
>
> Key: LUCENE-2899
> URL: https://issues.apache.org/jira/browse/LUCENE-2899
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Grant Ingersoll
> Assignee: Steve Rowe
> Priority: Minor
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-2899-6.1.0.patch, LUCENE-2899-RJN.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, LUCENE-2899.patch, 
> LUCENE-2899.patch, LUCENE-2899.patch, OpenNLPFilter.java, 
> OpenNLPTokenizer.java
>
>
> Now that OpenNLP is an ASF project and has a nice license, it would be nice 
> to have a submodule (under analysis) that exposed capabilities for it. Drew 
> Farris, Tom Morton and I have code that does:
> * Sentence Detection as a Tokenizer (could also be a TokenFilter, although it 
> would have to change slightly to buffer tokens)
> * NamedEntity recognition as a TokenFilter
> We are also planning a Tokenizer/TokenFilter that can put parts of speech as 
> either payloads (PartOfSpeechAttribute?) on a token or at the same position.
> I'd propose it go under:
> modules/analysis/opennlp



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (LUCENE-7991) KNearestNeighborDocumentClassifier.knnSearch applies previous boosted field's factor to subsequent unboosted fields

2017-10-17 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16207191#comment-16207191
 ] 

Tommaso Teofili commented on LUCENE-7991:
-

thanks [~cpoerschke], sure, it looks good to me.

> KNearestNeighborDocumentClassifier.knnSearch applies previous boosted field's 
> factor to subsequent unboosted fields
> ---
>
> Key: LUCENE-7991
> URL: https://issues.apache.org/jira/browse/LUCENE-7991
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/classification
> Reporter: Christine Poerschke
> Priority: Minor
> Attachments: LUCENE-7991.patch
>
>
> When reading the code I noticed that in 
> [KNearestNeighborClassifier|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.0.1/lucene/classification/src/java/org/apache/lucene/classification/KNearestNeighborClassifier.java#L179-L182]
>  a neutral boost factor is restored but in 
> [KNearestNeighborDocumentClassifier|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.0.1/lucene/classification/src/java/org/apache/lucene/classification/document/KNearestNeighborDocumentClassifier.java#L126]
>  this currently does not happen. This seems unintended.
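
The pattern at issue is a per-field boost that must fall back to a neutral value for every field that does not carry its own. The sketch below does not use any Lucene classes and is not the attached patch; `FieldBoosts.parse` is a hypothetical helper, and the `name^boost` field syntax is an assumption made for illustration. The point is only that the neutral boost is re-established at the top of each iteration, so a boosted field cannot leak its factor into the next one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the reset pattern; no Lucene classes involved.
final class FieldBoosts {
  // Parses entries such as "title^2.0" or "body" into field -> boost.
  static Map<String, Float> parse(String... fieldSpecs) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (String spec : fieldSpecs) {
      float boost = 1.0f; // neutral boost, restored on every iteration
      String field = spec;
      int caret = spec.indexOf('^');
      if (caret >= 0) {
        field = spec.substring(0, caret);
        boost = Float.parseFloat(spec.substring(caret + 1));
      }
      boosts.put(field, boost);
    }
    return boosts;
  }
}
```

Had `boost` been declared outside the loop and only assigned for boosted fields, an unboosted field following `title^2.0` would have inherited the factor 2.0, which is the behaviour the report describes.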






[jira] [Resolved] (LUCENE-7981) ClassificationTestBase should check if result is null

2017-09-29 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7981.
-
Resolution: Fixed

> ClassificationTestBase should check if result is null
> -
>
> Key: LUCENE-7981
> URL: https://issues.apache.org/jira/browse/LUCENE-7981
> Project: Lucene - Core
> Issue Type: Test
> Components: modules/classification
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Priority: Trivial
> Fix For: trunk
>
>
> {{ClassificationTestBase}} should check that the {{ClassificationResult}} 
> returned by a {{Classifier}} is always _not null_. 






[jira] [Updated] (LUCENE-7981) ClassificationTestBase should check if result is null

2017-09-29 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-7981:

Priority: Trivial  (was: Major)

> ClassificationTestBase should check if result is null
> -
>
> Key: LUCENE-7981
> URL: https://issues.apache.org/jira/browse/LUCENE-7981
> Project: Lucene - Core
> Issue Type: Test
> Components: modules/classification
> Reporter: Tommaso Teofili
> Assignee: Tommaso Teofili
> Priority: Trivial
> Fix For: trunk
>
>
> {{ClassificationTestBase}} should check that the {{ClassificationResult}} 
> returned by a {{Classifier}} is always _not null_. 






[jira] [Created] (LUCENE-7981) ClassificationTestBase should check if result is null

2017-09-29 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7981:
---

 Summary: ClassificationTestBase should check if result is null
 Key: LUCENE-7981
 URL: https://issues.apache.org/jira/browse/LUCENE-7981
 Project: Lucene - Core
  Issue Type: Test
  Components: modules/classification
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
 Fix For: trunk


{{ClassificationTestBase}} should check that the {{ClassificationResult}} 
returned by a {{Classifier}} is always _not null_. 
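
A minimal sketch of the kind of check described above. The Result type and the classifyChecked helper are hypothetical stand-ins used only for illustration; they are not the actual Lucene ClassificationResult, Classifier, or ClassificationTestBase code.

```java
import java.util.function.Function;

final class NullResultCheck {
    // Hypothetical stand-in for a classification result (not the Lucene class).
    static final class Result {
        final String assignedClass;
        final double score;

        Result(String assignedClass, double score) {
            this.assignedClass = assignedClass;
            this.score = score;
        }
    }

    // Runs the classifier on the input and fails fast if it returns null.
    static Result classifyChecked(Function<String, Result> classifier, String input) {
        Result result = classifier.apply(input);
        if (result == null) {
            throw new AssertionError("classifier returned null for input: " + input);
        }
        return result;
    }
}
```

Putting the check in one shared helper means every classifier test would get the same not-null guarantee before it inspects the assigned class or score.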






[jira] [Updated] (LUCENE-7950) SimpleNaiveBayesDocumentClassifier throws NPE if no docs have the class field

2017-09-02 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-7950:

Affects Version/s: 6.6

> SimpleNaiveBayesDocumentClassifier throws NPE if no docs have the class field
> -
>
> Key: LUCENE-7950
> URL: https://issues.apache.org/jira/browse/LUCENE-7950
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Affects Versions: 6.6
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 7.1
>
>
> As discussed on the solr-user@ list the SNBDC throws a NPE as the potential 
> _null_ value resulting from _MultiFields.getTerms(indexReader, 
> classFieldName)_ is not properly handled.






[jira] [Resolved] (LUCENE-7950) SimpleNaiveBayesDocumentClassifier throws NPE if no docs have the class field

2017-09-02 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7950.
-
Resolution: Fixed

> SimpleNaiveBayesDocumentClassifier throws NPE if no docs have the class field
> -
>
> Key: LUCENE-7950
> URL: https://issues.apache.org/jira/browse/LUCENE-7950
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Affects Versions: 6.6
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 7.1
>
>
> As discussed on the solr-user@ list the SNBDC throws a NPE as the potential 
> _null_ value resulting from _MultiFields.getTerms(indexReader, 
> classFieldName)_ is not properly handled.






[jira] [Created] (LUCENE-7950) SimpleNaiveBayesDocumentClassifier throws NPE if no docs have the class field

2017-09-02 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7950:
---

 Summary: SimpleNaiveBayesDocumentClassifier throws NPE if no docs 
have the class field
 Key: LUCENE-7950
 URL: https://issues.apache.org/jira/browse/LUCENE-7950
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/classification
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
 Fix For: 7.1


As discussed on the solr-user@ list the SNBDC throws a NPE as the potential 
_null_ value resulting from _MultiFields.getTerms(indexReader, classFieldName)_ 
is not properly handled.
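
A minimal sketch of the null handling this issue asks for, assuming the lookup can return null when no document has the class field. The Map-based index and the lookupTerms and classLabels names are hypothetical stand-ins so the example runs on its own; the real code would guard the value returned by the MultiFields.getTerms call instead.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class NullSafeClassTerms {
    // Hypothetical stand-in for a per-field terms lookup that, like the call
    // mentioned above, yields null when no document has the field.
    static List<String> lookupTerms(Map<String, List<String>> index, String field) {
        return index.get(field);
    }

    // Returns the class labels, or an empty list when the field has no terms.
    static List<String> classLabels(Map<String, List<String>> index, String classField) {
        List<String> terms = lookupTerms(index, classField);
        if (terms == null) {
            return Collections.emptyList(); // no known classes, nothing to iterate
        }
        return terms;
    }
}
```

Returning an empty collection is only one possible design; a classifier could equally return an empty result list or report the missing field explicitly.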






[jira] [Commented] (LUCENE-7915) Avoid looping over merge segments in best merge selection

2017-08-01 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16109841#comment-16109841
 ] 

Tommaso Teofili commented on LUCENE-7915:
-

thanks!

> Avoid looping over merge segments in best merge selection
> -
>
> Key: LUCENE-7915
> URL: https://issues.apache.org/jira/browse/LUCENE-7915
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Trivial
> Fix For: trunk, 7.1
>
>
> With java 8 we can trivially avoid looping over merge segments to be merged, 
> switching from 
> {code}
> for(SegmentCommitInfo info : merge.segments) {
>   toBeMerged.add(info);
> }
> {code}
> to :
> {code}
> toBeMerged.addAll(merge.segments);
> {code}






[jira] [Resolved] (LUCENE-7915) Avoid looping over merge segments in best merge selection

2017-08-01 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7915.
-
Resolution: Fixed

> Avoid looping over merge segments in best merge selection
> -
>
> Key: LUCENE-7915
> URL: https://issues.apache.org/jira/browse/LUCENE-7915
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Trivial
> Fix For: trunk, 7.1
>
>
> With java 8 we can trivially avoid looping over merge segments to be merged, 
> switching from 
> {code}
> for(SegmentCommitInfo info : merge.segments) {
>   toBeMerged.add(info);
> }
> {code}
> to :
> {code}
> toBeMerged.addAll(merge.segments);
> {code}






[jira] [Updated] (LUCENE-7915) Avoid looping over merge segments in best merge selection

2017-08-01 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-7915:

Fix Version/s: 7.1

> Avoid looping over merge segments in best merge selection
> -
>
> Key: LUCENE-7915
> URL: https://issues.apache.org/jira/browse/LUCENE-7915
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
>Priority: Trivial
> Fix For: trunk, 7.1
>
>
> With java 8 we can trivially avoid looping over merge segments to be merged, 
> switching from 
> {code}
> for(SegmentCommitInfo info : merge.segments) {
>   toBeMerged.add(info);
> }
> {code}
> to :
> {code}
> toBeMerged.addAll(merge.segments);
> {code}






[jira] [Created] (LUCENE-7915) Avoid looping over merge segments in best merge selection

2017-08-01 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7915:
---

 Summary: Avoid looping over merge segments in best merge selection
 Key: LUCENE-7915
 URL: https://issues.apache.org/jira/browse/LUCENE-7915
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/index
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
Priority: Trivial
 Fix For: trunk


With java 8 we can trivially avoid looping over merge segments to be merged, 
switching from 
{code}
for(SegmentCommitInfo info : merge.segments) {
  toBeMerged.add(info);
}
{code}

to :

{code}
toBeMerged.addAll(merge.segments);
{code}
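
A self-contained sketch showing that the two forms above produce the same collection. Plain strings stand in for SegmentCommitInfo, and the target is assumed to be a Set; both are assumptions made for illustration and do not come from the merge policy code itself.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AddAllExample {
    // Explicit loop, as in the "before" snippet.
    static Set<String> withLoop(List<String> segments) {
        Set<String> toBeMerged = new HashSet<>();
        for (String info : segments) {
            toBeMerged.add(info);
        }
        return toBeMerged;
    }

    // Single bulk call, as in the "after" snippet.
    static Set<String> withAddAll(List<String> segments) {
        Set<String> toBeMerged = new HashSet<>();
        toBeMerged.addAll(segments);
        return toBeMerged;
    }
}
```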








[jira] [Commented] (SOLR-8492) Add LogisticRegressionQuery and LogitStream

2017-07-27 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102919#comment-16102919
 ] 

Tommaso Teofili commented on SOLR-8492:
---

sure Joel, I'll have a look and let you know.

> Add LogisticRegressionQuery and LogitStream
> ---
>
> Key: SOLR-8492
> URL: https://issues.apache.org/jira/browse/SOLR-8492
> Project: Solr
>  Issue Type: New Feature
>  Components: streaming expressions
>Reporter: Joel Bernstein
> Fix For: 6.2, 7.0
>
> Attachments: logit.csv, SOLR-8492.diff, SOLR-8492.diff, 
> SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, 
> SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery 
> (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html)
>  and returns a DelegatingCollector that implements a Stochastic Gradient 
> Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions 
> to provide iteration over the shards. Each call to LogitStream.read() calls 
> down to the shards and executes the LogisticRegressionQuery. The model data 
> is collected from the shards and the weights are averaged and sent back to 
> the shards with the next iteration. Each call to read() returns a Tuple with 
> the averaged weights and error from the shards. With this approach the 
> LogitStream streams the changing model back to the client after each 
> iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined 
> maxIterations. When sent as a Streaming Expression to the Stream handler this 
> provides parallel iterative behavior. This same approach can be used to 
> implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the 
> iteration. More work will need to be done to ensure the SGD is properly 
> implemented. The distributed approach of the SGD will also need to be 
> reviewed.
> This implementation is designed for use cases with a small number of features 
> because each feature is its own discrete field.
> An implementation which supports a higher number of features would be 
> possible by packing features into a byte array and storing as binary 
> DocValues.
> This implementation is designed to support a large sample set. With a large 
> number of shards, a sample set into the billions may be possible.
> sample Streaming Expression Syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f" outcome="x" maxIterations="80")
> {code}
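
A rough sketch of the two numeric steps the description mentions: an SGD update for logistic regression on a shard, and averaging the per-shard weights between iterations. The class and method names and the learning rate parameter alpha are hypothetical; this is a textbook formulation, not the code from the attached patches.

```java
final class ShardWeightAveraging {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One SGD update for a single example; updates w in place and
    // returns the prediction error (outcome minus predicted probability).
    static double sgdStep(double[] w, double[] x, int outcome, double alpha) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        double error = outcome - sigmoid(z);
        for (int i = 0; i < w.length; i++) {
            w[i] += alpha * error * x[i];
        }
        return error;
    }

    // Averages per-shard weight vectors element by element.
    static double[] average(double[][] shardWeights) {
        int numFeatures = shardWeights[0].length;
        double[] averaged = new double[numFeatures];
        for (double[] weights : shardWeights) {
            for (int i = 0; i < numFeatures; i++) {
                averaged[i] += weights[i];
            }
        }
        for (int i = 0; i < numFeatures; i++) {
            averaged[i] /= shardWeights.length;
        }
        return averaged;
    }
}
```

In the flow described above, each shard would run updates like sgdStep over its local documents, and the averaged vector would be sent back to the shards as the starting weights for the next iteration.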






[jira] [Resolved] (LUCENE-7838) Add a knn classifier based on fuzzified term queries

2017-07-05 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7838.
-
Resolution: Fixed

I'm marking this as resolved, improvements will come in subsequent issues.

> Add a knn classifier based on fuzzified term queries
> 
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 7.0
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Updated] (LUCENE-7838) Add a knn classifier based on fuzzified term queries

2017-07-05 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-7838:

Summary: Add a knn classifier based on fuzzified term queries  (was: Add a 
knn classifier based on fuzzy like this)

> Add a knn classifier based on fuzzified term queries
> 
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: 7.0
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Commented] (LUCENE-7838) Add a knn classifier based on fuzzy like this

2017-06-29 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067962#comment-16067962
 ] 

Tommaso Teofili commented on LUCENE-7838:
-

I've removed the dependency on the sandbox module and created a dedicated 
version of FLT named NearestFuzzyQuery in 
org.apache.lucene.classification.utils package.
The goal now is to refine NearestFuzzyQuery in order to get better 
classification results and remove some specifics of FLT.

> Add a knn classifier based on fuzzy like this
> -
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Resolved] (LUCENE-7274) Add LogisticRegressionDocumentClassifier

2017-06-29 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7274.
-
Resolution: Won't Fix

> Add LogisticRegressionDocumentClassifier
> 
>
> Key: LUCENE-7274
> URL: https://issues.apache.org/jira/browse/LUCENE-7274
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-7274.patch
>
>
> Add LogisticRegressionDocumentClassifier for Lucene.






[jira] [Commented] (LUCENE-7838) Add a knn classifier based on fuzzy like this

2017-05-31 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16031313#comment-16031313
 ] 

Tommaso Teofili commented on LUCENE-7838:
-

As per the related thread on dev@, I'll drop the dependency on the sandbox 
module, which is indeed not appropriate. If possible I'd like to keep the 
classifier, but I'd rather not just copy-paste the FLT code from sandbox to 
classification, so it'll take a bit of time to tweak it as needed.

> Add a knn classifier based on fuzzy like this
> -
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Commented] (LUCENE-7838) Add a knn classifier based on fuzzy like this

2017-05-27 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027295#comment-16027295
 ] 

Tommaso Teofili commented on LUCENE-7838:
-

bq. you added a dependency on the sandbox module from another module. That's 
quite surprising to me...  I don't think that's legit?

why? As soon as we provide releases of lucene-sandbox I assume we expect people 
and other modules to use it.

bq. New inter-module dependencies (of any kind) I think should also deserve 
communication on the JIRA issue and I don't see any mention here.

Since this is only impacting master branch I had thought there was no need to 
explicitly mention that; on the other hand {{FuzzyLikeThisQuery}} lives in 
sandbox therefore I had assumed there was no need to explicitly specify that in 
the issue.

bq. I also don't see a CHANGES.txt entry

right, there's no such entry.

bq.  I don't see a patch file either but I admit I welcome that 

I'm not sure I get your point here, would you have expected a patch ?

> Add a knn classifier based on fuzzy like this
> -
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Resolved] (LUCENE-7838) Add a knn classifier based on fuzzy like this

2017-05-18 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7838.
-
Resolution: Fixed

> Add a knn classifier based on fuzzy like this
> -
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Assigned] (LUCENE-7838) Add a knn classifier based on fuzzy like this

2017-05-18 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-7838:
---

Assignee: Tommaso Teofili

> Add a knn classifier based on fuzzy like this
> -
>
> Key: LUCENE-7838
> URL: https://issues.apache.org/jira/browse/LUCENE-7838
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> FLT mixes fuzzy and MLT, in the context of Lucene based classification it 
> might be useful to add such a fuzziness to a dedicated KNN classifier (based 
> on FLT queries).






[jira] [Created] (LUCENE-7838) Add a knn classifier based on fuzzy like this

2017-05-18 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7838:
---

 Summary: Add a knn classifier based on fuzzy like this
 Key: LUCENE-7838
 URL: https://issues.apache.org/jira/browse/LUCENE-7838
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/classification
Reporter: Tommaso Teofili
 Fix For: master (7.0)


FLT mixes fuzzy and MLT, in the context of Lucene based classification it might 
be useful to add such a fuzziness to a dedicated KNN classifier (based on FLT 
queries).






[jira] [Resolved] (LUCENE-7823) Have a naive bayes classifier which uses plain BM25 scores instead of plain frequencies

2017-05-15 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7823.
-
Resolution: Fixed

> Have a naive bayes classifier which uses plain BM25 scores instead of plain 
> frequencies
> ---
>
> Key: LUCENE-7823
> URL: https://issues.apache.org/jira/browse/LUCENE-7823
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> {{SimpleNaiveBayesClassifier}} uses term frequencies with add-one smoothing 
> to calculate likelihood and just tf for prior. Given Lucene has switched to 
> BM25 it would be better to have a different impl which uses BM25 
> scoring as a probability measure of both prior and likelihood.






[jira] [Commented] (LUCENE-7823) Have a naive bayes classifier which uses plain BM25 scores instead of plain frequencies

2017-05-11 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16006515#comment-16006515
 ] 

Tommaso Teofili commented on LUCENE-7823:
-

checked in {{BM25NBClassifier}} implementation; when compared with 
{{SimpleNaiveBayesClassifier}}, it gives a 0.06 improvement in f1 over 20 
newsgroups dataset.

> Have a naive bayes classifier which uses plain BM25 scores instead of plain 
> frequencies
> ---
>
> Key: LUCENE-7823
> URL: https://issues.apache.org/jira/browse/LUCENE-7823
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> {{SimpleNaiveBayesClassifier}} uses term frequencies with add-one smoothing 
> to calculate likelihood and just tf for prior. Given Lucene has switched to 
> BM25 it would be better to have a different impl which uses BM25 
> scoring as a probability measure of both prior and likelihood.






[jira] [Created] (LUCENE-7823) Have a naive bayes classifier which uses plain BM25 scores instead of plain frequencies

2017-05-11 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7823:
---

 Summary: Have a naive bayes classifier which uses plain BM25 
scores instead of plain frequencies
 Key: LUCENE-7823
 URL: https://issues.apache.org/jira/browse/LUCENE-7823
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/classification
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
 Fix For: master (7.0)


{{SimpleNaiveBayesClassifier}} uses term frequencies with add-one smoothing to 
calculate likelihood and just tf for prior. Given Lucene has switched to BM25 
it would be better to have a different impl which uses BM25 
scoring as a probability measure of both prior and likelihood.
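
For reference, a minimal sketch of the add-one (Laplace) smoothed likelihood mentioned above, in its textbook form. The method and parameter names are hypothetical, and the exact computation inside SimpleNaiveBayesClassifier may differ.

```java
final class AddOneSmoothing {
    // P(term | class) with add-one smoothing:
    // (frequency of the term in the class + 1) / (all term occurrences in the class + vocabulary size).
    static double likelihood(long termFreqInClass, long totalTermsInClass, long vocabularySize) {
        return (termFreqInClass + 1.0) / (totalTermsInClass + vocabularySize);
    }
}
```

The added 1 keeps a term that never occurs in a class from forcing the whole product of likelihoods to zero.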






[jira] [Assigned] (LUCENE-7498) More Like This to Use BM25

2017-04-19 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-7498:
---

Assignee: Tommaso Teofili

> More Like This to Use BM25
> --
>
> Key: LUCENE-7498
> URL: https://issues.apache.org/jira/browse/LUCENE-7498
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/other
>Reporter: Alessandro Benedetti
>Assignee: Tommaso Teofili
>
> BM25 is now the default similarity, but More Like This is still using the 
> old TF/IDF.
>  
> This issue is to move to BM25 and refactor the MLT to be more organised, 
> extensible and maintainable.
> A few extensions will follow later, but the focus of this issue will be:
> - BM25
> - code refactor + tests






[jira] [Commented] (LUCENE-7776) Switch KNN classifier to use BM25 similarity

2017-04-13 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967746#comment-15967746
 ] 

Tommaso Teofili commented on LUCENE-7776:
-

sure Alessandro, thanks for sharing info about your work, I'll have a look once 
you open the PR.

> Switch KNN classifier to use BM25 similarity
> 
>
> Key: LUCENE-7776
> URL: https://issues.apache.org/jira/browse/LUCENE-7776
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> It'd be good to use BM25 as default {{Similarity}} for KNN classifier.
> Tests on the _20newsgroups_ dataset resulted in 
> improved _f1_ (between 0.10 and 0.15).






[jira] [Resolved] (LUCENE-7776) Switch KNN classifier to use BM25 similarity

2017-04-11 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7776.
-
Resolution: Fixed

> Switch KNN classifier to use BM25 similarity
> 
>
> Key: LUCENE-7776
> URL: https://issues.apache.org/jira/browse/LUCENE-7776
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> It'd be good to use BM25 as default {{Similarity}} for KNN classifier.
> Tests on the _20newsgroups_ dataset resulted in 
> improved _f1_ (between 0.10 and 0.15).






[jira] [Created] (LUCENE-7776) Switch KNN classifier to use BM25 similarity

2017-04-11 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7776:
---

 Summary: Switch KNN classifier to use BM25 similarity
 Key: LUCENE-7776
 URL: https://issues.apache.org/jira/browse/LUCENE-7776
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/classification
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
 Fix For: master (7.0)


It'd be good to use BM25 as default {{Similarity}} for KNN classifier.
Tests on the _20newsgroups_ dataset resulted in improved 
_f1_ (between 0.10 and 0.15).
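
For readers unfamiliar with BM25, a small sketch of the classic per-term BM25 score, assuming the commonly used defaults k1 = 1.2 and b = 0.75. The class and method names are hypothetical; this illustrates the formula only and is not the Lucene implementation or the classifier change itself.

```java
final class Bm25Sketch {
    static final double K1 = 1.2;  // term frequency saturation (assumed default)
    static final double B = 0.75;  // document length normalization (assumed default)

    // Inverse document frequency: rarer terms get a larger weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Classic BM25 score of one term in one document.
    static double score(double tf, double docLen, double avgDocLen, long docFreq, long docCount) {
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * tf * (K1 + 1.0) / (tf + norm);
    }
}
```

Unlike raw TF/IDF, the score saturates as tf grows and is damped for documents longer than average, which is one plausible reason for the f1 difference reported above.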






[jira] [Resolved] (LUCENE-6853) Boolean perceptron classifier is too sensitive to threshold

2017-04-07 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-6853.
-
   Resolution: Fixed
Fix Version/s: master (7.0)  (was: 6.0)

> Boolean perceptron classifier is too sensitive to threshold
> ---
>
> Key: LUCENE-6853
> URL: https://issues.apache.org/jira/browse/LUCENE-6853
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Affects Versions: 4.10.4, 5.3
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> {{BooleanPerceptronClassifier}} is too sensitive to the value of its 
> {{threshold}}, that should be weighted and adjusted against the classifier 
> inputs instead.






[jira] [Created] (SOLR-10318) Make sure Solr UIMA example configuration is working

2017-03-20 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created SOLR-10318:
--

 Summary: Make sure Solr UIMA example configuration is working
 Key: SOLR-10318
 URL: https://issues.apache.org/jira/browse/SOLR-10318
 Project: Solr
  Issue Type: Task
  Security Level: Public (Default Security Level. Issues are Public)
  Components: contrib - UIMA
Reporter: Tommaso Teofili


Current Solr UIMA example is using a configuration which involves outdated 
annotators, that should be adjusted in order to avoid confusion for end users 
when looking at the documentation.






[jira] [Commented] (LUCENE-7274) Add LogisticRegressionDocumentClassifier

2017-02-02 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849959#comment-15849959
 ] 

Tommaso Teofili commented on LUCENE-7274:
-

+1 thanks [~caomanhdat].

> Add LogisticRegressionDocumentClassifier
> 
>
> Key: LUCENE-7274
> URL: https://issues.apache.org/jira/browse/LUCENE-7274
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-7274.patch
>
>
> Add LogisticRegressionDocumentClassifier for Lucene.






[jira] [Resolved] (SOLR-5826) Request caching SolrServer

2017-01-23 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved SOLR-5826.
---
   Resolution: Won't Fix
Fix Version/s: (was: 6.0)

I think there's no point in going forward with this patch; given the limited 
feedback, there seems to be no need for it.

> Request caching SolrServer
> --
>
> Key: SOLR-5826
> URL: https://issues.apache.org/jira/browse/SOLR-5826
> Project: Solr
>  Issue Type: New Feature
>  Components: clients - java
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Attachments: SOLR-5826.patch
>
>
> As stated in http://markmail.org/thread/a477kyxsp5xrusdu there are scenarios 
> where an application communicating with Solr must not lose requests 
> (especially update/indexing requests) that may fail because the Solr 
> instance / cluster is not reachable for some time.
> For such scenarios it may be helpful to have a wrapping SolrServer which can 
> cache (in a FIFO queue, so that they get executed in order) requests when the 
> Solr endpoint is not reachable and execute them as soon as it's reachable 
> again.
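
A minimal sketch of the FIFO idea described above, in plain Java. It is not 
the attached patch and does not use the SolrJ API: {{RequestSender}} and 
{{BufferingSender}} are hypothetical names, and a request is modelled as a 
plain String purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a Solr client; not part of SolrJ. */
interface RequestSender {
  /** Sends one request; throws if the endpoint is unreachable. */
  void send(String request) throws Exception;
}

/** Buffers requests in FIFO order while the endpoint is down and replays them later. */
final class BufferingSender {
  private final RequestSender delegate;
  private final Deque<String> pending = new ArrayDeque<>();

  BufferingSender(RequestSender delegate) {
    this.delegate = delegate;
  }

  /** Queues the request, then tries to drain the queue in arrival order. */
  synchronized void submit(String request) {
    pending.addLast(request);
    flush();
  }

  /** Sends queued requests oldest-first; stops at the first failure and keeps the rest. */
  synchronized void flush() {
    while (!pending.isEmpty()) {
      try {
        delegate.send(pending.peekFirst());
      } catch (Exception e) {
        return; // endpoint still unreachable: keep the request for a later retry
      }
      pending.removeFirst(); // remove only after a successful send
    }
  }

  /** Number of requests still waiting to be sent. */
  synchronized int pendingCount() {
    return pending.size();
  }
}
```

A request is removed from the queue only after it has been sent, so ordering 
is preserved across outages. The queue in this sketch is unbounded and lives 
in memory only; a real wrapper would probably need a size limit or persistence.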



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7274) Add LogisticRegressionDocumentClassifier

2017-01-23 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15834076#comment-15834076
 ] 

Tommaso Teofili commented on LUCENE-7274:
-

[~caomanhdat] would you have time to look into the above points?

> Add LogisticRegressionDocumentClassifier
> 
>
> Key: LUCENE-7274
> URL: https://issues.apache.org/jira/browse/LUCENE-7274
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-7274.patch
>
>
> Add LogisticRegressionDocumentClassifier for Lucene.






[jira] [Resolved] (LUCENE-7591) Let DatasetSplitter approximate no. of class values by no. of terms

2016-12-12 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7591.
-
Resolution: Fixed

> Let DatasetSplitter approximate no. of class values by no. of terms
> ---
>
> Key: LUCENE-7591
> URL: https://issues.apache.org/jira/browse/LUCENE-7591
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0)
>
>
> Currently {{DatasetSplitter}} throws an exception if it's not possible to 
> find {{SortedDocValues}} or {{SortedSetDocValues}} on the class field as it 
> wouldn't be possible to correctly split the indexes in a balanced way.
> As a fallback we could instead use the no. of terms per leaf reader as an 
> approximate count (upper bound) to the no. of classes.
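
To illustrate why a per-leaf term count is only an upper bound, here is a 
small plain-Java sketch. It does not use the Lucene API: each leaf reader is 
modelled as a set of class-field terms, and summing the per-leaf counts is an 
assumption about how such counts could be combined, not a description of the 
committed change. {{ClassCountEstimate}} is a hypothetical name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ClassCountEstimate {
  /** Sum of per-leaf distinct term counts; never smaller than the true number of classes. */
  static long upperBound(List<Set<String>> termsPerLeaf) {
    long sum = 0;
    for (Set<String> terms : termsPerLeaf) {
      sum += terms.size();
    }
    return sum;
  }

  /** Exact number of distinct classes across all leaves, shown only for comparison. */
  static int exact(List<Set<String>> termsPerLeaf) {
    Set<String> all = new HashSet<>();
    for (Set<String> terms : termsPerLeaf) {
      all.addAll(terms);
    }
    return all.size();
  }
}
```

A class value that occurs in several leaves is counted once per leaf, which 
is why the sum can exceed the exact count but can never fall below it.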






[jira] [Created] (LUCENE-7591) Let DatasetSplitter approximate no. of class values by no. of terms

2016-12-12 Thread Tommaso Teofili (JIRA)
Tommaso Teofili created LUCENE-7591:
---

 Summary: Let DatasetSplitter approximate no. of class values by 
no. of terms
 Key: LUCENE-7591
 URL: https://issues.apache.org/jira/browse/LUCENE-7591
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/classification
Reporter: Tommaso Teofili
Assignee: Tommaso Teofili
 Fix For: master (7.0)


Currently {{DatasetSplitter}} throws an exception if it's not possible to find 
{{SortedDocValues}} or {{SortedSetDocValues}} on the class field as it wouldn't 
be possible to correctly split the indexes in a balanced way.
As a fallback we could instead use the no. of terms per leaf reader as an 
approximate count (upper bound) to the no. of classes.






[jira] [Assigned] (LUCENE-5317) Concordance/Key Word In Context (KWIC) capability

2016-12-07 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-5317:
---

Assignee: Tommaso Teofili

> Concordance/Key Word In Context (KWIC) capability
> -
>
> Key: LUCENE-5317
> URL: https://issues.apache.org/jira/browse/LUCENE-5317
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/search
>Affects Versions: 4.5
>Reporter: Tim Allison
>Assignee: Tommaso Teofili
>  Labels: patch
> Attachments: LUCENE-5317.patch, LUCENE-5317.patch, 
> concordance_v1.patch.gz, lucene5317v1.patch, lucene5317v2.patch
>
>
> This patch enables a Lucene-powered concordance search capability.
> Concordances are extremely useful for linguists, lawyers and other analysts 
> performing analytic search vs. traditional snippeting/document retrieval 
> tasks.  By "analytic search," I mean that the user wants to browse every time 
> a term appears (or at least the topn)  in a subset of documents and see the 
> words before and after.  
> Concordance technology is far simpler and less interesting than IR relevance 
> models/methods, but it can be extremely useful for some use cases.
> Traditional concordance sort orders are available (sort on words before the 
> target, words after, target then words before and target then words after).
> Under the hood, this is running SpanQuery's getSpans() and reanalyzing to 
> obtain character offsets.  There is plenty of room for optimizations and 
> refactoring.
> Many thanks to my colleague, Jason Robinson, for input on the design of this 
> patch.
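
For readers unfamiliar with concordances, the following plain-Java sketch 
shows the basic keyword-in-context idea and one of the traditional sort 
orders (words after the target). It scans whitespace-separated tokens in 
memory and is not the approach of the patch, which according to the 
description runs SpanQuery's getSpans() and reanalyzes to obtain offsets. 
The names {{Kwic}} and {{Window}} are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class Kwic {
  /** One hit: the words before the target, the target itself, and the words after. */
  static final class Window {
    final String before;
    final String target;
    final String after;

    Window(String before, String target, String after) {
      this.before = before;
      this.target = target;
      this.after = after;
    }

    @Override
    public String toString() {
      return before + " [" + target + "] " + after;
    }
  }

  /** Returns every occurrence of target with up to width words of context on each side. */
  static List<Window> concordance(String text, String target, int width) {
    String[] tokens = text.trim().split("\\s+");
    List<Window> hits = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
      if (!tokens[i].equalsIgnoreCase(target)) {
        continue;
      }
      int from = Math.max(0, i - width);
      int to = Math.min(tokens.length, i + 1 + width);
      hits.add(new Window(
          String.join(" ", Arrays.copyOfRange(tokens, from, i)),
          tokens[i],
          String.join(" ", Arrays.copyOfRange(tokens, i + 1, to))));
    }
    return hits;
  }

  /** Returns a copy of the hits sorted by the words that follow the target. */
  static List<Window> sortedByWordsAfter(List<Window> hits) {
    List<Window> sorted = new ArrayList<>(hits);
    sorted.sort(Comparator.comparing((Window w) -> w.after));
    return sorted;
  }
}
```

Sorting on the words before the target, or on the target and then its 
context, only requires a different comparator over the same windows.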






[jira] [Resolved] (SOLR-8871) Classification Update Request Processor Improvements

2016-12-05 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved SOLR-8871.
---
   Resolution: Fixed
Fix Version/s: 6.4

> Classification Update Request Processor Improvements
> 
>
> Key: SOLR-8871
> URL: https://issues.apache.org/jira/browse/SOLR-8871
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 6.1
>Reporter: Alessandro Benedetti
>Assignee: Tommaso Teofili
>  Labels: classification, classifier, update, update.chain
> Fix For: 6.4
>
> Attachments: SOLR_8871.patch, SOLR_8871_UIMA_processor_test_fix.patch
>
>
> This task will group a set of modifications to the classification update 
> request processor (and the Lucene classification module), based on user 
> feedback (thanks [~teofili] and Александър Цветанов):
> - include boosting support for inputFields in the solrconfig.xml for the 
> classification update request processor
> e.g.
> field1^2, field2^5 ...
> - multi-class assignment (introduce a parameter, default 1, for the max 
> number of classes to assign)
> - split the classField into:
> classTrainingField
> classOutputField
> When classOutputField is not defined, it defaults to classTrainingField.
> - add support for the classification query, to use only a subset of the 
> entire index to classify.
> - Improve Related Tests
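
As an illustration of the inputFields boost syntax and the classOutputField 
default listed above, here is a small plain-Java sketch. It is not the code 
from the patch: {{InputFieldBoosts}} is a hypothetical helper, and treating a 
field without a boost as 1.0 is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class InputFieldBoosts {
  /** Parses "field1^2, field2^5" into an ordered field-to-boost map; no boost means 1.0 (assumed). */
  static Map<String, Float> parse(String spec) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (String entry : spec.split(",")) {
      String trimmed = entry.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate stray commas
      }
      int caret = trimmed.lastIndexOf('^');
      if (caret < 0) {
        boosts.put(trimmed, 1.0f);
      } else {
        String field = trimmed.substring(0, caret).trim();
        float boost = Float.parseFloat(trimmed.substring(caret + 1).trim());
        boosts.put(field, boost);
      }
    }
    return boosts;
  }

  /** Falls back to classTrainingField when classOutputField is not configured. */
  static String outputField(String classTrainingField, String classOutputField) {
    if (classOutputField == null || classOutputField.isEmpty()) {
      return classTrainingField;
    }
    return classOutputField;
  }
}
```

A malformed boost such as "field1^abc" makes {{Float.parseFloat}} throw a 
NumberFormatException in this sketch; real configuration parsing would likely 
report such errors more helpfully.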






[jira] [Updated] (LUCENE-7350) Let classifiers be constructed from IndexReaders

2016-12-05 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili updated LUCENE-7350:

Fix Version/s: 6.4

> Let classifiers be constructed from IndexReaders
> 
>
> Key: LUCENE-7350
> URL: https://issues.apache.org/jira/browse/LUCENE-7350
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: master (7.0), 6.4
>
>
> Current {{Classifier}} implementations are built from {{LeafReaders}}; this 
> is a legacy of using certain Lucene 4.x {{AtomicReader}}-specific APIs. 
> This is no longer required, as current implementations rely only on 
> {{IndexReader}} APIs. It therefore makes more sense to take an 
> {{IndexReader}} as the constructor parameter: a {{LeafReader}} gives no 
> additional benefit, yet it forces client code to deal with classifiers that 
> are tied to segments (which doesn't make much sense).






[jira] [Comment Edited] (LUCENE-7466) add axiomatic similarity

2016-11-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15703302#comment-15703302
 ] 

Tommaso Teofili edited comment on LUCENE-7466 at 11/28/16 10:14 PM:


well .. that's weird, I had set it to resolved back on Nov 20th (click on the 
'All' tab), but then when you commented I saw it was still unresolved and 
therefore assumed it was reopened by someone else.
Now it looks resolved because you can close and reopen, but also unresolved as 
per current resolution value ... 


was (Author: teofili):
well .. that's weird, I had set it to resolved back on Nov 20th (click on the 
'All' tab), but then when you commented I saw it was still unresolved and 
therefore assumed it was reopened by someone else.
Now it looksresolved because you can close and reopen, but also unresolved as 
per current resolution value ... 

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
> Fix For: 6.4
>
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Comment Edited] (LUCENE-7466) add axiomatic similarity

2016-11-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15703302#comment-15703302
 ] 

Tommaso Teofili edited comment on LUCENE-7466 at 11/28/16 10:14 PM:


well .. that's weird, I had set it to resolved back on Nov 20th (click on the 
'All' tab), but then when you commented I saw it was still unresolved and 
therefore assumed it was reopened by someone else.
Now it looksresolved because you can close and reopen, but also unresolved as 
per current resolution value ... 


was (Author: teofili):
well .. that's weird, I had set it to resolved back on Nov 20th (click on the 
'All' tab), but then when you commented I saw it was still unresolved and 
therefore assumed it was reopened by someone else.
Now it looks fixed resolved because you can close and reopen, but also 
unresolved as per current resolution value ... 

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
> Fix For: 6.4
>
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Commented] (LUCENE-7466) add axiomatic similarity

2016-11-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15703302#comment-15703302
 ] 

Tommaso Teofili commented on LUCENE-7466:
-

well .. that's weird, I had set it to resolved back on Nov 20th (click on the 
'All' tab), but then when you commented I saw it was still unresolved and 
therefore assumed it was reopened by someone else.
Now it looks fixed resolved because you can close and reopen, but also 
unresolved as per current resolution value ... 

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
> Fix For: 6.4
>
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Commented] (LUCENE-7466) add axiomatic similarity

2016-11-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15702674#comment-15702674
 ] 

Tommaso Teofili commented on LUCENE-7466:
-

sure, thanks Mike.

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
> Fix For: 6.4
>
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Commented] (SOLR-8871) Classification Update Request Processor Improvements

2016-11-28 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701299#comment-15701299
 ] 

Tommaso Teofili commented on SOLR-8871:
---

thanks Alan and Alessandro, I've applied Alessandro's patch which seems to fix 
the mentioned issue.
I've also removed the forbidden API call, as per [~steve_rowe]'s suggestion.

> Classification Update Request Processor Improvements
> 
>
> Key: SOLR-8871
> URL: https://issues.apache.org/jira/browse/SOLR-8871
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 6.1
>Reporter: Alessandro Benedetti
>Assignee: Tommaso Teofili
>  Labels: classification, classifier, update, update.chain
> Attachments: SOLR_8871.patch, SOLR_8871_UIMA_processor_test_fix.patch
>
>
> This task will group a set of modifications to the classification update 
> request processor (and the Lucene classification module), based on user 
> feedback (thanks [~teofili] and Александър Цветанов):
> - include boosting support for inputFields in the solrconfig.xml for the 
> classification update request processor
> e.g.
> field1^2, field2^5 ...
> - multi-class assignment (introduce a parameter, default 1, for the max 
> number of classes to assign)
> - split the classField into:
> classTrainingField
> classOutputField
> When classOutputField is not defined, it defaults to classTrainingField.
> - add support for the classification query, to use only a subset of the 
> entire index to classify.
> - Improve Related Tests






[jira] [Commented] (SOLR-8871) Classification Update Request Processor Improvements

2016-11-24 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15694470#comment-15694470
 ] 

Tommaso Teofili commented on SOLR-8871:
---

I've applied your patch [~alessandro.benedetti], thanks!
I just adjusted a misplaced AL header in a test and added the 
{{SuppressForbidden}} annotation to {{ClassificationUpdateProcessorFactory}} as 
{{String#toUpperCase}} falls into the forbidden APIs bucket. It'd be good to 
remove the uppercase call entirely if possible.
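
For context, the variant that forbidden-API checks usually object to is 
presumably the no-argument {{String#toUpperCase()}}, whose result depends on 
the JVM default locale. A minimal plain-Java sketch of the locale-explicit 
alternative follows; {{UpperCaseDemo}} is a hypothetical name and this is not 
the code that was committed.

```java
import java.util.Locale;

final class UpperCaseDemo {
  /** Locale-independent upper-casing: the same result whatever the JVM default locale is. */
  static String upperRoot(String s) {
    return s.toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    Locale turkish = Locale.forLanguageTag("tr-TR");
    // Under Turkish casing rules 'i' maps to a dotted capital I (U+0130), not to 'I'.
    System.out.println("title".toUpperCase(turkish).equals("TITLE"));
    System.out.println(upperRoot("title").equals("TITLE"));
  }
}
```

Passing {{Locale.ROOT}} keeps the call but makes its behaviour deterministic, 
which is an option if the uppercase call cannot be removed.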

> Classification Update Request Processor Improvements
> 
>
> Key: SOLR-8871
> URL: https://issues.apache.org/jira/browse/SOLR-8871
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Affects Versions: 6.1
>Reporter: Alessandro Benedetti
>Assignee: Tommaso Teofili
>  Labels: classification, classifier, update, update.chain
> Attachments: SOLR_8871.patch
>
>
> This task will group a set of modifications to the classification update 
> request processor (and the Lucene classification module), based on user 
> feedback (thanks [~teofili] and Александър Цветанов):
> - include boosting support for inputFields in the solrconfig.xml for the 
> classification update request processor
> e.g.
> field1^2, field2^5 ...
> - multi-class assignment (introduce a parameter, default 1, for the max 
> number of classes to assign)
> - split the classField into:
> classTrainingField
> classOutputField
> When classOutputField is not defined, it defaults to classTrainingField.
> - add support for the classification query, to use only a subset of the 
> entire index to classify.
> - Improve Related Tests






[jira] [Resolved] (LUCENE-7466) add axiomatic similarity

2016-11-20 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili resolved LUCENE-7466.
-
   Resolution: Fixed
Fix Version/s: 6.4

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
> Fix For: 6.4
>
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Commented] (LUCENE-7466) add axiomatic similarity

2016-11-18 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15678786#comment-15678786
 ] 

Tommaso Teofili commented on LUCENE-7466:
-

thanks [~ypeilin], I've applied your patch (with minor fixes to javadoc and 
unused imports).

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Comment Edited] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671519#comment-15671519
 ] 

Tommaso Teofili edited comment on LUCENE-7466 at 11/16/16 8:38 PM:
---

when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} which fails because the test has a 
wrong String in the _expected.getMessage().contains("...")_ check (note also 
that _testSaneNormValues_ uses {{BM25Similarity}}, I have locally changed it to 
{{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

You can check the different test options on the wiki [Running 
Tests|https://wiki.apache.org/lucene-java/RunningTests]


was (Author: teofili):
when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} which fails because the test has a 
wrong String in the _ expected.getMessage().contains("...")_ check (note also 
that _testSaneNormValues_ uses {{BM25Similarity}}, I have locally changed it to 
{{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

You can check the different test options on the wiki [Running 
Tests|https://wiki.apache.org/lucene-java/RunningTests]

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Comment Edited] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671519#comment-15671519
 ] 

Tommaso Teofili edited comment on LUCENE-7466 at 11/16/16 8:38 PM:
---

when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} which fails because the test has a 
wrong String in the _ expected.getMessage().contains("...")_ check (note also 
that _testSaneNormValues_ uses {{BM25Similarity}}, I have locally changed it to 
{{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

You can check the different test options on the wiki [Running 
Tests|https://wiki.apache.org/lucene-java/RunningTests]


was (Author: teofili):
when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} (note that _testSaneNormValues_ uses 
{{BM25Similarity}}, I have locally changed it to {{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

You can check the different test options on the wiki [Running 
Tests|https://wiki.apache.org/lucene-java/RunningTests]

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Comment Edited] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671519#comment-15671519
 ] 

Tommaso Teofili edited comment on LUCENE-7466 at 11/16/16 8:29 PM:
---

when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} (note that _testSaneNormValues_ uses 
{{BM25Similarity}}, I have locally changed it to {{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

You can check the different test options on the wiki [Running 
Tests|https://wiki.apache.org/lucene-java/RunningTests]


was (Author: teofili):
when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} (note that _testSaneNormValues_ uses 
{{BM25Similarity}}, I have locally changed it to {{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Commented] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671519#comment-15671519
 ] 

Tommaso Teofili commented on LUCENE-7466:
-

when running 'ant clean test' under lucene/core the only error I see is in 
{{TestAxiomaticSimilarity#testIllegalQL}} (note that _testSaneNormValues_ uses 
{{BM25Similarity}}, I have locally changed it to {{AxiomaticF2EXP}}).
Other than that it seems the {{TestAxiomaticSimilarity}} actually tests only 
{{AxiomaticF2EXP}}, shouldn't it also test the other {{Axiomatic}} extensions? 

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are in total six similarity models. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthy to add the models as part of Lucene.






[jira] [Comment Edited] (LUCENE-7274) Add LogisticRegressionDocumentClassifier

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669850#comment-15669850
 ] 

Tommaso Teofili edited comment on LUCENE-7274 at 11/16/16 12:05 PM:


Hi [~caomanhdat], thanks for your patch.
A couple of comments:
- I think it'd be good if we could make it a {{LogisticRegressionClassifier}} 
and then extend it into a {{LogisticRegressionDocumentClassifier}} (like for 
{{KNearestNeighbourClassifier}}).
- IIUC this implementation assumes each feature is stored in a separate field 
and the weights are computed externally as a _double[]_; can this work, for 
example, with Solr's capabilities to store AI models?
- regarding the labels, wouldn't it be better to declare the classifier as a 
{{Classifier}} (it's a binary classifier in the end)?
- the changes to NumericDocValues, FloatDocValues and DoubleDocValues break 
some lucene/core tests as it seems your patched NumericDocValues always returns 
a Long while FloatDV and DoubleDV convert such a Long value to an Integer and 
then back to a Float / Double using Float.intBitsToFloat / 
Double.intBitsToDouble, can you clarify if / why that is needed ?
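
To make the last point concrete, here is a plain-Java sketch of how float and 
double values can round-trip through a long, which is the kind of value a 
numeric doc value holds. It is only an illustration and does not reproduce 
the patched classes; {{DocValueBits}} is a hypothetical name. Note that in 
the JDK the 64-bit counterpart of {{Float.intBitsToFloat}} is 
{{Double.longBitsToDouble}}.

```java
final class DocValueBits {
  /** Stores the 32 raw bits of a float in a long. */
  static long encodeFloat(float value) {
    return Float.floatToRawIntBits(value);
  }

  /** Recovers the float: the narrowing cast keeps exactly the 32 bits that were stored. */
  static float decodeFloat(long stored) {
    return Float.intBitsToFloat((int) stored);
  }

  /** Stores the 64 raw bits of a double in a long. */
  static long encodeDouble(double value) {
    return Double.doubleToRawLongBits(value);
  }

  /** Recovers the double; no narrowing is involved because a double needs all 64 bits. */
  static double decodeDouble(long stored) {
    return Double.longBitsToDouble(stored);
  }
}
```

Narrowing the long to an int is lossless only for values that were encoded 
from a float; applying it to an encoded double would discard half of its bits.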


was (Author: teofili):
Hi [~caomanhdat], thanks for your patch.
A couple of comments:
- I think it'd be good if we could make it a {{LogisticRegressionClassifier}} 
and then extend it into a {{LogisticRegressionDocumentClassifier}} (like for 
{{KNearestNeighbourClassifier}}).
- IIUC this implementation assumes each feature is stored in a separate field 
and the weights are computed externally as a _double[]_; can this work, for 
example, with Solr's capabilities to store AI models?
- regarding the labels, wouldn't it be better to declare the classifier as a 
{{Classifier}} (it's a binary classifier in the end)?

> Add LogisticRegressionDocumentClassifier
> 
>
> Key: LUCENE-7274
> URL: https://issues.apache.org/jira/browse/LUCENE-7274
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-7274.patch
>
>
> Add LogisticRegressionDocumentClassifier for Lucene.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670273#comment-15670273
 ] 

Tommaso Teofili commented on LUCENE-7466:
-

Sorry for the confusion: please disregard the NumericDocValues-related 
comment, which came from a leftover patch I had applied locally.
So it would just be good to have some tests for the axiomatic similarities; 
everything else looks good to me.

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are six similarity models in total. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthwhile to add the models to Lucene.






[jira] [Commented] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15670066#comment-15670066
 ] 

Tommaso Teofili commented on LUCENE-7466:
-

Thanks [~ypeilin] for your patch, here are a couple of comments:
- I think a test case for all the added models should be provided in order to 
make sure they work as expected
- the changes to {{NumericDocValues}}, {{FloatDocValues}} and 
{{DoubleDocValues}} break some tests, as it seems NDV always returns a _Long_ 
while FDV and DDV convert such a _Long_ value to an _Integer_ and then back to 
a _Float_ / _Double_ using _Float.intBitsToFloat_ / _Double.intBitsToDouble_. 
Can you clarify if / why that is needed for axiomatic similarity? (If I remove 
the mentioned changes all the tests pass, but then I'm not sure whether that has 
an impact on the axiomatic similarities because of the missing tests.)
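
As background on the conversion being questioned: numeric doc values are exposed as a long per document, and float / double values are, as far as I know, stored as their raw IEEE 754 bits. The JDK-only sketch below shows that round trip. The class and method names are hypothetical. Note that the JDK method for doubles is Double.longBitsToDouble, so the last helper only illustrates what is lost if a double's bits are narrowed to an int on the way.

```java
// Hypothetical sketch using only JDK calls; not code from any Lucene patch.
final class DocValueBitsSketch {

  /** Float -> long: the 32 raw bits are widened to a long for storage. */
  static long encodeFloat(float value) {
    return Float.floatToRawIntBits(value);
  }

  /** long -> float: narrow back to the 32 stored bits, then reinterpret. */
  static float decodeFloat(long stored) {
    return Float.intBitsToFloat((int) stored);
  }

  /** Double -> long: all 64 raw bits fit in the long as they are. */
  static long encodeDouble(double value) {
    return Double.doubleToRawLongBits(value);
  }

  /** long -> double: reinterpret all 64 bits. */
  static double decodeDouble(long stored) {
    return Double.longBitsToDouble(stored);
  }

  /** Lossy variant: narrowing to int first discards the upper 32 bits. */
  static double decodeDoubleViaInt(long stored) {
    return Double.longBitsToDouble((int) stored);
  }

  public static void main(String[] args) {
    System.out.println(decodeFloat(encodeFloat(-2.25f)));
    System.out.println(decodeDouble(encodeDouble(Math.PI)));
    System.out.println(decodeDoubleViaInt(encodeDouble(1.5)));
  }
}
```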

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are six similarity models in total. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthwhile to add the models to Lucene.






[jira] [Commented] (LUCENE-7274) Add LogisticRegressionDocumentClassifier

2016-11-16 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669850#comment-15669850
 ] 

Tommaso Teofili commented on LUCENE-7274:
-

Hi [~caomanhdat], thanks for your patch.
A couple of comments:
- I think it'd be good if we could make it a {{LogisticRegressionClassifier}} 
and then extend it into a {{LogisticRegressionDocumentClassifier}} (like for 
{{KNearestNeighbourClassifier}}).
- IIUTC this implementation assumes each feature is stored in a separate field 
and the weights are computed externally as a _double[]_. Can this work, for 
example, with Solr's capabilities to store AI models?
- regarding the labels, wouldn't it be better to declare the classifier as a 
{{Classifier}} (it's a binary classifier in the end)?

> Add LogisticRegressionDocumentClassifier
> 
>
> Key: LUCENE-7274
> URL: https://issues.apache.org/jira/browse/LUCENE-7274
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-7274.patch
>
>
> Add LogisticRegressionDocumentClassifier for Lucene.






[jira] [Assigned] (LUCENE-7274) Add LogisticRegressionDocumentClassifier

2016-11-16 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-7274:
---

Assignee: Tommaso Teofili

> Add LogisticRegressionDocumentClassifier
> 
>
> Key: LUCENE-7274
> URL: https://issues.apache.org/jira/browse/LUCENE-7274
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/classification
>Reporter: Cao Manh Dat
>Assignee: Tommaso Teofili
> Attachments: LUCENE-7274.patch
>
>
> Add LogisticRegressionDocumentClassifier for Lucene.






[jira] [Assigned] (LUCENE-7466) add axiomatic similarity

2016-11-16 Thread Tommaso Teofili (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tommaso Teofili reassigned LUCENE-7466:
---

Assignee: Tommaso Teofili

> add axiomatic similarity 
> -
>
> Key: LUCENE-7466
> URL: https://issues.apache.org/jira/browse/LUCENE-7466
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (7.0)
>Reporter: Peilin Yang
>Assignee: Tommaso Teofili
>  Labels: patch
>
> Add axiomatic similarity approaches to the similarity family.
> More details can be found at http://dl.acm.org/citation.cfm?id=1076116 and 
> https://www.eecis.udel.edu/~hfang/pubs/sigir05-axiom.pdf
> There are six similarity models in total. All of them are based on BM25, 
> Pivoted Document Length Normalization or Language Model with Dirichlet prior. 
> We think it is worthwhile to add the models to Lucene.






[jira] [Commented] (LUCENE-7560) Can we make QueryBuilder.createFieldQuery un-final?

2016-11-15 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669760#comment-15669760
 ] 

Tommaso Teofili commented on LUCENE-7560:
-

+1

> Can we make QueryBuilder.createFieldQuery un-final?
> ---
>
> Key: LUCENE-7560
> URL: https://issues.apache.org/jira/browse/LUCENE-7560
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>
> It's marked final, I assume because we want people who customize their query 
> parsers to only override {{newXXXQuery}} instead.
> But deeper query parser customization, like consuming a graph and creating 
> a {{TermAutomatonQuery}}, or a union of {{PhraseQuery}}, etc., is not 
> possible today and one must fork the class.
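
The trade-off described in the issue can be sketched in plain Java: a final entry point fixes how text becomes a query, while protected factory hooks remain overridable. The classes below are hypothetical stand-ins that build strings instead of queries. They are not Lucene's QueryBuilder, and the method bodies are assumptions made only for illustration.

```java
import java.util.Locale;

// Hypothetical stand-in for a query builder; it is NOT Lucene's QueryBuilder.
class SketchQueryBuilder {

  /** Final: subclasses cannot change how the text is analysed or dispatched. */
  final String createFieldQuery(String field, String text) {
    String[] tokens = text.trim().split("\\s+");
    if (tokens.length == 1) {
      return newTermQuery(field, tokens[0]);
    }
    return newPhraseQuery(field, tokens);
  }

  /** Overridable hook: customises how a single term is represented. */
  protected String newTermQuery(String field, String term) {
    return field + ":" + term;
  }

  /** Overridable hook: customises how a phrase is represented. */
  protected String newPhraseQuery(String field, String[] terms) {
    return field + ":\"" + String.join(" ", terms) + "\"";
  }

  public static void main(String[] args) {
    System.out.println(new SketchQueryBuilder().createFieldQuery("body", "new york"));
    System.out.println(new UpperCasingBuilder().createFieldQuery("body", "usa"));
  }
}

/** A subclass can tweak the hooks, but not the dispatch logic above. */
class UpperCasingBuilder extends SketchQueryBuilder {
  @Override
  protected String newTermQuery(String field, String term) {
    return super.newTermQuery(field, term.toUpperCase(Locale.ROOT));
  }
}
```

In this sketch a subclass that wants to build something other than a term or phrase from the tokens has no place to do it, which mirrors the limitation the issue raises.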






[jira] [Commented] (LUCENE-6664) Replace SynonymFilter with SynonymGraphFilter

2016-11-15 Thread Tommaso Teofili (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15669749#comment-15669749
 ] 

Tommaso Teofili commented on LUCENE-6664:
-

{quote}
 I'm proposing that we make it possible for query-time position graphs to work 
correctly, so multi-token synonyms are no longer buggy, and I think this is a 
good way to make that happen.
{quote}

+1 

> Replace SynonymFilter with SynonymGraphFilter
> -
>
> Key: LUCENE-6664
> URL: https://issues.apache.org/jira/browse/LUCENE-6664
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Attachments: LUCENE-6664.patch, LUCENE-6664.patch, LUCENE-6664.patch, 
> LUCENE-6664.patch, usa.png, usa_flat.png
>
>
> Spinoff from LUCENE-6582.
> I created a new SynonymGraphFilter (to replace the current buggy
> SynonymFilter), that produces correct graphs (does no "graph
> flattening" itself).  I think this makes it simpler.
> This means you must add the FlattenGraphFilter yourself, if you are
> applying synonyms during indexing.
> Index-time syn expansion is a necessarily "lossy" graph transformation
> when multi-token (input or output) synonyms are applied, because the
> index does not store {{posLength}}, so there will always be phrase
> queries that should match but do not, and then phrase queries that
> should not match but do.
> http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
> goes into detail about this.
> However, with this new SynonymGraphFilter, if instead you do synonym
> expansion at query time (and don't do the flattening), and you use
> TermAutomatonQuery (future: somehow integrated into a query parser),
> or maybe just "enumerate all paths and make union of PhraseQuery", you
> should get 100% correct matches (not sure about "proper" scoring
> though...).
> This new syn filter still cannot consume an arbitrary graph.
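
To make the "enumerate all paths" idea concrete, here is a minimal, Lucene-free sketch. It assumes each token is an arc that leaves a start node and arrives posLength nodes later, and it lists every term sequence from node 0 to a given end node. Each sequence would correspond to one phrase in the union. The class, field and method names are hypothetical, and this is not how SynonymGraphFilter or any query parser is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of path enumeration over a token graph; not Lucene code.
final class TokenGraphPathsSketch {

  /** One arc of the graph: leaves node 'from', arrives at 'from + posLength'. */
  static final class Token {
    final String term;
    final int from;
    final int posLength;

    Token(String term, int from, int posLength) {
      if (posLength < 1) {
        throw new IllegalArgumentException("posLength must be >= 1");
      }
      this.term = term;
      this.from = from;
      this.posLength = posLength;
    }
  }

  /** Returns every term sequence leading from node 0 to endNode. */
  static List<List<String>> enumeratePaths(List<Token> tokens, int endNode) {
    List<List<String>> paths = new ArrayList<>();
    walk(tokens, 0, endNode, new ArrayList<>(), paths);
    return paths;
  }

  /** Depth-first walk; 'prefix' holds the terms on the current path. */
  private static void walk(List<Token> tokens, int node, int endNode,
                           List<String> prefix, List<List<String>> paths) {
    if (node == endNode) {
      paths.add(new ArrayList<>(prefix)); // copy, since prefix is reused
      return;
    }
    for (Token token : tokens) {
      if (token.from == node) {
        prefix.add(token.term);
        walk(tokens, node + token.posLength, endNode, prefix, paths);
        prefix.remove(prefix.size() - 1); // backtrack
      }
    }
  }

  public static void main(String[] args) {
    // "usa" spans four positions, alongside its multi-token synonym.
    List<Token> tokens = Arrays.asList(
        new Token("usa", 0, 4),
        new Token("united", 0, 1),
        new Token("states", 1, 1),
        new Token("of", 2, 1),
        new Token("america", 3, 1));
    System.out.println(enumeratePaths(tokens, 4));
  }
}
```

Since posLength is what tells the walk where the "usa" arc lands, the sketch also shows why dropping it at index time makes the transformation lossy.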





