[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-13 Thread Sebastian Lutze (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13294461#comment-13294461 ]

Sebastian Lutze commented on LUCENE-3440:
-

Hi Koji,

Here's the Solr integration:

https://issues.apache.org/jira/browse/SOLR-3542 



 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
 Issue Type: Improvement
 Components: modules/highlighter
 Reporter: Sebastian Lutze
 Assignee: Koji Sekiguchi
 Priority: Minor
 Labels: FastVectorHighlighter
 Fix For: 4.0, 5.0

 Attachments: LUCENE-3440.patch, LUCENE-3440.patch, LUCENE-3440.patch, 
 LUCENE-3440_3.6.1-SNAPSHOT.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html


 The FastVectorHighlighter assigns an equal weight to every term found in a 
 fragment. As a result, fragments with a high number of words or, in the 
 worst case, a high number of very common words are ranked higher than 
 fragments that contain *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + IDF for unique term per fragment * boost of 
 query; 
 The ranking formula should be the same as, or at least similar to, the one 
 used in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragments-scoring into a 
 separate class.
 - Switch scoring via parameter 
 - Exact phrases should be given an even better score, regardless of whether 
 a phrase-query was executed or not
 - edismax/dismax-parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 
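
 To make the weighting formula above concrete, here is a minimal sketch in 
 plain Java. It is not the code from the patch: the class and method names 
 are hypothetical, and the IDF formula, 1 + ln(numDocs / (docFreq + 1)), is 
 an assumption borrowed from classic TF-IDF scoring. It only contrasts 
 "every term hit counts equally" with "each unique term adds IDF * boost 
 once per fragment".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; not the code from the LUCENE-3440 patch. */
final class FragmentWeightSketch {

    /** Assumed IDF formula: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Equal weighting: every term hit in the fragment adds the query boost. */
    static double equalWeight(List<String> hits, double boost) {
        return hits.size() * boost;
    }

    /** IDF weighting: each distinct term adds IDF * boost once per fragment. */
    static double idfWeight(List<String> hits, Map<String, Integer> docFreq,
                            int numDocs, double boost) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String term : hits) {
            if (seen.add(term)) { // true only the first time a term is seen
                total += idf(docFreq.getOrDefault(term, 0), numDocs) * boost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies in an index of 1000 documents.
        Map<String, Integer> docFreq = new HashMap<>();
        docFreq.put("the", 900);
        docFreq.put("lucene", 10);
        docFreq.put("highlighter", 5);

        List<String> commonWords = Arrays.asList("the", "the", "the", "the");
        List<String> allQueryTerms = Arrays.asList("lucene", "highlighter");

        System.out.println("equal, common words:    " + equalWeight(commonWords, 1.0));
        System.out.println("equal, all query terms: " + equalWeight(allQueryTerms, 1.0));
        System.out.println("idf, common words:      " + idfWeight(commonWords, docFreq, 1000, 1.0));
        System.out.println("idf, all query terms:   " + idfWeight(allQueryTerms, docFreq, 1000, 1.0));
    }
}
```

 With equal weights, the fragment that repeats a very common word ranks 
 first; with IDF weights, the fragment containing both rare query terms 
 does.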

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-12 Thread Koji Sekiguchi (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293653#comment-13293653 ]

Koji Sekiguchi commented on LUCENE-3440:


Hi Sebastian,

I've committed LUCENE-4133.

I'm going to close this issue and mark it as resolved because I think the 
Lucene part has been completed. Can you open a separate issue for the Solr 
part?

This is a great improvement for FVH. I really appreciate what you've done!




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-12 Thread Sebastian Lutze (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293669#comment-13293669 ]

Sebastian Lutze commented on LUCENE-3440:
-

Hi Koji,

bq. I'm going to close this issue and mark it as resolved because I think the 
Lucene part has been completed. 

That's really awesome! 

bq. Can you open a separate issue for Solr part? 

Sure. 

bq. This is a great improvement for FVH. I really appreciate what you've done! 

It was an honor for me! :) 





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-11 Thread Sebastian Lutze (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292822#comment-13292822 ]

Sebastian Lutze commented on LUCENE-3440:
-

Hi Koji,
  
bq. Is the next one the last? 

Almost. :) The next thing would be the Solr integration. 

So, I just realized: trunk is not trunk anymore! 

This one is for branch_4x: 

https://issues.apache.org/jira/browse/LUCENE-4133 

Tests are fine. 





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-05 Thread sebastian L. (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289448#comment-13289448 ]

sebastian L. commented on LUCENE-3440:
--

Hi Koji,

bq. I committed LUCENE-4107 in trunk and branch_4x.

That was fast! ;)

https://issues.apache.org/jira/browse/LUCENE-4113 

This one introduces and maintains an IDF weight for FieldTermStack.TermInfo. 






[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-05 Thread sebastian L. (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289459#comment-13289459 ]

sebastian L. commented on LUCENE-3440:
--

Hi Koji,

I was just wondering about

https://issues.apache.org/jira/browse/LUCENE-2949 




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-05 Thread Koji Sekiguchi (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13289498#comment-13289498 ]

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian,

I committed LUCENE-4113 in trunk and branch_4x. Is the next one the last? :)





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-04 Thread sebastian L. (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288532#comment-13288532 ]

sebastian L. commented on LUCENE-3440:
--

Hi Koji,

bq. I committed LUCENE-4091 in trunk and branch_4x. As for the credit, I will 
give it in CHANGES.txt when committing the main (LUCENE-3440) patch.

Great, here is the next one: 

https://issues.apache.org/jira/browse/LUCENE-4107

This one simply makes FieldFragList abstract and pluggable. Tests were okay.  






[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-06-04 Thread Koji Sekiguchi (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13288625#comment-13288625 ]

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian,

I committed LUCENE-4107 in trunk and branch_4x.




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-05-31 Thread Koji Sekiguchi (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287088#comment-13287088 ]

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian,

I committed LUCENE-4091 in trunk and branch_4x. As for the credit, I will give 
it in CHANGES.txt when committing the main (LUCENE-3440) patch.




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-05-30 Thread sebastian L. (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285787#comment-13285787 ]

sebastian L. commented on LUCENE-3440:
--

Hi Koji, 

bq. This is a great idea and it helps me a lot! If you could provide them one 
by one for trunk, I think I can review the smaller patches and commit them one 
by one.

Okay, let's give it a try. Here is the first one: 

https://issues.apache.org/jira/browse/LUCENE-4091 

This one simply adds getters. Tests were okay.  




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-05-25 Thread sebastian L. (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283467#comment-13283467 ]

sebastian L. commented on LUCENE-3440:
--

Hi Koji, 
hi Simon,

if there is something to do for me, please let me know. 

Maybe it would be better to split the patch into several smaller ones, e.g.

1. Use Getters/Setters where possible in FVH 
2. Make FieldFragList an interface and BaseFieldFragList an abstract class
3. Introduction of SimpleFieldFragList and SimpleFragListBuilder as default  
4. Introduction of WeightedFieldFragList and WeightedFragListBuilder  
5. Integration into Solr
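
Steps 2 to 4 of this split could look roughly like the sketch below. This is 
only an illustration of the intended layering: every type name ending in 
"Sketch" is hypothetical, the signatures are heavily simplified and do not 
match the real FVH classes, and the per-term weights are passed in as a plain 
map rather than computed from an index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Tiny demo entry point for the hypothetical types below. */
final class FragListSketchDemo {
    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("a", 0.5);
        weights.put("b", 2.0);
        FragListSketch list = new WeightedFragListSketch(weights);
        list.add(0, 10, Arrays.asList("a", "a", "b"));
        System.out.println(list.fragments().get(0).score);
    }
}

/** One scored fragment; hypothetical stand-in type. */
final class FragSketch {
    final int start;
    final int end;
    final double score;

    FragSketch(int start, int end, double score) {
        this.start = start;
        this.end = end;
        this.score = score;
    }
}

/** Step 2: the fragment list becomes an interface. */
interface FragListSketch {
    void add(int start, int end, List<String> terms);

    List<FragSketch> fragments();
}

/** Step 2: shared bookkeeping lives in an abstract base class. */
abstract class BaseFragListSketch implements FragListSketch {
    private final List<FragSketch> frags = new ArrayList<>();

    @Override
    public void add(int start, int end, List<String> terms) {
        frags.add(new FragSketch(start, end, score(terms)));
    }

    @Override
    public List<FragSketch> fragments() {
        return frags;
    }

    /** Subclasses differ only in how a fragment is scored. */
    protected abstract double score(List<String> terms);
}

/** Step 3: default variant, every term hit counts equally. */
final class SimpleFragListSketch extends BaseFragListSketch {
    @Override
    protected double score(List<String> terms) {
        return terms.size();
    }
}

/** Step 4: weighted variant, each distinct term adds its weight once. */
final class WeightedFragListSketch extends BaseFragListSketch {
    private final Map<String, Double> weights;

    WeightedFragListSketch(Map<String, Double> weights) {
        this.weights = weights;
    }

    @Override
    protected double score(List<String> terms) {
        double total = 0.0;
        for (String term : new HashSet<>(terms)) {
            total += weights.getOrDefault(term, 0.0);
        }
        return total;
    }
}
```

Keeping the scoring in one overridable method is what would let the default 
and the weighted behaviour be swapped without touching the rest of the 
highlighter.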

When's the 4.0-release scheduled, anyway? 

A patch for trunk 1342490 is on its way. 




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-05-25 Thread Koji Sekiguchi (JIRA)

[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283496#comment-13283496 ]

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian!

bq. Maybe it would be better to split the patch into several smaller ones, e.g.

This is a great idea and it helps me a lot! If you could provide them one by 
one for trunk, I think I can review the smaller patches and commit them one by 
one.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 4.0

 Attachments: LUCENE-3440.patch, LUCENE-3440.patch, LUCENE-3440.patch, 
 LUCENE-3440_3.6.1-SNAPSHOT.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html


 The FastVectorHighlighter gives every term found in a fragment an equal 
 weight. As a result, fragments with a high number of words or, in the worst 
 case, a high number of very common words are ranked higher than fragments 
 that contain *all* of the terms used in the original query. 
 This patch provides ordered fragments with IDF-weighted terms: 
 total weight = total weight + IDF for unique term per fragment * boost of 
 query; 
 The ranking formula should be the same as, or at least similar to, the one 
 used in org.apache.lucene.search.highlight.QueryTermScorer.
 The patch is simple, but it works for us. 
 Some ideas:
 - A better approach would be moving the whole fragment scoring into a 
 separate class.
 - Switch scoring via parameter 
 - Exact phrases should be given an even better score, regardless of whether a 
 phrase query was executed or not
 - edismax/dismax-parameters pf, ps and pf^boost should be observed and 
 corresponding fragments should be ranked higher 
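
The weighting rule in the description can be illustrated with a small, self-contained sketch. It is not code from the patch: the class and method names are hypothetical, and the IDF formula, 1 + ln(numDocs / (docFreq + 1)), is an assumption chosen to resemble Lucene's classic similarity. Each unique term of a fragment contributes its IDF times the query boost exactly once, so repeating a very common term does not raise a fragment's rank.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not code from the LUCENE-3440 patch.
class IdfFragmentWeightSketch {

  // Assumed IDF formula: 1 + ln(numDocs / (docFreq + 1)).
  static double idf(int docFreq, int numDocs) {
    return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
  }

  // total weight = total weight + IDF of each unique term * boost of query
  static double weight(List<String> fragmentTerms, Map<String, Integer> docFreqs,
                       int numDocs, double queryBoost) {
    Set<String> seen = new HashSet<>();
    double total = 0.0;
    for (String term : fragmentTerms) {
      if (seen.add(term)) { // count every term only once per fragment
        total += idf(docFreqs.get(term), numDocs) * queryBoost;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    // Made-up document frequencies for an index of 1000 documents.
    Map<String, Integer> docFreqs = Map.of("the", 999, "lucene", 9);
    double common = weight(List.of("the", "the", "the"), docFreqs, 1000, 1.0);
    double mixed = weight(List.of("the", "lucene"), docFreqs, 1000, 1.0);
    System.out.println("common-word fragment: " + common);
    System.out.println("mixed fragment:       " + mixed);
  }
}
```

With equal per-term weights the first fragment (three hits) would outrank the second (two hits); with the sketch above the order is reversed, which is the behaviour the description asks for.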




[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-05-23 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13281897#comment-13281897
 ] 

Simon Willnauer commented on LUCENE-3440:
-

Koji, do you wanna get this in any time? Now is likely a good time since 4.0 is 
getting close. We won't apply this to 3.6.1 since that is a bugfix only release 
if it is going to happen at all.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 4.0

 Attachments: LUCENE-3440.patch, LUCENE-3440.patch, 
 LUCENE-3440_3.6.1-SNAPSHOT.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2012-05-23 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13282129#comment-13282129
 ] 

Koji Sekiguchi commented on LUCENE-3440:


bq. Koji, do you wanna get this in any time? Now is likely a good time since 
4.0 is getting close.

Hi Simon, thank you for bringing this up! Yes, I do want sebastian's great 
patch to get into 4.0. It has been on my TODO list for a long time, but I 
couldn't find time to look into it deeply. I'm very sorry about that.

If I remember correctly, when I tried the previous patch, I got errors when 
running the tests. Then sebastian fixed them and attached an updated patch. I 
looked into the updated tests, but I couldn't understand them very well at 
that time. Just after that, I was assigned to other work and had no time left 
for this.

Anyway, the idea of this ticket is definitely great and should be committed. So 
can someone take it over?

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 4.0

 Attachments: LUCENE-3440.patch, LUCENE-3440.patch, 
 LUCENE-3440_3.6.1-SNAPSHOT.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128762#comment-13128762
 ] 

sebastian L. commented on LUCENE-3440:
--

Hi Koji, the patch doesn't work because of 
https://issues.apache.org/jira/browse/LUCENE-3513.

bq. And I found a lot of test errors...

Frankly, I didn't run the tests because I thought the changes provided with the 
last patch shouldn't affect the original behavior. 
I'll have a look into it. But this may take some time, due to the fact that I 
have no knowledge about the test-framework.  

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128795#comment-13128795
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian,

{quote}
Frankly, I didn't run the tests because I thought the changes provided with the 
last patch shouldn't affect the original behavior.
I'll have a look into it. But this may take some time, due to the fact that I 
have no knowledge about the test-framework. 
{quote}

Ok, no problem. I'll look at the test cases (hopefully next week or so). But 
can you take care of the following so we can move forward?

{quote}
Ah, sebastian, I think you needed to check Grant license to ASF for inclusion 
in ASF works when you attach your patch. Can you remove the latest patches and 
reattach them with that flag? Thanks!
{quote}


 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128799#comment-13128799
 ] 

Koji Sekiguchi commented on LUCENE-3440:


I've removed my latest patch. It had the ASF license grant flag set, but that 
was not right because it was entirely based on sebastian's patch, which had 
not been granted to the ASF.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-17 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128802#comment-13128802
 ] 

sebastian L. commented on LUCENE-3440:
--

bq. Ah, sebastian, I think you needed to check Grant license to ASF for 
inclusion in ASF works when you attach your patch. Can you remove the latest 
patches and reattach them with that flag? Thanks!

Sorry, I forgot that. Done.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, LUCENE-4.0-SNAPSHOT-3440-9.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-14 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128022#comment-13128022
 ] 

Koji Sekiguchi commented on LUCENE-3440:


In the latest patch, FieldFragList becomes an interface, and an abstract class 
BaseFieldFragList, which implements the interface, is introduced. But I think 
it is strange that the javadoc of the add() method says that the interface 
depends on FieldFragInfo, which is defined in the abstract class.

{code}
* convert the list of FieldPhraseInfo to FieldFragInfo, then add it to the 
fragInfos
{code}

How about just changing FieldFragList to an abstract class and not introducing 
BaseFieldFragList?
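
The suggestion above can be sketched as follows. This is a simplified, hypothetical stand-in and not the patch itself: the class names are invented, and a plain list of boosts replaces the real phrase-info objects. The point is only that a single abstract class declares add() and also owns the nested info type that add() produces, so the javadoc no longer has to refer to a type defined somewhere else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the design discussed above.
abstract class FragListSketch {

  // The type that add() produces lives in the same class that declares add().
  static final class FragInfo {
    final int startOffset;
    final int endOffset;
    final float score;

    FragInfo(int startOffset, int endOffset, float score) {
      this.startOffset = startOffset;
      this.endOffset = endOffset;
      this.score = score;
    }
  }

  private final List<FragInfo> fragInfos = new ArrayList<>();

  List<FragInfo> getFragInfos() {
    return fragInfos;
  }

  // Convert the phrase boosts of one fragment into a FragInfo and store it.
  abstract void add(int startOffset, int endOffset, List<Float> phraseBoosts);
}

// Default behaviour: every phrase contributes its boost to the fragment score.
class SimpleFragListSketch extends FragListSketch {
  @Override
  void add(int startOffset, int endOffset, List<Float> phraseBoosts) {
    float score = 0;
    for (float boost : phraseBoosts) {
      score += boost;
    }
    getFragInfos().add(new FragInfo(startOffset, endOffset, score));
  }
}
```

A different scoring scheme would then be another subclass that overrides add(), without a separate interface in between.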

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-14 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128026#comment-13128026
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Ah, sebastian, I think you need to check "Grant license to ASF for inclusion 
in ASF works" when you attach your patch. Can you remove the latest patches 
and reattach them with that flag? Thanks!

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-14 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13128049#comment-13128049
 ] 

Koji Sekiguchi commented on LUCENE-3440:


And I found a lot of test errors...

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-3440.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-11 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13125074#comment-13125074
 ] 

sebastian L. commented on LUCENE-3440:
--

Okay, here we go again. 

This patch contains:

- Fixed docs
- Fixed test cases 

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-7.patch, 
 LUCENE-3.5-SNAPSHOT-3440-8.patch, LUCENE-4.0-SNAPSHOT-3440-7.patch, 
 LUCENE-4.0-SNAPSHOT-3440-9.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-06 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13122474#comment-13122474
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Very nice progress, thanks! I think this is close to being ready to commit. The 
following still must be done:

# update description and figures of the package javadoc ( 
https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description
 )
# update the test cases; currently they do not compile.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-7.patch, 
 LUCENE-4.0-SNAPSHOT-3440-7.patch, weight-vs-boost_table01.html, 
 weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-04 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120049#comment-13120049
 ] 

sebastian L. commented on LUCENE-3440:
--

Another patch for 4.0. This one makes FieldFragList pluggable.  

This patch contains:
- Introduction of interface FieldFragList
- Introduction of abstract class BaseFieldFragList which contains SubInfo and 
FieldFragInfo (I renamed WeightedFragInfo)
- Introduction of class SimpleFieldFragList (default)
- Introduction of class WeightedFieldFragList
- Introduction of abstract class BaseFragListBuilder
- Introduction of class SimpleFragListBuilder (default)
- Introduction of class WeightedFragListBuilder 

The weighting-formula now depends on the implementation of 
FieldFragList.add(int startOffset, int endOffset, List<FieldPhraseInfo> 
phraseInfoList):

{code:java}
  /* (non-Javadoc)
   * @see org.apache.lucene.search.vectorhighlight.FieldFragList#getFragInfos()
   */ 
  @Override
  public void add( int startOffset, int endOffset, List<FieldPhraseInfo> phraseInfoList ) {
    float score = 0;
    List<SubInfo> subInfos = new ArrayList<SubInfo>();
    for( FieldPhraseInfo phraseInfo : phraseInfoList ){
      subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffset(), phraseInfo.getSeqnum() ) );
      score += phraseInfo.getBoost();
    }
    getFragInfos().add( new FieldFragInfo( startOffset, endOffset, subInfos, score ) );
  }
{code}

The chosen FieldFragList depends on FragListBuilder.createFieldFragList( 
FieldPhraseList fieldPhraseList, int fragCharSize ):

{code:java}
  /* (non-Javadoc)
   * @see org.apache.lucene.search.vectorhighlight.FragListBuilder#createFieldFragList(FieldPhraseList fieldPhraseList, int fragCharSize)
   */ 
  @Override
  public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
    return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
  } 
{code}
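
To illustrate how the builder decides which scoring is applied, here is a minimal sketch of the arrangement described above. All names are hypothetical and the types are heavily simplified: fragments are plain lists of matched terms, the "weighted" list only counts distinct terms instead of applying IDF, and a factory method stands in for the patch's approach of passing a new list instance to a shared createFieldFragList overload.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins; only the shape of the design follows the comment above.
abstract class ScoredFragList {
  final List<Float> scores = new ArrayList<>();

  // Each implementation decides how the matched terms of one fragment are scored.
  abstract void add(List<String> matchedTerms);
}

// Counts every occurrence, like the equal-weight default.
class PerOccurrenceFragList extends ScoredFragList {
  @Override
  void add(List<String> matchedTerms) {
    scores.add((float) matchedTerms.size());
  }
}

// Counts each distinct term once (a real weighted list would add IDF here).
class PerUniqueTermFragList extends ScoredFragList {
  @Override
  void add(List<String> matchedTerms) {
    Set<String> unique = new HashSet<>(matchedTerms);
    scores.add((float) unique.size());
  }
}

// The base builder owns the fragmenting loop; subclasses only choose the list type.
abstract class FragListBuilderSketch {
  abstract ScoredFragList newFragList();

  ScoredFragList build(List<List<String>> fragments) {
    ScoredFragList fragList = newFragList();
    for (List<String> fragment : fragments) {
      fragList.add(fragment);
    }
    return fragList;
  }
}

class PerOccurrenceBuilder extends FragListBuilderSketch {
  @Override
  ScoredFragList newFragList() {
    return new PerOccurrenceFragList();
  }
}

class PerUniqueTermBuilder extends FragListBuilderSketch {
  @Override
  ScoredFragList newFragList() {
    return new PerUniqueTermFragList();
  }
}
```

Swapping the builder is then the only change a caller has to make to switch the scoring, which is what the configuration by builder name relies on.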

Of course, Solr-config could look like this:

{code:xml}
<highlighter>
 <fragListBuilder name="simple" class="org.apache.solr.highlight.SimpleFragListBuilder"/>
 <fragListBuilder name="weighted" class="org.apache.solr.highlight.WeightedFragListBuilder" default="true"/>
 <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
</highlighter>
{code}

I think this is the best possible approach, because it maintains 
backwards-compatibility, but also does some refactoring which 
would/could/should/can make it easier to plug in different approaches in the 
future. 

But, after a few weeks of banging my head against the wall I have to admit: I 
have no idea. ;) 


 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, 
 LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, 
 WeightOrderFragmentsBuilder_table01.html, 
 WeightOrderFragmentsBuilder_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-04 Thread Koji Sekiguchi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13120143#comment-13120143
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Hi sebastian, thank you for the continuous work on this! I'd like to take a 
look at them this week.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-4.0-SNAPSHOT-3440-7.patch, 
 weight-vs-boost_table01.html, weight-vs-boost_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-01 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118784#comment-13118784
 ] 

sebastian L. commented on LUCENE-3440:
--

Here's the patch for 4.0. I forgot to update my Solr-plugin-lib to 
4.0-SNAPSHOT.  

Another patch, another idea! :)

Some thoughts: 
- With the last patch, sum-of-distinct-weights will be calculated anyhow, even 
if ScoreOrderFragmentsBuilder is used. 
- Also regardless of further calculations, FieldTermsStack retrieves document 
frequency for each term from IndexReader in any case.
- Solr-Developers have no chance to implement a FragmentsBuilder-plugin with 
their custom-scoring for fragments, because the weighting-formula is 
hard-coded in WeightedFragInfo. BTW, that's the reason I started to work on 
this patch anyway.   

Possible Solution:

1. Collect and pass all needed information to the 
BaseFragmentsBuilder-implementation 
- Introduction of TermInfo.fieldName
- Introduction of WeightedFragInfo.phraseInfos
- Passing an instance of IndexReader as argument to 
BaseFragmentsBuilder.getWeightedFragInfoList() in order to get the needed 
statistical data from the index

2. Move the calculation of sum-of-boosts to 
ScoreOrderFragmentsBuilder.calculateScore()

{code}
  /**
   * Compute WeightedFragInfo.score based on query-boosts
   * @throws IOException 
   */
  public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{
    for( WeightedFragInfo wfi : weightedFragInfos ){
      for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
        wfi.score += wpi.boost;
      }
    }
    return weightedFragInfos;
  }
{code}

3. Calculation of sum-of-distinct-weights with 
WeightOrderFragmentsBuilder.calculateScore()

- In this patch WeightOrderFragmentsBuilder is a subclass of 
ScoreOrderFragmentsBuilder.
- But I think the introduction of an abstract class OrderedFragmentsBuilder as 
superclass of BoostOrderFragmentsBuilder and WeightOrderFragmentsBuilder would 
be a better strategy.  
- Moving calculateScore() into BaseFragmentsBuilder and making it abstract 
would be another idea. 
- The _sum-of-distinct-weight_-approach is the same as presented in the last 
patch.

{code}
  /**
   * Compute WeightedFragInfo.score based on IDF-weighted terms
   * @throws IOException 
   */
  @Override
  public List<WeightedFragInfo> calculateScore( List<WeightedFragInfo> weightedFragInfos, IndexReader reader ) throws IOException{

    // cache of IDF weights per term text, shared across all fragments
    Map<String, Float> lookup = new HashMap<String, Float>(); 
    // terms already counted in the current fragment
    HashSet<String> distinctTerms = new HashSet<String>();

    // numDocs() already excludes deleted documents
    int numDocs = reader.numDocs();

    int docFreq;
    int length;
    float boost;
    float weight;

    for( WeightedFragInfo wfi : weightedFragInfos ){
      distinctTerms.clear();
      length = 0;
      boost = 0;
      for( WeightedPhraseInfo wpi : wfi.phraseInfos ){
        for( TermInfo ti : wpi.termInfos ) {
          length++;
          if( !distinctTerms.add( ti.text ) ) 
            continue;
          if ( lookup.containsKey( ti.text ) )
            weight = lookup.get( ti.text ).floatValue();
          else {
            docFreq = reader.docFreq( new Term( ti.fieldName, ti.text ) );
            weight = ( float ) ( Math.log( numDocs / ( double ) ( docFreq + 1 ) ) + 1.0 );
            lookup.put( ti.text, new Float( weight ) );
          }
          boost += Math.pow( weight, 2 ) * wpi.boost;
        }
      }
      // length * ( 1 / sqrt( length ) ) scales by the number of terms in the fragment
      wfi.score = ( float ) ( boost * length * ( 1 / Math.sqrt( length ) ) );
    }

    return weightedFragInfos;
  }
{code}

With this approach programmers can easily implement their own fragment 
weighting, simply by overriding calculateScore(). 
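
To illustrate the extension point, here is a minimal sketch under simplifying assumptions: FragInfo, BoostOrderSketch and MaxBoostSketch are made-up stand-ins (they are not classes from the patch), the IndexReader argument is left out, and the "largest boost wins" rule is only an example of a custom scoring.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for WeightedFragInfo: phrase boosts in, score out.
final class FragInfo {
  final List<Float> phraseBoosts;
  float score;
  FragInfo(Float... phraseBoosts) { this.phraseBoosts = Arrays.asList(phraseBoosts); }
}

// Default behaviour, as in the boost-based calculateScore(): sum of boosts.
class BoostOrderSketch {
  List<FragInfo> calculateScore(List<FragInfo> infos) {
    for (FragInfo fi : infos) {
      fi.score = 0;
      for (float b : fi.phraseBoosts) fi.score += b;
    }
    return infos;
  }
}

// A custom scorer only overrides calculateScore(); here the largest boost wins.
final class MaxBoostSketch extends BoostOrderSketch {
  @Override
  List<FragInfo> calculateScore(List<FragInfo> infos) {
    for (FragInfo fi : infos) {
      fi.score = 0;
      for (float b : fi.phraseBoosts) fi.score = Math.max(fi.score, b);
    }
    return infos;
  }
}
```

Sorting and fragment creation would stay in the base class, so a plugin author would touch nothing but the scoring method.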

I think the major drawback of this idea is that the FragmentsBuilder must 
traverse the whole stack of WeightedFragInfo once again. Since we have tomes 
with more than 3000 pages of OCR, this _could_ be a problem, but I can't 
confirm that for sure. One way to avoid this would be to make FieldFragList 
pluggable via an interface FragList, so that the FragmentsBuilder-plugin could 
be parametrized with the intended implementation of FragList:

{code:xml}
<highlighter>
  <fragmentsBuilder name="weight-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
    <fragList class="org.apache.lucene.search.vectorhighlight.WeightedFragList"/>
  </fragmentsBuilder>
  <fragmentsBuilder name="boost-ordered" class="org.apache.solr.highlight.OrderedFragmentsBuilder">
    <fragList class="org.apache.lucene.search.vectorhighlight.BoostedFragList"/>
  </fragmentsBuilder>
</highlighter>
{code}

Further notes:
- As shown in this patch, WeightedFragInfo.totalBoost should be renamed to 
WeightedFragInfo.score.
- As shown in this patch, ScoreOrderFragmentsBuilder should be renamed to 
BoostOrderFragmentsBuilder.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-10-01 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118799#comment-13118799
 ] 

sebastian L. commented on LUCENE-3440:
--

Hm, since FieldFragList is created in 
SimpleFragListBuilder.createFieldFragList(), it should look more like this: 

{code:xml}
<highlighter>
  <fragListBuilder name="simple-boosted" class="org.apache.solr.highlight.SimpleFragListBuilder">
    <fragList name="boosted" class="org.apache.lucene.search.vectorhighlight.BoostedFragList"/>
  </fragListBuilder>
  <fragListBuilder name="simple-weighted" class="org.apache.solr.highlight.SimpleFragListBuilder" default="true">
    <fragList name="weighted" class="org.apache.lucene.search.vectorhighlight.WeightedFragList"/>
  </fragListBuilder>
  <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="true"/>
</highlighter>
{code}


 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, 
 LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-6.patch, 
 WeightOrderFragmentsBuilder_table01.html, 
 WeightOrderFragmentsBuilder_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-30 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118012#comment-13118012
 ] 

sebastian L. commented on LUCENE-3440:
--

Hm, I tried to do that all with trunk but: 

{code:borderStyle=dotted}
29.09.2011 15:43:09 org.apache.solr.common.SolrException log
SEVERE: java.lang.VerifyError: class org.apache.lucene.analysis.ReusableAnalyzerBase overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
    at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:616)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:2733)
    at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:1124)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1612)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1491)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:403)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:407)
    at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:456)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1653)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1647)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1680)
    at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:875)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:574)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:507)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:653)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:407)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:292)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:241)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
    at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4001)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4651)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
    at org.apache.catalina.core.StandardHost.start(StandardHost.java:785)
    at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
    at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:445)
    at org.apache.catalina.core.StandardService.start(StandardService.java:519)
    at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
    at org.apache.catalina.startup.Catalina.start(Catalina.java:581)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
    at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
{code}

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: 

[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-30 Thread sebastian L. (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13118023#comment-13118023
 ] 

sebastian L. commented on LUCENE-3440:
--

*testament*

||Terms in fragment||totalWeight||totalBoost||
|testament testament|1.8171139|2.0|
|testament|1.2848935|1.0|
|testament|1.2848935|1.0|
|testament|1.2848935|1.0|
|testament|1.2848935|1.0|
|testament|1.2848935|1.0|



*das alte testament*

||Terms in fragment||totalWeight||totalBoost||
|das alte testament|5.799069|3.0|
|das alte testament|5.799069|3.0|
|das testament alte|5.799069|3.0|
|das alte testament|5.799069|3.0|
|das testament|2.9178061|2.0|
|das alte|2.9178061|2.0|
|testament testament|1.8171139|2.0|
|das das das das|1.5566137|4.0|
|das das das|1.348067|3.0|
|alte|1.2848935|1.0|
|alte|1.2848935|1.0|
|das das|1.100692|2.0|
|das das|1.100692|2.0|
|das|0.77830684|1.0|
|das|0.77830684|1.0|
|das|0.77830684|1.0|
|das|0.77830684|1.0|
|das|0.77830684|1.0|


Awesome table-formatting!
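
The totalWeight column is consistent with the formula from the earlier patch in this thread: the sum of squared weights of the distinct terms, multiplied by termsPerFrag * ( 1 / sqrt( termsPerFrag ) ). Here is a small sketch to check that, assuming every phrase boost is 1.0 and reading the squared term weights off the single-term rows. TableCheck and its members are names made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TableCheck {
  // Squared per-term weights, read off the single-term rows of the tables.
  static final Map<String, Double> SQUARED_WEIGHT = new HashMap<>();
  static {
    SQUARED_WEIGHT.put("das", 0.77830684);
    SQUARED_WEIGHT.put("alte", 1.2848935);
    SQUARED_WEIGHT.put("testament", 1.2848935);
  }

  // Sum the squared weight of each distinct term, then scale by
  // n * (1 / sqrt(n)) = sqrt(n), where n counts all term occurrences.
  // Only the three terms from the tables are known to this sketch.
  static double totalWeight(String fragmentTerms) {
    String[] terms = fragmentTerms.split(" ");
    Set<String> distinct = new HashSet<>();
    double sum = 0;
    for (String t : terms) {
      if (distinct.add(t)) sum += SQUARED_WEIGHT.get(t);
    }
    return sum * Math.sqrt(terms.length);
  }

  public static void main(String[] args) {
    for (String f : Arrays.asList("das", "das das", "das alte", "das alte testament")) {
      System.out.printf("%-20s %.6f%n", f, totalWeight(f));
    }
  }
}
```

This also shows why "das das das das" ranks below "das alte": repeating a term only adds the sqrt(n) factor, while a new distinct term adds its full squared weight.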

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-6-ProofOfConcept.java, 
 LUCENE-3.5-SNAPSHOT-3440-6.patch, LUCENE-4.0-SNAPSHOT-3440-3.patch, 
 WeightOrderFragmentsBuilder_table01.html, 
 WeightOrderFragmentsBuilder_table02.html





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-25 Thread sebastian L. (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114291#comment-13114291
 ] 

sebastian L. commented on LUCENE-3440:
--

bq. Patch looks great! 

Thanks.  

bq. 1. For the new totalWeight, add getter method and modify toString() in 
WeightedFragInfo().

Okay.

bq. 2. The patch uses hard-coded DefaultSimilarity to calculate idf. I don't 
think that a custom similarity can be used here, too. If so, how about just 
copying idf method rather than creating a similarity object?

I played a little with log((numDocs - docFreq + 0.5) / (docFreq + 0.5)), but 
it seems to make no difference. If I'm not mistaken there is no method 
IndexReader.getSimilarity() or IndexReader.getDefaultSimilarity(). 

Therefore: Okay. 
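
For comparison, the two idf variants mentioned here can be written as plain static methods, which is roughly what "copying the idf method" amounts to. This is a sketch: IdfSketch and its method names are made up, the first formula is the one that appears in the calculateScore() code elsewhere in this thread, and the second is the probabilistic form with the parentheses placed as I assume they were intended.

```java
final class IdfSketch {
  // Form used in calculateScore() in this thread: log(N / (df + 1)) + 1.
  static double classicIdf(int docFreq, int numDocs) {
    return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
  }

  // Probabilistic form: log((N - df + 0.5) / (df + 0.5)).
  // It drops below zero once a term occurs in more than half of the documents.
  static double probabilisticIdf(int docFreq, int numDocs) {
    return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
  }

  public static void main(String[] args) {
    int numDocs = 1000;
    for (int docFreq : new int[] {1, 10, 100, 900}) {
      System.out.printf("df=%d classic=%.4f probabilistic=%.4f%n",
          docFreq, classicIdf(docFreq, numDocs), probabilisticIdf(docFreq, numDocs));
    }
  }
}
```

Both variants decrease as docFreq grows, so for ranking fragments against each other they tend to produce a similar order, which may be why the difference was hard to see.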

bq. 3. Please do not hesitate to update ScoreComparator (do not add 
WeightOrderFragmentsBuilder) 

Hm, I thought about something like that: 

{code:xml}
<highlighting>
  <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
  <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
</highlighting>
{code}

For Solr-users (like me). If somebody would like to use the boost-based 
ordering, he could. Maybe, for some use-cases the boost-based approach is 
better than the weighted one.  

bq. 4 Could you update package javadoc ( 
https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description
 ) and insert totalWeight into description and figures.

Okay. 

bq. 5. use docFreq(String field, BytesRef term) version for trunk to avoid 
creating Term object.

Okay. 

bq. I agree. I think if there is a table so that we can compare totalBoost 
(current) and totalWeight (patch) with real values, it helps a lot.

I'll write some Proof-of-concept Test-Class. But this may take some time. 


I discovered a little problem with overlapping terms, depending on the 
analyzing-process:

WeightedPhraseInfo.addIfNoOverlap() dumps the second part of hyphenated words 
(for example: social-economics). The result is that all information in 
TermInfo is lost and not available for computing the fragment's weight. I 
simply modified WeightedPhraseInfo.addIfNoOverlap() a little to change this 
behavior: 

{code:java}
void addIfNoOverlap( WeightedPhraseInfo wpi ){
 for( WeightedPhraseInfo existWpi : phraseList ){
  if( existWpi.isOffsetOverlap( wpi ) ) {
   existWpi.termInfos.addAll( wpi.termInfos );
   return;
  }
 }
 phraseList.add( wpi );
}
{code}

But I am not sure whether there could be some unforeseen side effects. 
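
To make the changed behavior easier to reason about, here is a self-contained sketch of the same merge-on-overlap logic. PhraseSketch and PhraseListSketch are simplified, hypothetical stand-ins for WeightedPhraseInfo and its containing list; the overlap test on half-open character offsets is an assumption and not necessarily what isOffsetOverlap() does in Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for WeightedPhraseInfo.
final class PhraseSketch {
  final int start;
  final int end; // character offsets, end exclusive
  final List<String> termInfos = new ArrayList<>();

  PhraseSketch(int start, int end, String... terms) {
    this.start = start;
    this.end = end;
    termInfos.addAll(Arrays.asList(terms));
  }

  boolean isOffsetOverlap(PhraseSketch other) {
    return start < other.end && other.start < end;
  }
}

final class PhraseListSketch {
  final List<PhraseSketch> phraseList = new ArrayList<>();

  // Same shape as the modified method: on overlap, keep the existing phrase
  // but merge the newcomer's terms into it instead of dropping them.
  void addIfNoOverlap(PhraseSketch wpi) {
    for (PhraseSketch exist : phraseList) {
      if (exist.isOffsetOverlap(wpi)) {
        exist.termInfos.addAll(wpi.termInfos);
        return;
      }
    }
    phraseList.add(wpi);
  }
}
```

Two things stand out in the sketch: the offsets of the surviving phrase are not widened, and the merged list can contain the same term text twice. The second point should be harmless for the weighting as long as the score only counts distinct terms.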






 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, 
 LUCENE-4.0-SNAPSHOT-3440-3.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-25 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13114402#comment-13114402
 ] 

Koji Sekiguchi commented on LUCENE-3440:


{quote}
Hm, I thought about something like that: 

{code:xml}
<highlighting>
  <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
  <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
</highlighting>
{code}

For Solr-users (like me). If somebody would like to use the boost-based 
ordering, he could. Maybe, for some use-cases the boost-based approach is 
better than the weighted one.  
{quote}

I thought that, too. But I saw the following in the patch:

{code}
public List<WeightedFragInfo> getWeightedFragInfoList( List<WeightedFragInfo> src ) {
  Collections.sort( src, new ScoreComparator() );
  //Collections.sort( src, new WeightComparator() );
  return src;
}
{code}

And I thought you wanted to use WeightComparator from 
ScoreOrderFragmentsBuilder. :)

Well now, let's introduce WeightOrderFragmentsBuilder.
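
The difference between the two builders then comes down to which comparator sorts the fragments. Here is a minimal sketch using two rows from the comparison tables posted in this thread; ScoredFrag and OrderSketch are hypothetical names, not the ScoreComparator or WeightComparator from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for WeightedFragInfo carrying both scores.
final class ScoredFrag {
  final String text;
  final float totalBoost;
  final float totalWeight;

  ScoredFrag(String text, float totalBoost, float totalWeight) {
    this.text = text;
    this.totalBoost = totalBoost;
    this.totalWeight = totalWeight;
  }
}

final class OrderSketch {
  // Highest boost first (the current ordering).
  static final Comparator<ScoredFrag> BY_BOOST =
      (a, b) -> Float.compare(b.totalBoost, a.totalBoost);

  // Highest IDF-based weight first (the ordering proposed by the patch).
  static final Comparator<ScoredFrag> BY_WEIGHT =
      (a, b) -> Float.compare(b.totalWeight, a.totalWeight);

  // Sorts a copy so that the caller's list stays untouched.
  static List<ScoredFrag> ordered(List<ScoredFrag> src, Comparator<ScoredFrag> cmp) {
    List<ScoredFrag> copy = new ArrayList<>(src);
    Collections.sort(copy, cmp);
    return copy;
  }
}
```

With the fragments "das das das das" (boost 4.0, weight 1.5566137) and "das alte testament" (boost 3.0, weight 5.799069), the boost ordering puts the repeated stop word first, while the weight ordering prefers the fragment that contains the whole query.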


 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5, 4.0
Reporter: sebastian L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5, 4.0

 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, 
 LUCENE-4.0-SNAPSHOT-3440-3.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-23 Thread S.L. (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113389#comment-13113389
 ] 

S.L. commented on LUCENE-3440:
--

Hi Koji,

bq. 1. Which patch do you want me to try?

Doesn't matter. It's the first time in a long while that I have used trunk. 
I'm looking forward to the new admin-interface in solr/lucene-4.0! 

bq. 2. Can you make that for trunk branch?

Here we go. This version is slightly different: the weight is now boosted by 
the normalized number of terms per fragment:

{code:borderStyle=dotted}
for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
  SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
  subInfos.add( subInfo );
  Iterator it = phraseInfo.termInfos.iterator();
  TermInfo ti;
  totalBoost += phraseInfo.boost;
  while ( it.hasNext() ) {
    ti = ( TermInfo ) it.next();
    if ( uniqueTerms.add( ti.text ) )
      totalWeight += Math.pow( ti.weight, 2 ) * phraseInfo.boost;
    termsPerFrag++;
  }
}
totalWeight *= termsPerFrag * ( 1 / Math.sqrt( termsPerFrag ) );
}
{code}

Due to a significant lack of mathematical knowledge, this is a *very* 
_intuitive_ solution. 
But it seems to work very well, at least for our data (highly multilingual, 
mostly historical, dirty OCRed books, journals and papers).  

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5

 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, 
 LUCENE-4.0-SNAPSHOT-3440-3.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-23 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113873#comment-13113873
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Patch looks great! A few comments:

# For the new totalWeight, add getter method and modify toString() in 
WeightedFragInfo().
# The patch uses hard-coded DefaultSimilarity to calculate idf. I don't think 
that a custom similarity can be used here, too. If so, how about just copying 
idf method rather than creating a similarity object?
# Please do not hesitate to update ScoreComparator (do not add 
WeightOrderFragmentsBuilder)
# Could you update package javadoc ( 
https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description
 ) and insert totalWeight into description and figures.
# use docFreq(String field, BytesRef term) version for trunk to avoid creating 
Term object.

bq. Due to a significant lack of mathematical knowledge, a very intuitive 
solution. 

I agree. I think a table that lets us compare totalBoost (current) and 
totalWeight (patch) with real values would help a lot.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5

 Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch, 
 LUCENE-4.0-SNAPSHOT-3440-3.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-22 Thread S.L. (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13112481#comment-13112481
 ] 

S.L. commented on LUCENE-3440:
--

No, I can't verify that. It's my first patch, so maybe I did something wrong. 
The patch is built from branch_3x with the Subversion plug-in for Eclipse. I 
checked out today's branch_3x (Import - SVN - Checkout projects ...) a few 
minutes ago and patched it (Team - Apply patch). No problem with my setup. 

Another approach:

Assuming a user searches for a single word, they would rather see fragments 
with an accumulation of that word:

{code:title=Bar.java|borderStyle=solid}
  for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
    SubInfo subInfo = new SubInfo( phraseInfo.text, phraseInfo.termsOffsets, phraseInfo.seqnum );
    subInfos.add( subInfo );

    // Record each distinct term and add its weight, raised to its own power,
    // scaled by the boost of the phrase it belongs to.
    Iterator it = phraseInfo.termInfos.iterator();
    TermInfo ti;

    while ( it.hasNext() ) {
      ti = ( TermInfo ) it.next();
      distinctTerms.add( ti.text );
      totalBoost += Math.pow( ti.weight, ti.weight ) * phraseInfo.boost;
    }
  }

  // Favor fragments that contain more distinct query terms.
  totalBoost *= distinctTerms.size();
{code}




 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: patch
 Fix For: 3.5

 Attachments: LUCENE-3440-1.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-22 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13112490#comment-13112490
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Ah, I see. I was looking at trunk, but you made the patch for 3x. I'll take a look.

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: patch
 Fix For: 3.5

 Attachments: LUCENE-3440-1.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-22 Thread S.L. (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13112887#comment-13112887
 ] 

S.L. commented on LUCENE-3440:
--

Here is another patch. 

- The calculation of WeightedFragInfo.totalBoost remains unmodified
- A new field WeightedFragInfo.totalWeight has been introduced
- A new class WeightOrderFragmentsBuilder now sorts by WeightedFragInfo.totalWeight
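
The ordering described in the last point can be sketched as a plain comparator. This is a minimal illustration under assumptions, not the code from the patch: FragInfo is a hypothetical stand-in for WeightedFragInfo that carries only the two scores, and the class and method names (WeightOrderSketch, orderByWeight) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: order fragments by totalWeight while leaving totalBoost untouched. */
final class WeightOrderSketch {

    /** Stand-in for WeightedFragInfo with only the fields needed here (an assumption). */
    static final class FragInfo {
        final String text;
        final float totalBoost;  // existing score, calculation unchanged
        final float totalWeight; // new idf-based score

        FragInfo(String text, float totalBoost, float totalWeight) {
            this.text = text;
            this.totalBoost = totalBoost;
            this.totalWeight = totalWeight;
        }
    }

    /** Returns a copy of the list sorted by totalWeight, highest first. */
    static List<FragInfo> orderByWeight(List<FragInfo> fragments) {
        List<FragInfo> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((FragInfo f) -> f.totalWeight).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<FragInfo> fragments = new ArrayList<>();
        fragments.add(new FragInfo("many common words", 3.0f, 1.1f));
        fragments.add(new FragInfo("one rare word", 2.0f, 6.7f));

        for (FragInfo f : orderByWeight(fragments)) {
            System.out.println(f.text + " totalBoost=" + f.totalBoost + " totalWeight=" + f.totalWeight);
        }
    }
}
```

Because the sort is stable, fragments with equal totalWeight keep their original relative order.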

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: patch
 Fix For: 3.5

 Attachments: LUCENE-3440-1.patch, LUCENE-3440-2.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-22 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113086#comment-13113086
 ] 

Koji Sekiguchi commented on LUCENE-3440:


Hi,

# Which patch do you want me to try?
# Can you make one for the trunk branch?


 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: FastVectorHighlighter
 Fix For: 3.5

 Attachments: LUCENE-3440-1.patch, LUCENE-3440-2.patch





[jira] [Commented] (LUCENE-3440) FastVectorHighlighter: IDF-weighted terms for ordered fragments

2011-09-21 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13112261#comment-13112261
 ] 

Koji Sekiguchi commented on LUCENE-3440:


I think this is an interesting point of view, thanks! But I couldn't apply the 
patch to the latest trunk:

{code}
[koji@MacBook LUCENE-3440]$ patch -p0 --dry-run < LUCENE-3440.patch
patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldFragList.java
patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
patching file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldTermStack.java
Hunk #1 FAILED at 31.
Hunk #2 FAILED at 96.
Hunk #3 FAILED at 108.
Hunk #4 succeeded at 148 (offset -9 lines).
3 out of 4 hunks FAILED -- saving rejects to file lucene/contrib/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldTermStack.java.rej
{code}

Can you verify that?

 FastVectorHighlighter: IDF-weighted terms for ordered fragments 
 

 Key: LUCENE-3440
 URL: https://issues.apache.org/jira/browse/LUCENE-3440
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 3.5
Reporter: S.L.
Priority: Minor
  Labels: patch
 Fix For: 3.5

 Attachments: LUCENE-3440-1.patch

