[ https://issues.apache.org/jira/browse/LUCENE-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114291#comment-13114291 ]
sebastian L. commented on LUCENE-3440:
--------------------------------------

bq. Patch looks great!

Thanks.

bq. 1. For the new totalWeight, add getter method and modify toString() in WeightedFragInfo().

Okay.

bq. 2. The patch uses hard-coded DefaultSimilarity to calculate idf. I don't think that a custom similarity can be used here, too. If so, how about just copying idf method rather than creating a similarity object?

I played a little with log((numDocs - docFreq + 0.5) / (docFreq + 0.5)), but it seems to make no difference. If I'm not mistaken, there is no method IndexReader.getSimilarity() or IndexReader.getDefaultSimilarity(). Therefore: Okay.

bq. 3. Please do not hesitate to update ScoreComparator (do not add WeightOrderFragmentsBuilder)

Hm, I thought about something like this:

{code:xml}
<highlighting>
  <fragmentsBuilder name="ordered" class="org.apache.solr.highlight.ScoreOrderFragmentsBuilder" default="false"/>
  <fragmentsBuilder name="weighted" class="org.apache.solr.highlight.WeightOrderFragmentsBuilder" default="true"/>
</highlighting>
{code}

This is for Solr users (like me). If somebody would like to use the boost-based ordering, he could. Maybe for some use cases the boost-based approach is better than the weighted one.

bq. 4 Could you update package javadoc ( https://builds.apache.org//job/Lucene-trunk/javadoc/contrib-highlighter/org/apache/lucene/search/vectorhighlight/package-summary.html#package_description ) and insert totalWeight into description and figures.

Okay.

bq. 5. use docFreq(String field, BytesRef term) version for trunk to avoid creating Term object.

Okay.

bq. I agree. I think if there is a table so that we can compare totalBoost (current) and totalWeight (patch) with real values, it helps a lot.

I'll write a proof-of-concept test class. But this may take some time.

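For reference, the two idf variants from point 2 can be compared with a minimal sketch. It assumes the formula used by DefaultSimilarity.idf() in Lucene 3.x, log(numDocs / (docFreq + 1)) + 1, and reads the alternative as log((numDocs - docFreq + 0.5) / (docFreq + 0.5)). The names IdfSketch, classicIdf and bm25StyleIdf are made up for illustration and are not part of the patch.

```java
// Illustration only: IdfSketch, classicIdf and bm25StyleIdf are hypothetical names.
class IdfSketch {

    // Formula of DefaultSimilarity.idf(int docFreq, int numDocs) in Lucene 3.x (assumed here).
    static float classicIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // The variant mentioned under point 2, with the parentheses written out.
    static float bm25StyleIdf(int docFreq, int numDocs) {
        return (float) Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        int[] docFreqs = {1, 10, 100, 600};
        // Print both values side by side for a few document frequencies.
        for (int docFreq : docFreqs) {
            System.out.printf("docFreq=%d classic=%.3f bm25Style=%.3f%n",
                    docFreq, classicIdf(docFreq, numDocs), bm25StyleIdf(docFreq, numDocs));
        }
    }
}
```

Both variants fall as docFreq grows, so they order individual terms the same way. The second one becomes negative once a term occurs in more than half of the documents.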

I discovered a little problem with overlapping terms, depending on the analysis process: WeightedPhraseInfo.addIfNoOverlap() discards the second part of hyphenated words (for example: social-economics). As a result, all the information in its TermInfo is lost and is not available for computing the fragment's weight. I simply modified WeightedPhraseInfo.addIfNoOverlap() a little to change this behavior:

{code:java}
void addIfNoOverlap( WeightedPhraseInfo wpi ){
  for( WeightedPhraseInfo existWpi : phraseList ){
    if( existWpi.isOffsetOverlap( wpi ) ) {
      // Overlap: keep the existing phrase, but retain the term infos of the new one
      existWpi.termInfos.addAll( wpi.termInfos );
      return;
    }
  }
  phraseList.add( wpi );
}
{code}

But I am not sure: could there be some unforeseen side effects?

> FastVectorHighlighter: IDF-weighted terms for ordered fragments
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3440
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3440
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 3.5, 4.0
>            Reporter: sebastian L.
>            Priority: Minor
>              Labels: FastVectorHighlighter
>             Fix For: 3.5, 4.0
>
>         Attachments: LUCENE-3.5-SNAPSHOT-3440-3.patch,
> LUCENE-4.0-SNAPSHOT-3440-3.patch
>
>
> The FastVectorHighlighter uses for every term found in a fragment an equal
> weight, which causes a higher ranking for fragments with a high number of
> words or, in the worst case, a high number of very common words than
> fragments that contains *all* of the terms used in the original query.
> This patch provides ordered fragments with IDF-weighted terms:
> total weight = total weight + IDF for unique term per fragment * boost of
> query;
> The ranking-formula should be the same, or at least similar, to that one used
> in org.apache.lucene.search.highlight.QueryTermScorer.
> The patch is simple, but it works for us.
> Some ideas:
> - A better approach would be moving the whole fragments-scoring into a
> separate class.
> - Switch scoring via parameter
> - Exact phrases should be given an even better score, regardless of whether a
> phrase-query was executed or not
> - edismax/dismax-parameters pf, ps and pf^boost should be observed and
> corresponding fragments should be ranked higher
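
The ranking rule from the issue description (total weight = total weight + IDF for unique term per fragment * boost of query) could be sketched roughly as follows. This is only an illustration under assumptions: FragmentWeightSketch and TermHit are hypothetical stand-ins, not the Lucene classes changed by the patch, the idf formula is assumed to be the one of DefaultSimilarity in Lucene 3.x, and totalBoost() simply sums the boost of every hit to mimic the equal-weight behaviour described above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: none of these types are the Lucene classes touched by the patch.
class FragmentWeightSketch {

    // Hypothetical stand-in for one query-term hit inside a fragment.
    static final class TermHit {
        final String text;
        final float boost;

        TermHit(String text, float boost) {
            this.text = text;
            this.boost = boost;
        }
    }

    // Assumed idf formula, as in DefaultSimilarity of Lucene 3.x.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Equal weight per hit: every occurrence adds its boost.
    static float totalBoost(List<TermHit> hits) {
        float total = 0f;
        for (TermHit hit : hits) {
            total += hit.boost;
        }
        return total;
    }

    // IDF-weighted: each unique term adds idf * boost once per fragment.
    static float totalWeight(List<TermHit> hits, Map<String, Integer> docFreqs, int numDocs) {
        Set<String> seen = new HashSet<String>();
        float total = 0f;
        for (TermHit hit : hits) {
            if (seen.add(hit.text)) {
                // Terms without a known document frequency are treated as docFreq 0.
                int docFreq = docFreqs.containsKey(hit.text) ? docFreqs.get(hit.text) : 0;
                total += idf(docFreq, numDocs) * hit.boost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up document frequencies in an index of 1000 documents.
        Map<String, Integer> docFreqs = new HashMap<String, Integer>();
        docFreqs.put("the", 900);
        docFreqs.put("lucene", 10);
        docFreqs.put("highlighter", 5);

        List<TermHit> common = Arrays.asList(
                new TermHit("the", 1f), new TermHit("the", 1f),
                new TermHit("the", 1f), new TermHit("the", 1f));
        List<TermHit> rare = Arrays.asList(
                new TermHit("lucene", 1f), new TermHit("highlighter", 1f));

        System.out.println("common: totalBoost=" + totalBoost(common)
                + " totalWeight=" + totalWeight(common, docFreqs, 1000));
        System.out.println("rare:   totalBoost=" + totalBoost(rare)
                + " totalWeight=" + totalWeight(rare, docFreqs, 1000));
    }
}
```

With these made-up numbers, the fragment that repeats a very common word wins on totalBoost, while the fragment that contains both rarer query terms wins on totalWeight.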