[jira] [Commented] (LUCENE-8145) UnifiedHighlighter should use single OffsetEnum rather than List
[ https://issues.apache.org/jira/browse/LUCENE-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347823#comment-16347823 ] Timothy M. Rodriguez commented on LUCENE-8145: -- Thanks for the CC [~dsmiley]. [~romseygeek] really nice change! It definitely simplifies things quite a bit, and conceptually one meta OffsetsEnum over the field makes more sense than the previous list. I'm in favor of keeping the summed frequency on MTQ, or at least preserving a mechanism to keep it on. The extra occurrences are not spurious in all cases. For example, consider "expert" systems where users are accustomed to using wildcards for stemming-like expressions, e.g. purchas* to get variants of the word purchase. In those cases, the extra frequency counts would hopefully select a better passage. I'm not so sure about setScore being passed a scorer and a content length to set the score, though. That feels awkward to me. If we were to keep it this way, I'd argue a Passage should receive the PassageScorer and content length at construction instead of via the setScore method. If we did that, I think we could incrementally build the score instead of tracking terms and frequencies for a later score calculation. Another choice is to move a lot of the scoring behavior out, and perhaps introduce another class that tracks the terms and score in a passage, analogous to Weight? > UnifiedHighlighter should use single OffsetEnum rather than List > > > Key: LUCENE-8145 > URL: https://issues.apache.org/jira/browse/LUCENE-8145 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: Alan Woodward > Assignee: Alan Woodward > Priority: Minor > Attachments: LUCENE-8145.patch > > > The UnifiedHighlighter deals with several different aspects of highlighting: > finding highlight offsets, breaking content up into snippets, and passage > scoring. It would be nice to split this up so that consumers can use them > separately. 
> As a first step, I'd like to change the API of FieldOffsetStrategy to return > a single unified OffsetsEnum, rather than a collection of them. This will > make it easier to expose the OffsetsEnum of a document directly from the > highlighter, bypassing snippet extraction and scoring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
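The construction-time idea floated in the comment above could be sketched like this. This is a minimal illustration, not the actual UnifiedHighlighter API: the `ScoredPassage` class and the simplified `PassageScorer` interface here are hypothetical names invented for the example.

```java
// Hypothetical sketch: a Passage that receives its scorer and content length
// at construction, accumulating score as matches are added, instead of a
// separate setScore(...) call after the fact. All names are illustrative.
public class ScoredPassage {
    public interface PassageScorer {
        // Simplified stand-in for a passage term weight function.
        float weight(String term, int freqInPassage, int contentLength);
    }

    private final PassageScorer scorer;
    private final int contentLength;
    private float score = 0f;

    public ScoredPassage(PassageScorer scorer, int contentLength) {
        this.scorer = scorer;
        this.contentLength = contentLength;
    }

    // Each matched term updates the running score immediately, so there is
    // no need to track terms and frequencies for a later batch computation.
    public void addMatch(String term, int freq) {
        score += scorer.weight(term, freq, contentLength);
    }

    public float getScore() {
        return score;
    }
}
```

With this shape, summed MTQ frequencies would simply flow in through `addMatch` as they are encountered.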
[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213052#comment-16213052 ] Timothy M. Rodriguez commented on LUCENE-7976: -- I didn't know that! Thanks for pointing that out. > Add a parameter to TieredMergePolicy to merge segments that have more than X > percent deleted documents > -- > > Key: LUCENE-7976 > URL: https://issues.apache.org/jira/browse/LUCENE-7976 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Erick Erickson > > We're seeing situations "in the wild" where there are very large indexes (on > disk) handled quite easily in a single Lucene index. This is particularly > true as features like docValues move data into MMapDirectory space. The > current TMP algorithm allows on the order of 50% deleted documents as per a > dev list conversation with Mike McCandless (and his blog here: > https://www.elastic.co/blog/lucenes-handling-of-deleted-documents). > Especially in the current era of very large indexes in aggregate (think many > TB), solutions like "you need to distribute your collection over more shards" > become very costly. Additionally, the tempting "optimize" button exacerbates > the issue since once you form, say, a 100G segment (by > optimizing/forceMerging) it is not eligible for merging until 97.5G of the > docs in it are deleted (current default 5G max segment size). > The proposal here would be to add a new parameter to TMP, something like > (no, that's not a serious name, suggestions > welcome) which would default to 100 (i.e. the same behavior we have now). > So if I set this parameter to, say, 20%, and the max segment size stays at > 5G, the following would happen when segments were selected for merging: > > any segment with > 20% deleted documents would be merged or rewritten NO > > MATTER HOW LARGE. There are two cases, > >> the segment has < 5G "live" docs. In that case it would be merged with > >> smaller segments to bring the resulting segment up to 5G. 
If no smaller > >> segments exist, it would just be rewritten > >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). > >> It would be rewritten into a single segment removing all deleted docs no > >> matter how big it is to start. The 100G example above would be rewritten > >> to an 80G segment for instance. > Of course this would lead to potentially much more I/O which is why the > default would be the same behavior we see now. As it stands now, though, > there's no way to recover from an optimize/forceMerge except to re-index from > scratch. We routinely see 200G-300G Lucene indexes at this point "in the > wild" with 10s of shards replicated 3 or more times. And that doesn't even > include having these over HDFS. > Alternatives welcome! Something like the above seems minimally invasive. A > new merge policy is certainly an alternative.
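The eligibility rule proposed above can be sketched as follows. This is an illustrative reduction of the idea, not TieredMergePolicy's actual code; the class and method names are invented for the example, and a threshold of 100 reproduces today's behavior.

```java
// Sketch of the proposed rule: a segment whose deleted percentage exceeds
// the configured threshold becomes a merge/rewrite candidate regardless of
// size; otherwise the normal max-segment-size check applies.
public class DeletePctRule {
    public static double deletedPct(int maxDoc, int delCount) {
        return 100.0 * delCount / maxDoc;
    }

    // threshold = 100 keeps current behavior: nothing is force-merged.
    public static boolean eligibleForMerge(long sizeBytes, long maxSegmentBytes,
                                           int maxDoc, int delCount,
                                           double deletedPctThreshold) {
        if (deletedPct(maxDoc, delCount) > deletedPctThreshold) {
            return true; // forced candidate, NO MATTER HOW LARGE
        }
        return sizeBytes < maxSegmentBytes; // normal size-based eligibility
    }
}
```

Under this rule, the 100G forceMerged segment in the example becomes eligible for rewriting as soon as its deleted percentage crosses the threshold, rather than waiting for 97.5% deletions.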
[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
[ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213042#comment-16213042 ] Timothy M. Rodriguez commented on LUCENE-8000: -- [~rcmuir] thanks for the further explanation. That helped clarify. It does seem the effect would be minor at best. It'd be an interesting experiment at some point, though. If I ever get to trying it, I'll post back. [~gol...@detego-software.de] As an additional point, advanced use cases often utilize token "stacking" for additional purposes as well, and these would further distort length. For example, some folks use analysis chains that stack variants of URLs, currencies, etc. > Document Length Normalization in BM25Similarity correct? > > > Key: LUCENE-8000 > URL: https://issues.apache.org/jira/browse/LUCENE-8000 > Project: Lucene - Core > Issue Type: Bug > Reporter: Christoph Goller > Priority: Minor > > Length of individual documents only counts the number of positions of a > document since discountOverlaps defaults to true. > {code} > @Override > public final long computeNorm(FieldInvertState state) { > final int numTerms = discountOverlaps ? state.getLength() - > state.getNumOverlap() : state.getLength(); > int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor(); > if (indexCreatedVersionMajor >= 7) { > return SmallFloat.intToByte4(numTerms); > } else { > return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms))); > } > } > {code} > Measuring document length this way seems perfectly OK to me. What bothers > me is that > average document length is based on sumTotalTermFreq for a field. As far as I > understand, that sums up totalTermFreqs for all terms of a field, therefore > counting positions of terms including those that overlap. 
> {code} > protected float avgFieldLength(CollectionStatistics collectionStats) { > final long sumTotalTermFreq = collectionStats.sumTotalTermFreq(); > if (sumTotalTermFreq <= 0) { > return 1f; // field does not exist, or stat is unsupported > } else { > final long docCount = collectionStats.docCount() == -1 ? > collectionStats.maxDoc() : collectionStats.docCount(); > return (float) (sumTotalTermFreq / (double) docCount); > } > } > {code} > Are we comparing apples and oranges in the final scoring? > I haven't run any benchmarks and I am not sure whether this has a serious > effect. It just means that documents that have synonyms, or in my use case > different normal forms of tokens on the same position, appear shorter and > therefore get higher scores than they should, and that we do not use the > whole spectrum of relative document length of BM25. > I think for BM25 discountOverlaps should default to false.
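The apples-vs-oranges concern above can be made concrete with a small worked example. The numbers here are my own invention, not from the issue: two documents of 10 positions each, where one stacks a synonym on half of its positions.

```java
// Worked example of the mismatch: with discountOverlaps=true the per-document
// norm counts positions only, while avgFieldLength = sumTotalTermFreq/docCount
// also counts stacked (overlapping) tokens -- so in a synonym-heavy corpus
// every document looks shorter than "average".
public class Bm25LengthMismatch {
    // Collection-side average, as BM25Similarity's avgFieldLength computes it.
    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        return (float) (sumTotalTermFreq / (double) docCount);
    }

    public static void main(String[] args) {
        // Two docs, 10 positions each; doc A stacks a synonym on 5 of them.
        int positionsPerDoc = 10;
        int docAOverlaps = 5, docBOverlaps = 0;

        // sumTotalTermFreq counts every token, including the stacked ones.
        long sumTotalTermFreq = (positionsPerDoc + docAOverlaps)
                              + (positionsPerDoc + docBOverlaps); // 25
        float avg = avgFieldLength(sumTotalTermFreq, 2); // 12.5

        // Norm-side length is 10 for both docs, so each scores as if it were
        // shorter than average (10 / 12.5 = 0.8) even though both are exactly
        // average when measured in positions.
        System.out.println("relative length = " + (positionsPerDoc / avg));
    }
}
```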
[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213024#comment-16213024 ] Timothy M. Rodriguez commented on LUCENE-7976: -- An additional place where deletions come up is in replica differences due to the way merging happened on each replica. This can cause jitter in results, where the ordering depends on which replica answered a query, because the frequencies are off significantly enough. I know this problem will never go away completely, since we can't flush away deletes immediately, but allowing some reclamation of deletes in large segments will help minimize the issue. On max segment size, I also think the merge policy ought to dutifully respect maxSegmentSize. If we don't, other subtle bugs can surface for users, such as exceeding file-size ulimits they thought they were safely under. > Add a parameter to TieredMergePolicy to merge segments that have more than X > percent deleted documents > -- > > Key: LUCENE-7976 > URL: https://issues.apache.org/jira/browse/LUCENE-7976 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Erick Erickson
[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
[ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211286#comment-16211286 ] Timothy M. Rodriguez commented on LUCENE-8000: -- Makes sense, agreed on both points. > Document Length Normalization in BM25Similarity correct? > > > Key: LUCENE-8000 > URL: https://issues.apache.org/jira/browse/LUCENE-8000 > Project: Lucene - Core > Issue Type: Bug > Reporter: Christoph Goller > Priority: Minor
[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?
[ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211227#comment-16211227 ] Timothy M. Rodriguez commented on LUCENE-8000: -- +1 for keeping the existing default of true. It definitely struck me as weird too, but for many indexes flipping the default would result in markedly worse behavior. Rather than disabling discountOverlaps, maybe the more ideal fix would be to make the average document length equal to the total number of positions across the collection divided by the number of documents. That way we'd be comparing position length to average position length. However, I haven't looked into the feasibility or expense of doing that. If we were able to do that, discountOverlaps could move to something like countPositions vs. countFrequencies. > Document Length Normalization in BM25Similarity correct? > > > Key: LUCENE-8000 > URL: https://issues.apache.org/jira/browse/LUCENE-8000 > Project: Lucene - Core > Issue Type: Bug > Reporter: Christoph Goller > Priority: Minor
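The position-based average suggested in the comment above might look like the sketch below. Note the assumption: Lucene does not currently expose a per-field "sum of positions" collection statistic, so this presumes such a value could be tracked at index time; the class name is invented for the example.

```java
// Sketch of a position-based average field length: compute the average from
// position counts (overlaps discounted), so the per-document length and the
// collection average are measured the same way.
public class PositionBasedAvg {
    public static float avgFieldLength(long sumPositions, long docCount) {
        if (sumPositions <= 0 || docCount <= 0) {
            return 1f; // mirror BM25Similarity's guard for missing stats
        }
        return (float) (sumPositions / (double) docCount);
    }
}
```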
[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191470#comment-16191470 ] Timothy M. Rodriguez commented on LUCENE-7976: -- If a collection has many 5GB segments, it's possible for many of them to be at less than 50% deletions but still accumulate a fair amount of deletes. Increasing the max segment size helps, but increases the amount of churn on disk through large merges. > Add a parameter to TieredMergePolicy to merge segments that have more than X > percent deleted documents > -- > > Key: LUCENE-7976 > URL: https://issues.apache.org/jira/browse/LUCENE-7976 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Erick Erickson
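Some back-of-envelope arithmetic for the scenario above (my own numbers, purely illustrative): many max-sized segments each under the ~50% reclaim point can still pin a large amount of dead data on disk.

```java
// Estimate the disk space held by deleted documents across segments that are
// each below the merge-eligibility threshold.
public class DeadDataEstimate {
    public static long deadBytes(int numSegments, long segmentBytes,
                                 double deletedFraction) {
        return (long) (numSegments * segmentBytes * deletedFraction);
    }

    public static void main(String[] args) {
        long fiveGb = 5L * 1024 * 1024 * 1024;
        // 40 segments at 5G, each 40% deleted: none is merge-eligible under
        // the current policy, yet ~80G of the index is deleted documents.
        System.out.println(deadBytes(40, fiveGb, 0.4) / (1024L * 1024 * 1024) + "G");
    }
}
```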
[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents
[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191319#comment-16191319 ] Timothy M. Rodriguez commented on LUCENE-7976: -- Agreed, it's not strictly a result of optimizations. It can happen for large collections or with many updates to existing documents. > Add a parameter to TieredMergePolicy to merge segments that have more than X > percent deleted documents > -- > > Key: LUCENE-7976 > URL: https://issues.apache.org/jira/browse/LUCENE-7976 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Erick Erickson
[jira] [Commented] (LUCENE-6513) Allow limits on SpanMultiTermQueryWrapper expansion
[ https://issues.apache.org/jira/browse/LUCENE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188913#comment-16188913 ] Timothy M. Rodriguez commented on LUCENE-6513: -- Apologies for the late alternative implementation. For what it's worth, we've been utilizing this patch for about a year and it's helped improve responsiveness to queries while limiting the expansions. > Allow limits on SpanMultiTermQueryWrapper expansion > --- > > Key: LUCENE-6513 > URL: https://issues.apache.org/jira/browse/LUCENE-6513 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Alan Woodward > Priority: Minor > Attachments: LUCENE-6513.patch, LUCENE-6513.patch, LUCENE-6513.patch, > LUCENE-6513.patch > > > SpanMultiTermQueryWrapper currently rewrites to a SpanOrQuery with as many > clauses as there are matching terms. It would be nice to be able to limit > this in a slightly nicer way than using TopTerms, which for most queries just > translates to a lexicographical ordering.
[jira] [Commented] (LUCENE-6513) Allow limits on SpanMultiTermQueryWrapper expansion
[ https://issues.apache.org/jira/browse/LUCENE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075601#comment-16075601 ] Timothy M. Rodriguez commented on LUCENE-6513: -- [~romseygeek] we've written a patch to solve this problem as well, which we've been meaning to share with the community. It goes about the solution in a bit of a different way. We'll try to get it up here in a day or two, though I'm not sure which approach will be preferable. > Allow limits on SpanMultiTermQueryWrapper expansion > --- > > Key: LUCENE-6513 > URL: https://issues.apache.org/jira/browse/LUCENE-6513 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Alan Woodward > Priority: Minor
[jira] [Commented] (LUCENE-7844) UnifiedHighlighter: simplify "maxPassages" input API
[ https://issues.apache.org/jira/browse/LUCENE-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025178#comment-16025178 ] Timothy M. Rodriguez commented on LUCENE-7844: -- This syntax looks really good! {code} unifiedHighlighter.highlight(query, topDocs, unifiedHighlighter.fieldOptionsWhole("title"), unifiedHighlighter.fieldOptions("body", 3) ); {code} with maybe {code}unifiedHighlighter.fieldOptionsWhole();{code} being a specialization of {code}unifiedHighlighter.fieldOptions("title", 3, BreakOption.WHOLE);{code} or something to that effect. Fair point on the performance difference being negligible. For now, I'd be in favor of leaving the current parallel-array approach and working towards a fieldOptions approach. I can offer to help on that end! > UnifiedHighlighter: simplify "maxPassages" input API > > > Key: LUCENE-7844 > URL: https://issues.apache.org/jira/browse/LUCENE-7844 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: David Smiley > Priority: Minor > Fix For: master (7.0) > > Attachments: LUCENE_7844__UH_maxPassages_simplification.patch > > > The "maxPassages" input to the UnifiedHighlighter can be provided as an array > to some of the public methods on UnifiedHighlighter. When it's provided as > an array, the index in the array is for the field in a parallel array. I > think this is awkward and furthermore it's inconsistent with the way this > highlighter customizes things on a per-field basis. Instead, the parameter > can be a simple int default (not an array), and then there can be a protected > method like {{getMaxPassageCount(String field)}} that returns an Integer > which, when non-null, replaces the default value for this field. > Aside from API simplicity and consistency, this will also remove some > annoying parallel array sorting going on. 
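The per-field options object discussed in these comments could take a shape like the following. To be clear, `FieldOptions` is a name we floated, not an actual UnifiedHighlighter class, and the fields chosen here are illustrative assumptions.

```java
// Hypothetical per-field options bundle for the UnifiedHighlighter: field
// name, passage count, and break behavior travel together, avoiding the
// parallel maxPassages array.
public class FieldOptions {
    public enum BreakOption { SENTENCE, WHOLE }

    private final String field;
    private final int maxPassages;
    private final BreakOption breakOption;

    public FieldOptions(String field, int maxPassages, BreakOption breakOption) {
        this.field = field;
        this.maxPassages = maxPassages;
        this.breakOption = breakOption;
    }

    // "Whole value" highlighting as a specialization: one passage, no breaking,
    // suitable for short fields like a title.
    public static FieldOptions whole(String field) {
        return new FieldOptions(field, 1, BreakOption.WHOLE);
    }

    public String field() { return field; }
    public int maxPassages() { return maxPassages; }
    public BreakOption breakOption() { return breakOption; }
}
```

Longer term, such an object could also carry a per-field break iterator, scorer, or formatter, as suggested in the earlier comment.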
[jira] [Commented] (LUCENE-7844) UnifiedHighlighter: simplify "maxPassages" input API
[ https://issues.apache.org/jira/browse/LUCENE-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024765#comment-16024765 ] Timothy M. Rodriguez commented on LUCENE-7844: -- +1 on the comparator use. That definitely cleaned up some code. I'm a bit uncertain about the maxPassages change, however. I think it may be pretty common to vary the number of passages required per field. For example, a user may want to highlight a title fully (one passage) and get several passages from the primary content field. The motivation to get rid of the parallel arrays makes a lot of sense; maybe we could lump all these options into an object per field? For lack of a better name, something like FieldOptions[] or the like? Longer term, I could even see options for the break iterator, scorer, and formatter being configured per field. (In the previous example, it may be better to have a dummy iterator that chunks on value delineations, a no-op scorer, and a formatter that just returns the entire stored value for the title, while the content would have more traditional options.) I know this is all still possible with overrides in the current design, but I'm not sure we should push it further into the "specialized" use-case area. What do you think? > UnifiedHighlighter: simplify "maxPassages" input API > > > Key: LUCENE-7844 > URL: https://issues.apache.org/jira/browse/LUCENE-7844 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter > Reporter: David Smiley > Priority: Minor > Fix For: master (7.0)
[jira] [Commented] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813306#comment-15813306 ] Timothy M. Rodriguez commented on LUCENE-7620: -- Me too! > UnifiedHighlighter: add target character width BreakIterator wrapper > > > Key: LUCENE-7620 > URL: https://issues.apache.org/jira/browse/LUCENE-7620 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley >Assignee: David Smiley > Fix For: 6.4 > > Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, > LUCENE_7620_UH_LengthGoalBreakIterator.patch, > LUCENE_7620_UH_LengthGoalBreakIterator.patch > > > The original Highlighter includes a {{SimpleFragmenter}} that delineates > fragments (aka Passages) by a character width. The default is 100 characters. > It would be great to support something similar for the UnifiedHighlighter. > It's useful in its own right and of course it helps users transition to the > UH. I'd like to do it as a wrapper to another BreakIterator -- perhaps a > sentence one. In this way you get back Passages that are a number of > sentences so they will look nice instead of breaking mid-way through a > sentence. And you get some control by specifying a target number of > characters. This BreakIterator wouldn't be a general purpose > java.text.BreakIterator since it would assume it's called in a manner exactly > as the UnifiedHighlighter uses it. It would probably be compatible with the > PostingsHighlighter too. > I don't propose doing this by default; besides, it's easy enough to pick your > BreakIterator config. 
[jira] [Comment Edited] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806094#comment-15806094 ] Timothy M. Rodriguez edited comment on LUCENE-7620 at 1/6/17 11:09 PM: --- Very useful! I like that it decorates an underlying BreakIterator. For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less? This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. This helps when coming back to the test and helps tease apart if one piece of functionality is broken vs another. was (Author: timothy055): Very useful! I like that it decorates an underlying BreakIterator. For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less? This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. > UnifiedHighlighter: add target character width BreakIterator wrapper > > > Key: LUCENE-7620 > URL: https://issues.apache.org/jira/browse/LUCENE-7620 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley >Assignee: David Smiley > Fix For: 6.4 > > Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, > LUCENE_7620_UH_LengthGoalBreakIterator.patch > > > The original Highlighter includes a {{SimpleFragmenter}} that delineates > fragments (aka Passages) by a character width. The default is 100 characters. > It would be great to support something similar for the UnifiedHighlighter. 
> It's useful in its own right and of course it helps users transition to the > UH. I'd like to do it as a wrapper to another BreakIterator -- perhaps a > sentence one. In this way you get back Passages that are a number of > sentences so they will look nice instead of breaking mid-way through a > sentence. And you get some control by specifying a target number of > characters. This BreakIterator wouldn't be a general purpose > java.text.BreakIterator since it would assume it's called in a manner exactly > as the UnifiedHighlighter uses it. It would probably be compatible with the > PostingsHighlighter too. > I don't propose doing this by default; besides, it's easy enough to pick your > BreakIterator config. 
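The idea discussed here, wrapping a sentence BreakIterator and extending the fragment until a boundary reaches a target character count, can be sketched standalone with the JDK's java.text.BreakIterator. This is an illustrative sketch of the concept, not the actual LengthGoalBreakIterator code from the patch:

```java
import java.text.BreakIterator;
import java.util.Locale;

// Sketch: accumulate whole sentences until the fragment reaches a target
// length, so passages end on sentence boundaries instead of mid-sentence.
public class LengthGoalDemo {

    /** Returns the first sentence boundary at or after start + lengthGoal. */
    static int boundaryNearGoal(String text, int start, int lengthGoal) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
        sentences.setText(text);
        int end = sentences.following(start);
        while (end != BreakIterator.DONE && end < start + lengthGoal) {
            end = sentences.next();
        }
        return end == BreakIterator.DONE ? text.length() : end;
    }

    public static void main(String[] args) {
        String text = "One short sentence. Another short sentence. A third one here.";
        // Small goal: one sentence suffices. Larger goal: whole sentences
        // accumulate until the goal is met, never cutting mid-sentence.
        System.out.println(boundaryNearGoal(text, 0, 10));
        System.out.println(boundaryNearGoal(text, 0, 30));
    }
}
```

The real wrapper additionally has to honor the contract of how the UnifiedHighlighter calls its BreakIterator, which is why the issue notes it wouldn't be a general-purpose java.text.BreakIterator.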
[jira] [Commented] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper
[ https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806094#comment-15806094 ] Timothy M. Rodriguez commented on LUCENE-7620: -- Very useful! I like that it decorates an underlying BreakIterator. For the following method, does it make sense to return the baseIter if the followingIdx < startIndex? Maybe throw an exception instead or just have an assert that it's less? This is subjective, but I find it's more useful to break out the different tests with methods for each condition. For example: breakAtGoal, breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom, etc. Similar for the defaultSummary tests. > UnifiedHighlighter: add target character width BreakIterator wrapper > > > Key: LUCENE-7620 > URL: https://issues.apache.org/jira/browse/LUCENE-7620 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley >Assignee: David Smiley > Fix For: 6.4 > > Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, > LUCENE_7620_UH_LengthGoalBreakIterator.patch > > > The original Highlighter includes a {{SimpleFragmenter}} that delineates > fragments (aka Passages) by a character width. The default is 100 characters. > It would be great to support something similar for the UnifiedHighlighter. > It's useful in its own right and of course it helps users transition to the > UH. I'd like to do it as a wrapper to another BreakIterator -- perhaps a > sentence one. In this way you get back Passages that are a number of > sentences so they will look nice instead of breaking mid-way through a > sentence. And you get some control by specifying a target number of > characters. This BreakIterator wouldn't be a general purpose > java.text.BreakIterator since it would assume it's called in a manner exactly > as the UnifiedHighlighter uses it. It would probably be compatible with the > PostingsHighlighter too. 
> I don't propose doing this by default; besides, it's easy enough to pick your > BreakIterator config. 
[jira] [Commented] (SOLR-8241) Evaluate W-TinyLfu cache
[ https://issues.apache.org/jira/browse/SOLR-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15801792#comment-15801792 ] Timothy M. Rodriguez commented on SOLR-8241: +1 for this issue. Solr currently uses caffeine-1.0.1 in its distribution, which can cause conflicts if you create any extensions that intend to use the new library. > Evaluate W-TinyLfu cache > > > Key: SOLR-8241 > URL: https://issues.apache.org/jira/browse/SOLR-8241 > Project: Solr > Issue Type: Wish > Components: search >Reporter: Ben Manes >Priority: Minor > Attachments: SOLR-8241.patch, SOLR-8241.patch, SOLR-8241.patch, > proposal.patch > > > SOLR-2906 introduced an LFU cache and in-progress SOLR-3393 makes it O(1). > The discussions seem to indicate that the higher hit rate (vs LRU) is offset > by the slower performance of the implementation. An original goal appeared to > be to introduce ARC, a patented algorithm that uses ghost entries to retain > history information. > My analysis of Window TinyLfu indicates that it may be a better option. It > uses a frequency sketch to compactly estimate an entry's popularity. It uses > LRU to capture recency and operates in O(1) time. When using available > academic traces the policy provides a near optimal hit rate regardless of the > workload. > I'm getting ready to release the policy in Caffeine, which Solr already has a > dependency on. But, the code is fairly straightforward and a port into Solr's > caches instead is a pragmatic alternative. More interesting is what the > impact would be in Solr's workloads and feedback on the policy's design. > https://github.com/ben-manes/caffeine/wiki/Efficiency 
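The "frequency sketch to compactly estimate an entry's popularity" mentioned above can be illustrated with a simplified count-min-style sketch. This is an illustrative sketch only, not Caffeine's FrequencySketch, which additionally packs 4-bit counters and periodically halves them ("aging") to keep the history fresh:

```java
import java.util.Random;

// Simplified count-min sketch: each key hashes into one counter per row,
// and the estimate is the minimum across rows, so it can over-count
// (hash collisions) but never under-count.
public class FreqSketchDemo {
    private final int[][] table;
    private final int[] seeds;

    FreqSketchDemo(int depth, int width, long seed) {
        table = new int[depth][width];
        seeds = new int[depth];
        Random r = new Random(seed);
        for (int i = 0; i < depth; i++) seeds[i] = r.nextInt() | 1;
    }

    private int index(int row, Object key) {
        int h = key.hashCode() * seeds[row];
        h ^= h >>> 16;
        return Math.floorMod(h, table[row].length);
    }

    void increment(Object key) {
        for (int i = 0; i < table.length; i++) table[i][index(i, key)]++;
    }

    int estimate(Object key) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < table.length; i++) {
            min = Math.min(min, table[i][index(i, key)]);
        }
        return min;
    }

    public static void main(String[] args) {
        FreqSketchDemo sketch = new FreqSketchDemo(4, 256, 42);
        for (int i = 0; i < 10; i++) sketch.increment("hot");
        sketch.increment("cold");
        // A TinyLFU admission policy would admit a candidate only if its
        // estimated frequency beats the eviction victim's.
        System.out.println(sketch.estimate("hot"));
        System.out.println(sketch.estimate("cold"));
    }
}
```

The point of the structure is the memory trade-off: popularity history for a huge keyspace fits in a small fixed-size table, which is what lets W-TinyLfu approximate LFU behavior in O(1) time and O(width × depth) space.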
[jira] [Comment Edited] (LUCENE-7578) UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API
[ https://issues.apache.org/jira/browse/LUCENE-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709795#comment-15709795 ] Timothy M. Rodriguez edited comment on LUCENE-7578 at 11/30/16 9:15 PM: Some care would have to be taken with spans, especially with significant slop. It's arguably worse to have a single highlight across it. But otherwise, this definitely is a desired improvement. was (Author: timothy055): Some care would have to be taken with spans, especially with significant slop. It's arguably worse to have a single highlight across it. > UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API > - > > Key: LUCENE-7578 > URL: https://issues.apache.org/jira/browse/LUCENE-7578 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley > > The PhraseHelper of the UnifiedHighlighter currently collects position-spans > per SpanQuery (and it knows which terms are in which SpanQuery), and then it > filters PostingsEnum based on that. It's similar to how the original > Highlighter WSTE works. The main problem with this approach is that it can > be inaccurate for some nested span queries -- LUCENE-2287, LUCENE-5455 (has > the clearest example), LUCENE-6796. Non-nested SpanQueries (e.g. that which > is converted from a PhraseQuery or MultiPhraseQuery) are _not_ a problem. 
[jira] [Commented] (LUCENE-7578) UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API
[ https://issues.apache.org/jira/browse/LUCENE-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709795#comment-15709795 ] Timothy M. Rodriguez commented on LUCENE-7578: -- Some care would have to be taken with spans, especially with significant slop. It's arguably worse to have a single highlight across it. > UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API > - > > Key: LUCENE-7578 > URL: https://issues.apache.org/jira/browse/LUCENE-7578 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley > > The PhraseHelper of the UnifiedHighlighter currently collects position-spans > per SpanQuery (and it knows which terms are in which SpanQuery), and then it > filters PostingsEnum based on that. It's similar to how the original > Highlighter WSTE works. The main problem with this approach is that it can > be inaccurate for some nested span queries -- LUCENE-2287, LUCENE-5455 (has > the clearest example), LUCENE-6796. Non-nested SpanQueries (e.g. that which > is converted from a PhraseQuery or MultiPhraseQuery) are _not_ a problem. 
[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support
[ https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709621#comment-15709621 ] Timothy M. Rodriguez commented on LUCENE-7575: -- Looks good to me too. Some additional suggestions: UnifiedHighlighter: * +1 on the suggestion to use HighlightFlags instead. PhraseHelper: * It's clearer in my opinion to change the boolean branch to something like {code} if (!requireFieldMatch) {} else {} {code} instead of checking {code} requireFieldMatch == false {code}. Even better would be swapping the branches so it's {code}if (requireFieldMatch) {} else {}{code} * Similar point for line 287 {code} if (requireFieldMatch && fieldName.equals(queryTerm.field()) == false) {} {code} TestUnifiedHighlighter: * I think it'd be clearer to separate the cases for term/phrase/multi-term queries into separate tests. This makes it easier to chase bugs down the line if only 1 fails. (And provides more information if all 3 fail) > UnifiedHighlighter: add requireFieldMatch=false support > --- > > Key: LUCENE-7575 > URL: https://issues.apache.org/jira/browse/LUCENE-7575 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: David Smiley >Assignee: David Smiley > Attachments: LUCENE-7575.patch > > > The UnifiedHighlighter (like the PostingsHighlighter) only supports > highlighting queries for the same fields that are being highlighted. The > original Highlighter and FVH support loosening this, AKA > requireFieldMatch=false. 
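The branch-style suggestion above (positive condition first, no comparison against {{false}}) can be shown with a small hypothetical sketch; the method and variable names here are invented for illustration, not the actual PhraseHelper code:

```java
// Hypothetical sketch of the review suggestion: the two forms are
// equivalent, but the positive-condition-first version reads more directly.
public class BranchStyleDemo {

    // Before: negative form with a boolean == false comparison.
    static boolean acceptBefore(boolean requireFieldMatch, String fieldName, String termField) {
        if (requireFieldMatch == false) {
            return true;
        } else {
            return fieldName.equals(termField);
        }
    }

    // After: positive condition first.
    static boolean acceptAfter(boolean requireFieldMatch, String fieldName, String termField) {
        if (requireFieldMatch) {
            return fieldName.equals(termField);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(acceptAfter(true, "body", "title"));  // false
        System.out.println(acceptAfter(false, "body", "title")); // true
    }
}
```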
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15693717#comment-15693717 ] Timothy M. Rodriguez commented on SOLR-9708: Haha, no problem. It'll improve usability quite a bit to be able to dynamically invoke it (and the other highlighters) per request. I'm glad it landed with the initial Solr release of the unified highlighter. > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > Attachments: SOLR-9708.patch > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691807#comment-15691807 ] Timothy M. Rodriguez commented on SOLR-9708: Looks great! Adding the other highlighters to the method really fleshed it out. Also in favor of the change from "default" to "original". No further suggested changes other than a rename on the FASTVECTOR enum to FAST_VECTOR. Let me know if you need any help with the wiki in December. Would be glad to contribute there as well. > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683730#comment-15683730 ] Timothy M. Rodriguez commented on SOLR-9708: Added a normalizeParameters method that will set tag.pre or post if simple.pre or post are set. > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
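The normalizeParameters fallback described above (hl.tag.pre/post defaulting to hl.simple.pre/post when unset) can be sketched as follows. This is a hypothetical sketch using a plain Map in place of Solr's SolrParams, not the actual patch code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: if hl.tag.pre / hl.tag.post are unset, fall back to
// hl.simple.pre / hl.simple.post, then to the usual <em> defaults.
public class NormalizeParamsDemo {

    static void normalizeParameters(Map<String, String> params) {
        params.putIfAbsent("hl.tag.pre",
                params.getOrDefault("hl.simple.pre", "<em>"));
        params.putIfAbsent("hl.tag.post",
                params.getOrDefault("hl.simple.post", "</em>"));
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("hl.simple.pre", "<b>");
        normalizeParameters(params);
        System.out.println(params.get("hl.tag.pre"));  // <b>
        System.out.println(params.get("hl.tag.post")); // </em>
    }
}
```

The design point is backward compatibility: requests written for the original highlighter's hl.simple.* names keep working when routed to the new highlighter.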
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683698#comment-15683698 ] Timothy M. Rodriguez commented on SOLR-9708: I've posted an initial commit that allows the user to override the configured highlighter based on the "hl.method" parameter. Two things I want to highlight: * The highlighter can no longer safely be statically determined using HighlightComponent.getHighlighter since a request parameter can override the pre-configured one. I've marked this usage deprecated as it affects quite a few places outside of this change. Is that okay? * Use of an enum for collecting all the highlight methods and giving a bit of extra type safety when switching over the values in the override. I'm not sure whether this is out of style and several static String fields are preferred (although I personally prefer the former). > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
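The enum-based dispatch described above can be sketched roughly like this. It is a hypothetical illustration (the enum constants and parse helper are invented here; the actual Solr code may differ): resolve the hl.method request parameter, falling back to the configured default when the parameter is absent or unrecognized.

```java
import java.util.Locale;

// Hypothetical sketch: an enum of highlight methods gives exhaustive,
// type-safe switching, versus loose static String constants.
public class HighlightMethodDemo {

    enum HighlightMethod {
        UNIFIED, ORIGINAL, POSTINGS, FAST_VECTOR;

        static HighlightMethod parse(String value, HighlightMethod fallback) {
            if (value == null) {
                return fallback;
            }
            try {
                return valueOf(value.trim().toUpperCase(Locale.ROOT));
            } catch (IllegalArgumentException e) {
                return fallback; // unknown hl.method value: keep the configured default
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(HighlightMethod.parse("unified", HighlightMethod.ORIGINAL)); // UNIFIED
        System.out.println(HighlightMethod.parse(null, HighlightMethod.ORIGINAL));      // ORIGINAL
    }
}
```

An enum also lets the compiler flag a switch that forgets a newly added method, which static String fields cannot do.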
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668824#comment-15668824 ] Timothy M. Rodriguez commented on SOLR-9708: I was suggesting instead of hl.tag.pre, but realized that's used too. No sense adding a third. Even though both names are not so ideal, IMO. > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668766#comment-15668766 ] Timothy M. Rodriguez commented on SOLR-9708: I thought the suggestion was to use hl.tag.pre instead of hl.simple.pre? > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668715#comment-15668715 ] Timothy M. Rodriguez commented on SOLR-9708: I'm okay with hl.tag.pre/post, but it may not always be a tag. Perhaps something like hl.pre.marker? or hl.pre.sigil? > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665317#comment-15665317 ] Timothy M. Rodriguez commented on LUCENE-7526: -- Added the proposed changes. I'm on the fence around the refactor for MultiValueTokenStream, I'd much prefer to get rid of it completely if we could. But for now having some symmetry between the two impls seems worthwhile to me? I'd like to punt on that one. > Improvements to UnifiedHighlighter OffsetStrategies > --- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley >Priority: Minor > Fix For: 6.4 > > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata. > * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. > ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. 
This should improve relevancy by providing unified > metrics for a wildcard across all its term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy 
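The CompositePostingsEnum idea in the description, treating a wildcard's expanded terms as one posting stream so scoring sees their summed per-document frequency, can be sketched conceptually. This sketch uses plain maps rather than Lucene's PostingsEnum API and is only an illustration of the aggregation:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch: a wildcard like purchas* expands to several terms;
// summing their per-document frequencies gives one unified metric for
// passage scoring instead of several weak per-term signals.
public class CompositeFreqDemo {

    /** Sums per-document frequencies across all terms matched by the wildcard. */
    static Map<Integer, Integer> compositeFreqs(Map<String, Map<Integer, Integer>> termToDocFreqs) {
        Map<Integer, Integer> summed = new HashMap<>();
        for (Map<Integer, Integer> docFreqs : termToDocFreqs.values()) {
            docFreqs.forEach((doc, freq) -> summed.merge(doc, freq, Integer::sum));
        }
        return summed;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, Integer>> postings = new HashMap<>();
        postings.put("purchase", Map.of(1, 2, 2, 1));   // doc -> freq
        postings.put("purchasing", Map.of(1, 1));
        // Doc 1 scores as freq 3 for purchas*, not as two separate terms.
        System.out.println(compositeFreqs(postings));
    }
}
```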
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665291#comment-15665291 ] Timothy M. Rodriguez commented on SOLR-9708: Thanks for catching those things. I've fixed them and pushed to the PR. Regarding the hl.useUnifiedHighlighter I'm actually very in favor of that idea, but perhaps that logic would be better in the highlight component? In that way the actual highlighters would be more like the facet params that help tweak which algorithm gets used. I agree that we really shouldn't have to "configure" the highlighters. Perhaps that should be a separate issue though, more in line with the other changes mentioned? > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662087#comment-15662087 ] Timothy M. Rodriguez commented on SOLR-9708: Let me know what you think. If it looks good, I think we can commit it. > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr
[ https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662086#comment-15662086 ] Timothy M. Rodriguez commented on SOLR-9708: I've pushed tests for the configurable items in the UH as well as for support of multiple snippets. In addition, a change was made to push highlighter-specific logic that was in the HighlightComponent down into the DefaultSolrHighlighter (thanks [~dsmiley] for pointing that out). > Expose UnifiedHighlighter in Solr > - > > Key: SOLR-9708 > URL: https://issues.apache.org/jira/browse/SOLR-9708 > Project: Solr > Issue Type: New Feature > Security Level: Public(Default Security Level. Issues are Public) > Components: highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.4 > > > This ticket is for creating a Solr plugin that can utilize the new > UnifiedHighlighter which was initially committed in > https://issues.apache.org/jira/browse/LUCENE-7438 
[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660330#comment-15660330 ] Timothy M. Rodriguez commented on LUCENE-7526: -- Other than that, I think this code is in good shape for committing. > Improvements to UnifiedHighlighter OffsetStrategies > --- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley >Priority: Minor > Fix For: 6.4 > > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata. > * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. > ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. 
This should improve relevancy by providing unified > metrics for a wildcard across all its term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy 
[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660328#comment-15660328 ] Timothy M. Rodriguez commented on LUCENE-7526: -- I've merged with the changes from LUCENE-7544 and also ran some benchmarks. (Thanks [~dsmiley] for the fix on LUCENE-7546!) Original:
||Impl||Terms||Phrases||Wildcards||
|(search)|1.14|1.43|2.44|
|SH_A|7.36|7.49|16.37|
|UH_A|5.32|4.55|9.24|
|SH_V|4.12|4.42|8.47|
|FVH_V|3.46|2.98|7.13|
|UH_V|3.7|3.45|6.61|
|PH_P|3.76|3.45|9.6|
|UH_P|3.34|2.91|9.33|
|UH_PV|3.26|2.8|6.72|
With improvements from LUCENE-7526:
||Impl||Terms||Phrases||Wildcards||
|(search)|1.18|1.38|2.52|
|SH_A|7.98|7.53|16.62|
|UH_A|5.46|4.6|9.43|
|SH_V|4.13|4.42|8.26|
|FVH_V|3.45|3.05|6.93|
|UH_V|3.79|3.43|6.62|
|PH_P|3.82|3.47|9.4|
|UH_P|3.33|3.03|9.46|
|UH_PV|3.24|2.81|6.92|
If you disable the new option to prefer passage relevancy over speed you'll get the following for analysis:
||Impl||Terms||Phrases||Wildcards||
|(search)|1.1|1.43|2.44|
|UH_A|5.31|4.66|9.14|
I wasn't able to get very consistent times with the benchmarks, but it looks like the changes maintain comparable performance while simplifying the code and improving relevancy in the Analysis case (unless preferPassageRelevancyOverSpeed is disabled). If that option is disabled, the timings line up pretty closely with the originals, providing a minor speed boost. There should also be a memory savings by avoiding re-creation of TokenStreams; that was difficult to measure, but it could prove beneficial if there is memory pressure. I performed these benchmarks on a machine with the following configuration:
Processor: AMD Phenom II X4 960T 3.0GHz
Memory: 24GB DDR3
Disk: Crucial CT256MX SSD
OS: Windows 10
Java: Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
All versions of the benchmarks incorporated above included the changes from LUCENE-7544. 
[~dsmiley] It looks like my older processor took significantly longer to highlight across the board than in your initial run for LUCENE-7438. I'd be curious how this set of changes performs on your machine now. > Improvements to UnifiedHighlighter OffsetStrategies > --- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley >Priority: Minor > Fix For: 6.4 > > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata. > * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. > ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. 
This should improve relevancy by providing unified > metrics for a wildcard across all its term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy
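The CompositePostingsEnum described above aggregates several per-term postings into one view, so a wildcard reports unified frequency metrics across all its term matches. As a rough, stdlib-only sketch of that idea (the class and method names here are mine, not Lucene's actual CompositePostingsEnum API), merging per-term docID-to-frequency maps and summing the overlaps captures the effect:

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class CompositeMerge {
    // Merge per-term posting lists (docID -> term frequency) into one view,
    // summing frequencies so a wildcard reports unified stats per document
    // instead of one separate count per expanded term.
    static SortedMap<Integer, Integer> merge(List<Map<Integer, Integer>> perTermPostings) {
        SortedMap<Integer, Integer> merged = new TreeMap<>();
        for (Map<Integer, Integer> postings : perTermPostings) {
            postings.forEach((doc, freq) -> merged.merge(doc, freq, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical wildcard purchas*: "purchase" matches docs 1 and 3,
        // "purchasing" matches docs 3 and 7.
        Map<Integer, Integer> purchase = Map.of(1, 2, 3, 1);
        Map<Integer, Integer> purchasing = Map.of(3, 2, 7, 1);
        SortedMap<Integer, Integer> merged = merge(List.of(purchase, purchasing));
        // Doc 3 now reports a single summed frequency of 3 across both variants.
        System.out.println(merged); // {1=2, 3=3, 7=1}
    }
}
```

The real enum also has to merge positions and offsets in document order, which this sketch leaves out.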
[jira] [Commented] (LUCENE-7546) Rename uses of people.apache.org to home.apache.org
[ https://issues.apache.org/jira/browse/LUCENE-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649031#comment-15649031 ] Timothy M. Rodriguez commented on LUCENE-7546: -- I like the idea of dist.apache.org as an alternative source. That should definitely be more stable than web space tied to an individual account. > Rename uses of people.apache.org to home.apache.org > --- > > Key: LUCENE-7546 > URL: https://issues.apache.org/jira/browse/LUCENE-7546 > Project: Lucene - Core > Issue Type: Task >Reporter: David Smiley > > The people.apache.org server was replaced by a different server > home.apache.org officially last year, and it appears to have completed > sometime this year. DNS for both points to the same machine but we should > reference home.apache.org now. *Unfortunately, some data was large enough > that ASF Infra didn't automatically move it, leaving that up to the > individuals to do. I think any data that hasn't been moved by now might be > gone.* > Here's a useful reference to this: EMPIREDB-234 The second part of that > issue also informs us that RC artifacts don't belong on home.apache.org; > there is https://dist.apache.org/repos/dist/dev/ for that. 6.3 was > done the right way... yet I see references to using people.apache.org in the > build for RCs.
[jira] [Commented] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623451#comment-15623451 ] Timothy M. Rodriguez commented on LUCENE-7438: -- Not yet, we have an initial general implementation, but it's lacking tests. (We have a customized extension internally that does have tests.) I've created a new ticket https://issues.apache.org/jira/browse/SOLR-9708 with a PR containing the initial impl so folks can follow or help the work towards finishing it up. Thanks for asking though, hopefully this gets the ball rolling faster. > UnifiedHighlighter > -- > > Key: LUCENE-7438 > URL: https://issues.apache.org/jira/browse/LUCENE-7438 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Affects Versions: 6.2 >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > Fix For: 6.3 > > Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, > LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch > > > The UnifiedHighlighter is an evolution of the PostingsHighlighter that is > able to highlight using offsets in either postings, term vectors, or from > analysis (a TokenStream). Lucene’s existing highlighters are mostly > demarcated along offset source lines, whereas here it is unified -- hence > this proposed name. In this highlighter, the offset source strategy is > separated from the core highlighting functionality. The UnifiedHighlighter > further improves on the PostingsHighlighter’s design by supporting accurate > phrase highlighting using an approach similar to the standard highlighter’s > WeightedSpanTermExtractor. The next major improvement is a hybrid offset > source strategy that utilizes postings and “light” term vectors (i.e. just the > terms) for highlighting multi-term queries (wildcards) without resorting to > analysis.
Phrase highlighting and wildcard highlighting can both be disabled > if you’d rather highlight a little faster albeit not as accurately reflecting > the query. > We’ve benchmarked an earlier version of this highlighter comparing it to the > other highlighters and the results were exciting! It’s tempting to share > those results but it’s definitely due for another benchmark, so we’ll work on > that. Performance was the main motivator for creating the UnifiedHighlighter, > as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy > requirements) wasn’t fast enough, even with term vectors along with several > improvements we contributed back, and even after we forked it to highlight in > multiple threads.
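The description above mentions several offset sources (postings, term vectors, analysis) plus options that trade highlighting accuracy against speed. A hedged, illustrative sketch of what such a flag-driven choice looks like — the enum names and selection rules here are my own simplification, not the actual HighlightFlag/OffsetSource API:

```java
import java.util.EnumSet;

public class StrategyPick {
    // Hypothetical flags and offset sources modeling the choice discussed
    // in this thread; names are illustrative, not Lucene's shipped API.
    enum Flag { PHRASES, MULTI_TERM_QUERY, PREFER_PASSAGE_RELEVANCY_OVER_SPEED }
    enum Source { POSTINGS, TERM_VECTORS, MEMORY_INDEX_ANALYSIS, TOKEN_STREAM_ANALYSIS }

    // If offsets are already indexed, use them; otherwise re-analyze.
    // When relevancy is preferred, analysis goes through a MemoryIndex;
    // when speed is preferred (and the query reduces to terms/automata),
    // a bare TokenStream suffices.
    static Source pick(boolean offsetsInPostings, boolean hasTermVectors, EnumSet<Flag> flags) {
        if (offsetsInPostings) return Source.POSTINGS;
        if (hasTermVectors) return Source.TERM_VECTORS;
        return flags.contains(Flag.PREFER_PASSAGE_RELEVANCY_OVER_SPEED)
            ? Source.MEMORY_INDEX_ANALYSIS
            : Source.TOKEN_STREAM_ANALYSIS;
    }

    public static void main(String[] args) {
        System.out.println(pick(false, false,
            EnumSet.of(Flag.PREFER_PASSAGE_RELEVANCY_OVER_SPEED)));
        // MEMORY_INDEX_ANALYSIS
    }
}
```

This is only the shape of the decision; the real highlighter makes it per field and per query.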
[jira] [Created] (SOLR-9708) Expose UnifiedHighlighter in Solr
Timothy M. Rodriguez created SOLR-9708: -- Summary: Expose UnifiedHighlighter in Solr Key: SOLR-9708 URL: https://issues.apache.org/jira/browse/SOLR-9708 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Components: highlighter Reporter: Timothy M. Rodriguez Priority: Minor This ticket is for creating a Solr plugin that can utilize the new UnifiedHighlighter which was initially committed in https://issues.apache.org/jira/browse/LUCENE-7438
[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616294#comment-15616294 ] Timothy M. Rodriguez commented on LUCENE-7526: -- Thanks [~dsmiley] :). I've just submitted the pull request. You're right this only removes an additional use of token streams. In the case of the Analysis strategies a TokenStream is still necessary at least initially to analyze the field. I'm glad I got to work on this during the wonderful Boston Hackday event (https://github.com/flaxsearch/london-hackday-2016). Thanks [~dsmiley] for some tips while there and [~mbraun688] for some initial feedback on the pr. > Improvements to UnifiedHighlighter OffsetStrategies > --- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Reporter: Timothy M. Rodriguez >Assignee: David Smiley >Priority: Minor > Fix For: 6.4 > > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata. > * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. 
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. This should improve relevancy by providing unified > metrics for a wildcard across all its term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy
[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612996#comment-15612996 ] Timothy M. Rodriguez commented on LUCENE-7526: -- Pull request forthcoming - I had some more merging work to do with master than I anticipated! > Improvements to UnifiedHighlighter OffsetStrategies > --- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Timothy M. Rodriguez >Priority: Minor > Labels: highlighter, unified-highlighter > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata. > * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. > ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. 
This should improve relevancy by providing unified > metrics for a wildcard across all its term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy
[jira] [Created] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
Timothy M. Rodriguez created LUCENE-7526: Summary: Improvements to UnifiedHighlighter OffsetStrategies Key: LUCENE-7526 URL: https://issues.apache.org/jira/browse/LUCENE-7526 Project: Lucene - Core Issue Type: Improvement Reporter: Timothy M. Rodriguez Priority: Minor This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams. The primary changes are as follows: * AnalysisOffsetStrategy - split into two offset strategies * MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing Offsets * TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex. Can only be used if the query distills down to terms and automata. * TokenStream removal * MemoryIndexOffsetStrategy - previously a TokenStream was created to fill the memory index and then once consumed a new one was generated by uninverting the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq queries) involved. Now this is avoided, which should save memory and avoid a second pass over the data. * TermVectorOffsetStrategy - this was refactored in a similar way to avoid generating a TokenStream if automata are involved. * PostingsWithTermVectorsOffsetStrategy - similar refactoring * CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/mtq queries. This should improve relevancy by providing unified metrics for a wildcard across all its term matches * Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy since it can adversely affect passage relevancy
[jira] [Updated] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy M. Rodriguez updated LUCENE-7526: - Description: This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams. The primary changes are as follows: * AnalysisOffsetStrategy - split into two offset strategies ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing Offsets ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex. Can only be used if the query distills down to terms and automata. * TokenStream removal ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill the memory index and then once consumed a new one was generated by uninverting the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq queries) involved. Now this is avoided, which should save memory and avoid a second pass over the data. ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid generating a TokenStream if automata are involved. ** PostingsWithTermVectorsOffsetStrategy - similar refactoring * CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/mtq queries. This should improve relevancy by providing unified metrics for a wildcard across all its term matches * Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy since it can adversely affect passage relevancy was: This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams. The primary changes are as follows: * AnalysisOffsetStrategy - split into two offset strategies * MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing Offsets * TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex.
Can only be used if the query distills down to terms and automata. * TokenStream removal * MemoryIndexOffsetStrategy - previously a TokenStream was created to fill the memory index and then once consumed a new one was generated by uninverting the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq queries) involved. Now this is avoided, which should save memory and avoid a second pass over the data. * TermVectorOffsetStrategy - this was refactored in a similar way to avoid generating a TokenStream if automata are involved. * PostingsWithTermVectorsOffsetStrategy - similar refactoring * CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/mtq queries. This should improve relevancy by providing unified metrics for a wildcard across all its term matches * Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy since it can adversely affect passage relevancy > Improvements to UnifiedHighlighter OffsetStrategies > --- > > Key: LUCENE-7526 > URL: https://issues.apache.org/jira/browse/LUCENE-7526 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Timothy M. Rodriguez >Priority: Minor > Labels: highlighter, unified-highlighter > > This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies > by reducing reliance on creating or re-creating TokenStreams. > The primary changes are as follows: > * AnalysisOffsetStrategy - split into two offset strategies > ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a > MemoryIndex for producing Offsets > ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a > MemoryIndex. Can only be used if the query distills down to terms and > automata.
> * TokenStream removal > ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill > the memory index and then once consumed a new one was generated by > uninverting the MemoryIndex back into a TokenStream if there were automata > (wildcard/mtq queries) involved. Now this is avoided, which should save > memory and avoid a second pass over the data. > ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid > generating a TokenStream if automata are involved. > ** PostingsWithTermVectorsOffsetStrategy - similar refactoring > * CompositePostingsEnum - aggregates several underlying PostingsEnums for > wildcard/mtq queries. This should improve relevancy by providing unified > metrics for a wildcard across all its term matches > * Added a HighlightFlag for enabling the newly separated > TokenStreamOffsetStrategy since it can adversely affect passage relevancy
[jira] [Commented] (SOLR-9542) Kerberos delegation tokens requires missing Jackson library
[ https://issues.apache.org/jira/browse/SOLR-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537247#comment-15537247 ] Timothy M. Rodriguez commented on SOLR-9542: Not sure it makes sense to introduce a Jackson dependency here. I'm conflicted on how big of an issue this is though. It's a really old version of jackson since it depends on the org.codehaus version. On the other hand, it's probably less likely to conflict as such. > Kerberos delegation tokens requires missing Jackson library > --- > > Key: SOLR-9542 > URL: https://issues.apache.org/jira/browse/SOLR-9542 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Ishan Chattopadhyaya > Attachments: SOLR-9542.patch > > > GET, RENEW or CANCEL operations for the delegation tokens support requires > the Solr server to have old jackson added as a dependency. > Steps to reproduce the problem: > 1) Configure Solr to use delegation tokens > 2) Start Solr > 3) Use a SolrJ application to get a delegation token. 
> The server throws the following: > {code} > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper > at > org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.managementOperation(DelegationTokenAuthenticationHandler.java:279) > at > org.apache.solr.security.KerberosPlugin$RequestContinuesRecorderAuthenticationHandler.managementOperation(KerberosPlugin.java:566) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:514) > at > org.apache.solr.security.DelegationTokenKerberosFilter.doFilter(DelegationTokenKerberosFilter.java:123) > at > org.apache.solr.security.KerberosPlugin.doAuthenticate(KerberosPlugin.java:265) > at > org.apache.solr.servlet.SolrDispatchFilter.authenticateRequest(SolrDispatchFilter.java:318) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:518) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
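The NoClassDefFoundError in the trace above only surfaces once the delegation-token code path first touches the missing org.codehaus ObjectMapper. One way to surface an optional-dependency gap like this earlier is a reflective presence check at startup; this is a generic sketch, not code from Solr or Hadoop:

```java
public class DependencyCheck {
    // Report whether a class is present on the classpath without linking
    // against it at compile time.
    static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class whose absence triggers the NoClassDefFoundError above.
        String jackson = "org.codehaus.jackson.map.ObjectMapper";
        if (!isPresent(jackson)) {
            System.err.println("Delegation token support needs " + jackson
                + " on the classpath; add the (old) org.codehaus Jackson jar.");
        }
    }
}
```

A check like this turns a deep runtime stack trace into a clear configuration error at startup.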
[jira] [Commented] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537117#comment-15537117 ] Timothy M. Rodriguez commented on LUCENE-7438: -- After further consideration, it seems best to leave some of the classes common between the Postings and Unified highlighters separate. If we were to use the same classes, they'd ideally move to a common sub-package that both could share, and this would introduce unneeded change and hurt potential compatibility for any users of those classes. Keeping them separate also allows for a possible improvement to the method highlightFieldsAsObjects, which internally creates a Map that is promptly thrown away again in the highlight methods. I briefly investigated changing this to return the internal Object[][] array and avoid the extra Map allocation, but this creates some awkwardness since the Object[][] array sorts the input fields before filling the arrays, which would make the API somewhat of a trap for callers. This undesired behavior is likely why the map is being created. One way to fix this is to generify PassageFormatter over its output type, which would allow for a PassageFormatter<String> in the case of the DefaultPassageFormatter. However, this is a rather involved change that could ultimately result in the UnifiedHighlighter itself having a generic type, and it was not clear that muddying the waters with that right now was a good idea. Still, keeping these classes separate will allow for an attempt at that in the future. In the meantime, I've also pushed a commit to reduce the visibility of MultiTermHighlighting to package-private. As it stands, I think this patch is ready. > UnifiedHighlighter > -- > > Key: LUCENE-7438 > URL: https://issues.apache.org/jira/browse/LUCENE-7438 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Affects Versions: 6.2 >Reporter: Timothy M.
Rodriguez >Assignee: David Smiley > Attachments: LUCENE_7438_UH_benchmark.patch > > > The UnifiedHighlighter is an evolution of the PostingsHighlighter that is > able to highlight using offsets in either postings, term vectors, or from > analysis (a TokenStream). Lucene’s existing highlighters are mostly > demarcated along offset source lines, whereas here it is unified -- hence > this proposed name. In this highlighter, the offset source strategy is > separated from the core highlighting functionality. The UnifiedHighlighter > further improves on the PostingsHighlighter’s design by supporting accurate > phrase highlighting using an approach similar to the standard highlighter’s > WeightedSpanTermExtractor. The next major improvement is a hybrid offset > source strategy that utilizes postings and “light” term vectors (i.e. just the > terms) for highlighting multi-term queries (wildcards) without resorting to > analysis. Phrase highlighting and wildcard highlighting can both be disabled > if you’d rather highlight a little faster albeit not as accurately reflecting > the query. > We’ve benchmarked an earlier version of this highlighter comparing it to the > other highlighters and the results were exciting! It’s tempting to share > those results but it’s definitely due for another benchmark, so we’ll work on > that. Performance was the main motivator for creating the UnifiedHighlighter, > as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy > requirements) wasn’t fast enough, even with term vectors along with several > improvements we contributed back, and even after we forked it to highlight in > multiple threads.
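The generics idea floated in the comment above — parameterizing PassageFormatter over its output type — can be sketched without Lucene. The interface and class names below mirror Lucene's, but the generic signature is the hypothetical change under discussion, not the shipped API:

```java
import java.util.List;

public class GenericFormatterSketch {
    // A formatter parameterized over its output type, so String-producing
    // callers avoid the Object-typed results (and the Object[][]/Map
    // indirection) discussed in the comment above.
    interface PassageFormatter<T> {
        T format(List<String> passages);
    }

    // The String-producing default: joins passages with an ellipsis separator.
    static class DefaultFormatter implements PassageFormatter<String> {
        @Override
        public String format(List<String> passages) {
            return String.join("... ", passages);
        }
    }

    public static void main(String[] args) {
        PassageFormatter<String> f = new DefaultFormatter();
        System.out.println(f.format(List.of("first <b>hit</b>", "second <b>hit</b>")));
        // first <b>hit</b>... second <b>hit</b>
    }
}
```

As the comment notes, the cost is that the type parameter tends to propagate outward to the highlighter itself.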
[jira] [Commented] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474731#comment-15474731 ] Timothy M. Rodriguez commented on LUCENE-7438: -- Actually, I think passage relevancy might be something we'd look into in more detail down the line. Definitely, some of the things in LUCENE-4909 could be useful. :) I see merit in keeping things separate to allow for flexibility. > UnifiedHighlighter > -- > > Key: LUCENE-7438 > URL: https://issues.apache.org/jira/browse/LUCENE-7438 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Affects Versions: 6.2 >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > > The UnifiedHighlighter is an evolution of the PostingsHighlighter that is > able to highlight using offsets in either postings, term vectors, or from > analysis (a TokenStream). Lucene’s existing highlighters are mostly > demarcated along offset source lines, whereas here it is unified -- hence > this proposed name. In this highlighter, the offset source strategy is > separated from the core highlighting functionality. The UnifiedHighlighter > further improves on the PostingsHighlighter’s design by supporting accurate > phrase highlighting using an approach similar to the standard highlighter’s > WeightedSpanTermExtractor. The next major improvement is a hybrid offset > source strategy that utilizes postings and “light” term vectors (i.e. just the > terms) for highlighting multi-term queries (wildcards) without resorting to > analysis. Phrase highlighting and wildcard highlighting can both be disabled > if you’d rather highlight a little faster albeit not as accurately reflecting > the query. > We’ve benchmarked an earlier version of this highlighter comparing it to the > other highlighters and the results were exciting! It’s tempting to share > those results but it’s definitely due for another benchmark, so we’ll work on > that.
Performance was the main motivator for creating the UnifiedHighlighter, > as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy > requirements) wasn’t fast enough, even with term vectors along with several > improvements we contributed back, and even after we forked it to highlight in > multiple threads.
[jira] [Commented] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473896#comment-15473896 ] Timothy M. Rodriguez commented on LUCENE-7438: -- I'm not a fan of forking classes with a Uxyz naming scheme. I think it'd be better to make the existing classes re-usable or keep the current naming scheme. That being said, if we make the existing classes re-usable, it might be better to plan on moving them into some common package later on so it's clearer that they are re-used. > UnifiedHighlighter > -- > > Key: LUCENE-7438 > URL: https://issues.apache.org/jira/browse/LUCENE-7438 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/highlighter >Affects Versions: 6.2 >Reporter: Timothy M. Rodriguez >Assignee: David Smiley > > The UnifiedHighlighter is an evolution of the PostingsHighlighter that is > able to highlight using offsets in either postings, term vectors, or from > analysis (a TokenStream). Lucene’s existing highlighters are mostly > demarcated along offset source lines, whereas here it is unified -- hence > this proposed name. In this highlighter, the offset source strategy is > separated from the core highlighting functionality. The UnifiedHighlighter > further improves on the PostingsHighlighter’s design by supporting accurate > phrase highlighting using an approach similar to the standard highlighter’s > WeightedSpanTermExtractor. The next major improvement is a hybrid offset > source strategy that utilizes postings and “light” term vectors (i.e. just the > terms) for highlighting multi-term queries (wildcards) without resorting to > analysis. Phrase highlighting and wildcard highlighting can both be disabled > if you’d rather highlight a little faster albeit not as accurately reflecting > the query. > We’ve benchmarked an earlier version of this highlighter comparing it to the > other highlighters and the results were exciting!
It’s tempting to share > those results but it’s definitely due for another benchmark, so we’ll work on > that. Performance was the main motivator for creating the UnifiedHighlighter, > as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy > requirements) wasn’t fast enough, even with term vectors along with several > improvements we contributed back, and even after we forked it to highlight in > multiple threads.
[jira] [Comment Edited] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470904#comment-15470904 ] Timothy M. Rodriguez edited comment on LUCENE-7438 at 9/7/16 7:31 PM: -- Pull request here: https://github.com/apache/lucene-solr/pull/79 I'd also like to specially acknowledge [~dsmiley] who has worked with us closely. He did the lion's share of the work represented here. (Including the genesis of the idea for unifying the disparate highlighters.) was (Author: timothy055): Pull request here: https://github.com/apache/lucene-solr/pull/79 I'd also like to specially acknowledge [~dsmiley] who has worked with us closely. He did a very significant share of the work represented here. (Including the genesis of the idea for unifying the disparate highlighters.)
[jira] [Commented] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470904#comment-15470904 ] Timothy M. Rodriguez commented on LUCENE-7438: -- Pull request here: https://github.com/apache/lucene-solr/pull/79 I'd also like to specially acknowledge [~dsmiley] who has worked with us closely. He did a very significant share of the work represented here.
[jira] [Commented] (LUCENE-7438) UnifiedHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470848#comment-15470848 ] Timothy M. Rodriguez commented on LUCENE-7438: -- Some additional information: h2. Missing features & possible future improvements: Despite the offset source flexibility and accuracy options of this highlighter, it continues to be the case that some highlighters have unique features. The following features are in the standard Highlighter (and possibly FastVectorHighlighter) but are not in the UnifiedHighlighter (and thus not the PostingsHighlighter either, since the UH is derived from the PH): * Being able to disable “requireFieldMatch”, thus highlighting insensitive to whichever fields the query mentions. * Using boosts in the query to weight passages. * Regex based passage delineation, though I’m unsure if anyone cares given the existing BreakIterator options available. Aside from addressing the feature gaps listed above, there are a couple of known things that would be nice to add: * The phrase highlighting (implemented by PhraseHelper) could be made more accurate, and probably faster too, by using techniques from Alan’s Luwak system, which uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t done this way to begin with because this highlighter was developed originally for Lucene 4.10. * Wildcard queries usually use TokenStreamFromTermVector, which uninverts the terms out of a Terms index. Instead, we now think it would be better to create a PostingsEnum for each matching term. This would bring about some simplifications and efficiencies, and can lead to better passage relevancy. A bonus would be aggregating terms matching the same automata into a merged PostingsEnum that has a freq() based on the sum of the underlying matching terms. h2. Changes from the PostingsHighlighter * The UH is more stateful ** Holds the IndexSearcher instead of asking most methods to pass it through.
** Options now have simple setters, and the per-field getters return these. This means the common case of a setting being non-specific to a field doesn’t require subclassing. * Multi-valued field handling is improved to ensure that a passage will never span across values, plus it honors the positionIncrementGap for an analyzed offset source. See MultiValueTokenStream and SplittingBreakIterator. * The PH caches all content to be highlighted for all docs and then highlights it all. The UH has a limit on this, which led to a batching approach. But if all fields use an Analyzer, or if more than one uses term vectors, then highlighting instead happens one doc at a time since the up-front content caching is not helpful. * No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one doc to the next. This really simplified some code; the re-use didn’t seem worth it. * MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it to guard against ramifications of exceptions being thrown during highlighting (e.g. a BreakIterator bug or TokenStream bug). Nasty to debug! * (from standard Highlighter) TokenStreamFromTermVector: optimizations to uninvert filtered (thus sparse) Terms. h2. Non-Core Dependencies * MemoryIndex: For Analyzer-based highlighting when phrases need to be highlighted accurately. * Standard Highlighter things: ** TokenStreamFromTermVector: For most multi-term queries. The UH actually has its own derived copy that has been optimized to handle filtered (thus sparse) Terms. With further work, we could switch to a different approach and remove it (as indicated earlier). For as long as it stays, it’s also possible to replace the existing one with this if we want to do that. ** WeightedSpanTermExtractor: For highlighting phrases accurately, to re-use its SpanQuery conversion and rewrite-detection abilities. Perhaps these parts of WSTE could move to general SpanQuery utilities.
** TermVectorLeafReader: When highlighting offsets from term vectors. * PostingsHighlighter things: ** Technically nothing; however, the UH has multiple copies of some things that have not been modified: Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter. ** Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: WholeBreakIterator, CustomSeparatorBreakIterator. Maybe they should move to a utils package that isn’t in any of these highlighters?
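The freq()-summing aggregation proposed above for multi-term queries can be sketched with plain maps standing in for per-term postings. This is a toy illustration, not Lucene's PostingsEnum API; the names and data shapes are invented. For a wildcard like purchas*, each matching indexed term contributes its own doc→frequency postings, and merging them yields one view whose per-doc frequency is the sum over the matching terms:

```java
import java.util.Map;
import java.util.TreeMap;

public class MergedPostings {
    // Merge per-term postings (term -> (docId -> freq)) into one docId -> summed freq view.
    static Map<Integer, Integer> merge(Map<String, Map<Integer, Integer>> perTerm) {
        Map<Integer, Integer> merged = new TreeMap<>();
        for (Map<Integer, Integer> postings : perTerm.values()) {
            // Sum frequencies when the same doc contains several matching terms.
            postings.forEach((doc, freq) -> merged.merge(doc, freq, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical postings for two terms matched by purchas*.
        Map<String, Map<Integer, Integer>> perTerm = Map.of(
            "purchase",   Map.of(1, 2, 3, 1),
            "purchasing", Map.of(1, 1, 2, 4));
        // Doc 1 reports freq 3 (2 + 1), which a passage scorer could then use.
        System.out.println(merge(perTerm)); // {1=3, 2=4, 3=1}
    }
}
```

This is the intuition behind keeping the summed frequency for MTQ terms: a doc (or passage) matching several variants of the wildcard scores as if the pattern occurred that many times.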
[jira] [Created] (LUCENE-7438) UnifiedHighlighter
Timothy M. Rodriguez created LUCENE-7438: Summary: UnifiedHighlighter Key: LUCENE-7438 URL: https://issues.apache.org/jira/browse/LUCENE-7438 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 6.2 Reporter: Timothy M. Rodriguez