[jira] [Commented] (LUCENE-8145) UnifiedHighlighter should use single OffsetEnum rather than List

2018-01-31 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347823#comment-16347823
 ] 

Timothy M. Rodriguez commented on LUCENE-8145:
--

Thanks for the CC [~dsmiley].

[~romseygeek] really nice change!  It definitely simplifies things quite a bit, 
and conceptually one meta OffsetsEnum over the field makes more sense than the 
previous list.

I'm in favor of keeping the summed frequency on MTQ, or at least preserving a 
mechanism to keep it on.  The extra occurrences are not spurious in every 
case.  For example, consider "expert" systems where users are accustomed to 
using wildcards for stemming-like expressions, e.g. purchas* to match variants 
of the word purchase.  In those cases, the extra frequency counts would 
hopefully select a better passage.



I'm not so sure about passing a scorer and content length into setScore to 
set the score, though.  That feels awkward to me.  If we were to keep it this 
way, I'd argue a Passage should receive the PassageScorer and content length at 
construction instead of via the setScore method.  If we did that, I think we 
could build the score incrementally instead of tracking terms and frequencies 
for a later score calculation?  Another choice is to move much of the scoring 
behavior out, perhaps by introducing another class that tracks the terms and 
score in a passage, analogous to Weight?
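
To make the construction-time idea concrete, here is a minimal sketch. It assumes a simplified additive scoring model; OccurrenceWeigher and ScoredPassage are hypothetical names used only for illustration and are not the highlighter's actual Passage or PassageScorer API.

```java
public class IncrementalPassageSketch {

    /** Hypothetical stand-in for a passage scorer: the weight of one term occurrence. */
    public interface OccurrenceWeigher {
        double weight(int contentLength, int termFreqInContent);
    }

    /** Hypothetical passage that receives its scorer and content length at construction. */
    public static final class ScoredPassage {
        private final OccurrenceWeigher weigher;
        private final int contentLength;
        private double score; // running total, updated on every match

        public ScoredPassage(OccurrenceWeigher weigher, int contentLength) {
            this.weigher = weigher;
            this.contentLength = contentLength;
        }

        /** Records one matching occurrence and folds its weight into the score immediately. */
        public void addMatch(int termFreqInContent) {
            score += weigher.weight(contentLength, termFreqInContent);
        }

        public double score() {
            return score;
        }
    }

    /** Scores a passage with two matches: a term seen once in the content and one seen twice. */
    public static double demoScore() {
        // Assumed weighting for the sketch: rarer terms in the content count for more.
        ScoredPassage passage = new ScoredPassage((len, tf) -> 1.0 / tf, 100);
        passage.addMatch(1);
        passage.addMatch(2);
        return passage.score();
    }

    public static void main(String[] args) {
        System.out.println(demoScore());
    }
}
```

This only works as-is when each occurrence contributes independently.  A score that normalizes by the final passage length could not be accumulated this way without a finishing step.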

 

 

> UnifiedHighlighter should use single OffsetEnum rather than List
> 
>
> Key: LUCENE-8145
> URL: https://issues.apache.org/jira/browse/LUCENE-8145
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Minor
> Attachments: LUCENE-8145.patch
>
>
> The UnifiedHighlighter deals with several different aspects of highlighting: 
> finding highlight offsets, breaking content up into snippets, and passage 
> scoring.  It would be nice to split this up so that consumers can use them 
> separately.
> As a first step, I'd like to change the API of FieldOffsetStrategy to return 
> a single unified OffsetsEnum, rather than a collection of them.  This will 
> make it easier to expose the OffsetsEnum of a document directly from the 
> highlighter, bypassing snippet extraction and scoring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents

2017-10-20 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213052#comment-16213052
 ] 

Timothy M. Rodriguez commented on LUCENE-7976:
--

I didn't know that! Thanks for pointing that out.

> Add a parameter to TieredMergePolicy to merge segments that have more than X 
> percent deleted documents
> --
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:  
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many 
> TB) solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
>  (no, that's not a serious name; suggestions 
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO 
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with 
> >> smaller segments to bring the resulting segment up to 5G. If no smaller 
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). 
> >> It would be rewritten into a single segment removing all deleted docs no 
> >> matter how big it is to start. The 100G example above would be rewritten 
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of  shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.
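
The figures in the description above can be checked with a little arithmetic.  The sketch below assumes that the 97.5G figure means a segment becomes merge-eligible once its live data drops below half the max segment size, and that segment size shrinks in proportion to the share of deleted documents.  The class and method names are hypothetical and are not part of TieredMergePolicy.

```java
public class MergeEligibilitySketch {

    /**
     * GB that must be deleted before a segment's live data falls to half the
     * max segment size (assumed eligibility rule for this sketch).
     */
    public static double deletedGbBeforeEligible(double segmentGb, double maxSegmentGb) {
        return Math.max(0.0, segmentGb - maxSegmentGb / 2.0);
    }

    /** Size of the segment after rewriting it without its deleted documents. */
    public static double rewrittenGb(double segmentGb, double pctDeleted) {
        return segmentGb * (100.0 - pctDeleted) / 100.0;
    }

    public static void main(String[] args) {
        // Current behavior: a 100G segment with a 5G max segment size.
        System.out.println(deletedGbBeforeEligible(100.0, 5.0));
        // Proposed behavior: the same segment rewritten at 20% deletes.
        System.out.println(rewrittenGb(100.0, 20.0));
    }
}
```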






[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213042#comment-16213042
 ] 

Timothy M. Rodriguez commented on LUCENE-8000:
--

[~rcmuir] thanks for the further explanation; that helped clarify.  It does 
seem the effect would be minor at best.  It'd be an interesting experiment at 
some point, though.  If I ever get to trying it, I'll post back.

[~gol...@detego-software.de] As an additional point, advanced use cases often 
"stack" tokens for other purposes as well, and these would distort length 
further.  For example, some folks use analysis chains that stack variants of 
URLs, currencies, etc.

> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Christoph Goller
> Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
> @Override
> public final long computeNorm(FieldInvertState state) {
>   final int numTerms = discountOverlaps
>       ? state.getLength() - state.getNumOverlap()
>       : state.getLength();
>   int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>   if (indexCreatedVersionMajor >= 7) {
>     return SmallFloat.intToByte4(numTerms);
>   } else {
>     return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>   }
> }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers 
> me is that average document length is based on sumTotalTermFreq for a field. 
> As far as I understand, that sums up totalTermFreqs for all terms of a field, 
> therefore counting positions of terms including those that overlap.
> {code}
> protected float avgFieldLength(CollectionStatistics collectionStats) {
>   final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>   if (sumTotalTermFreq <= 0) {
>     return 1f;   // field does not exist, or stat is unsupported
>   } else {
>     final long docCount = collectionStats.docCount() == -1
>         ? collectionStats.maxDoc()
>         : collectionStats.docCount();
>     return (float) (sumTotalTermFreq / (double) docCount);
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms, or in my use case 
> different normal forms of tokens at the same position, are shorter and 
> therefore get higher scores than they should, and that we do not use the 
> whole spectrum of relative document length of BM25.
> I think for BM25 discountOverlaps should default to false.






[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents

2017-10-20 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213024#comment-16213024
 ] 

Timothy M. Rodriguez commented on LUCENE-7976:
--

An additional place where deletions come up is in replica differences due to 
the way merging happened on a shard.  This can cause jitter in results, where 
the ordering depends on which replica of the shard answered a query because 
the frequencies differ significantly enough.  I know this problem will never go 
away completely, as we can't flush away deletes immediately, but allowing some 
reclamation of deletes in large segments will help minimize the issue.
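
As a rough illustration of the jitter, the sketch below evaluates the BM25-style idf formula for two replicas whose statistics differ only because one still carries unmerged deletes (index statistics keep counting deleted documents until a merge drops them).  The replica numbers are made up for the example.

```java
public class IdfJitterSketch {

    /** BM25-style idf: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Replica A: deletes already merged away (illustrative numbers).
        double cleanReplica = idf(100, 1000);
        // Replica B: same live documents, but deleted copies still counted.
        double staleReplica = idf(250, 1300);
        System.out.println(cleanReplica + " vs " + staleReplica);
    }
}
```

The same term gets a different weight on each replica, which is enough to reorder closely scored documents.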

On max segment size, I also think the merge policy ought to dutifully respect 
maxSegmentSize.  If we don't, other smaller bugs can come up for users, such as 
hitting ulimits on file size that they thought they were safely under.




[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-19 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211286#comment-16211286
 ] 

Timothy M. Rodriguez commented on LUCENE-8000:
--

Makes sense, agreed on both points.




[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-19 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211227#comment-16211227
 ] 

Timothy M. Rodriguez commented on LUCENE-8000:
--

+1 for keeping the existing behavior of true.  It definitely struck me as weird 
too, but for many indexes flipping the default would result in markedly worse 
behavior.  Rather than disabling discount overlaps, maybe the more ideal 
behavior would be making the average document length equal to the total number 
of positions across the collection divided by the number of documents? That way 
we'd be comparing position length to average position length? However, I 
haven't looked into the feasibility or expense of doing that.  If we were able 
to do that, discountOverlaps could move to something like countPositions vs 
countFrequencies.
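
A small worked example may help show the mismatch and the alternative suggested here.  This is only a sketch: it uses the standard BM25 length factor 1 - b + b * dl / avgdl with b = 0.75, ignores the lossy norm encoding, and the two-document collection is invented for illustration.

```java
public class Bm25LengthSketch {

    /** Plain average of per-document lengths. */
    public static double average(int[] lengths) {
        long sum = 0;
        for (int length : lengths) {
            sum += length;
        }
        return sum / (double) lengths.length;
    }

    /** BM25 length normalization factor: 1 - b + b * dl / avgdl. */
    public static double lengthNorm(double b, double docLength, double avgLength) {
        return 1.0 - b + b * docLength / avgLength;
    }

    public static void main(String[] args) {
        // Doc 1: 4 positions plus 2 stacked synonyms (6 term occurrences).
        // Doc 2: 4 positions, no stacked tokens (4 term occurrences).
        int[] occurrences = {6, 4}; // what sumTotalTermFreq counts
        int[] positions = {4, 4};   // what discountOverlaps=true counts per doc

        double avgByFrequency = average(occurrences);
        double avgByPosition = average(positions);

        // Doc length is measured in positions in both cases.
        System.out.println(lengthNorm(0.75, 4, avgByFrequency)); // both docs look shorter than average
        System.out.println(lengthNorm(0.75, 4, avgByPosition));  // both docs look exactly average
    }
}
```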




[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents

2017-10-04 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191470#comment-16191470
 ] 

Timothy M. Rodriguez commented on LUCENE-7976:
--

If a collection has many 5GB segments, it's possible for many of them to be at 
less than 50% deletes but still accumulate a fair amount of deleted documents 
overall.  Increasing the max segment size helps, but increases the amount of 
churn on disk through large merges.




[jira] [Commented] (LUCENE-7976) Add a parameter to TieredMergePolicy to merge segments that have more than X percent deleted documents

2017-10-04 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191319#comment-16191319
 ] 

Timothy M. Rodriguez commented on LUCENE-7976:
--

Agreed, it's not strictly a result of optimizations.  It can happen for large 
collections or with many updates to existing documents.




[jira] [Commented] (LUCENE-6513) Allow limits on SpanMultiTermQueryWrapper expansion

2017-10-02 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188913#comment-16188913
 ] 

Timothy M. Rodriguez commented on LUCENE-6513:
--

Apologies for the late alternative implementation.  For what it's worth, we've 
been utilizing this patch for about a year and it's helped improve 
responsiveness to queries while limiting the expansions.

> Allow limits on SpanMultiTermQueryWrapper expansion
> ---
>
> Key: LUCENE-6513
> URL: https://issues.apache.org/jira/browse/LUCENE-6513
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Alan Woodward
> Priority: Minor
> Attachments: LUCENE-6513.patch, LUCENE-6513.patch, LUCENE-6513.patch, 
> LUCENE-6513.patch
>
>
> SpanMultiTermQueryWrapper currently rewrites to a SpanOrQuery with as many 
> clauses as there are matching terms.  It would be nice to be able to limit 
> this in a slightly nicer way than using TopTerms, which for most queries just 
> translates to a lexicographical ordering.






[jira] [Commented] (LUCENE-6513) Allow limits on SpanMultiTermQueryWrapper expansion

2017-07-05 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075601#comment-16075601
 ] 

Timothy M. Rodriguez commented on LUCENE-6513:
--

[~romseygeek] we've also written a patch to solve this problem that we've been 
meaning to share with the community.  It goes about the solution in a somewhat 
different way.  We'll try to get it up here in a day or two, though I'm not 
sure which approach will be preferable.




[jira] [Commented] (LUCENE-7844) UnifiedHighlighter: simplify "maxPassages" input API

2017-05-25 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16025178#comment-16025178
 ] 

Timothy M. Rodriguez commented on LUCENE-7844:
--

This syntax looks really good!
{code}
unifiedHighlighter.highlight(query, topDocs, 
 unifiedHighlighter.fieldOptionsWhole("title"),
 unifiedHighlighter.fieldOptions("body", 3)
);
{code}

with maybe {code}unifiedHighlighter.fieldOptionsWhole();{code} being a 
specialization of {code}unifiedHighlighter.fieldOptions("title", 3, 
BreakOption.WHOLE);{code} or something to that effect.

Fair point on the performance difference being negligible.  For now, I'd be in 
favor of leaving the current parallel array approach and working towards a 
fieldOption approach.  I can offer to help on that end!


> UnifiedHighlighter: simplify "maxPassages" input API
> 
>
> Key: LUCENE-7844
> URL: https://issues.apache.org/jira/browse/LUCENE-7844
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE_7844__UH_maxPassages_simplification.patch
>
>
> The "maxPassages" input to the UnifiedHighlighter can be provided as an array 
> to some of the public methods on UnifiedHighlighter.  When it's provided as 
> an array, the index in the array is for the field in a parallel array. I 
> think this is awkward and furthermore it's inconsistent with the way this 
> highlighter customizes things on a by field basis.  Instead, the parameter 
> can be a simple int default (not an array), and then there can be a protected 
> method like {{getMaxPassageCount(String field)}} that returns an Integer 
> which, when non-null, replaces the default value for this field.
> Aside from API simplicity and consistency, this will also remove some 
> annoying parallel array sorting going on.






[jira] [Commented] (LUCENE-7844) UnifiedHighlighter: simplify "maxPassages" input API

2017-05-25 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16024765#comment-16024765
 ] 

Timothy M. Rodriguez commented on LUCENE-7844:
--

+1 on the comparator use.  That definitely cleaned up some code.

I'm a bit uncertain about the maxPassages change, however.  I think it may be 
pretty common to vary the number of passages required per field.  For example, 
a user may want to highlight a title fully (one passage) and get several 
passages from the primary content field.  The motivation to get rid of the 
parallel arrays makes a lot of sense; maybe we could try to lump all these 
options into an object per field?  For lack of a better name, something like 
FieldOptions[] or the like?  Longer term, I could even see options for the 
break iterator, scorer, and formatter being configured per field.  (In the 
previous example, it may be better to have a dummy iterator that chunks on 
value delineations, a noop scorer, and a formatter that just returns the entire 
stored value for the title, while the content would have more traditional 
options.)  I know this is all still possible with overrides in the current 
design, but I'm not sure we should push it further into the "specialized" 
use-case area. What do you think?
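
A per-field options object along these lines could look roughly like the sketch below.  Everything here is hypothetical: FieldOptions, BreakOption, and the factory methods are illustrative names only, not existing UnifiedHighlighter API, and BreakOption.SENTENCE is an assumed default.

```java
public class FieldOptionsSketch {

    /** Hypothetical break strategy per field. */
    public enum BreakOption { SENTENCE, WHOLE }

    /** Hypothetical bundle of per-field settings, replacing parallel arrays. */
    public static final class FieldOptions {
        public final String field;
        public final int maxPassages;
        public final BreakOption breakOption;

        public FieldOptions(String field, int maxPassages, BreakOption breakOption) {
            this.field = field;
            this.maxPassages = maxPassages;
            this.breakOption = breakOption;
        }
    }

    /** Typical content field: several passages, sentence breaks (assumed default). */
    public static FieldOptions fieldOptions(String field, int maxPassages) {
        return new FieldOptions(field, maxPassages, BreakOption.SENTENCE);
    }

    /** Field highlighted as a whole, e.g. a title: a single passage. */
    public static FieldOptions fieldOptionsWhole(String field) {
        return new FieldOptions(field, 1, BreakOption.WHOLE);
    }

    /** Renders the options as field:maxPassages:breakOption, comma separated. */
    public static String describe(FieldOptions... options) {
        StringBuilder sb = new StringBuilder();
        for (FieldOptions o : options) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(o.field).append(':').append(o.maxPassages).append(':').append(o.breakOption);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(fieldOptionsWhole("title"), fieldOptions("body", 3)));
    }
}
```

Keeping the field name inside the options object means there is nothing to sort in parallel, which was the motivation for dropping the arrays.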


> UnifiedHighlighter: simplify "maxPassages" input API
> 
>
> Key: LUCENE-7844
> URL: https://issues.apache.org/jira/browse/LUCENE-7844
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Priority: Minor
> Fix For: master (7.0)
>
> Attachments: LUCENE_7844__UH_maxPassages_simplification.patch
>
>
> The "maxPassages" input to the UnifiedHighlighter can be provided as an array 
> to some of the public methods on UnifiedHighlighter.  When it's provided as 
> an array, the index in the array is for the field in a parallel array. I 
> think this is awkward and furthermore it's inconsistent with the way this 
> highlighter customizes things on a by field basis.  Instead, the parameter 
> can be a simple int default (not an array), and then there can be a protected 
> method like {{getMaxPassageCount(String field)}} that returns an Integer 
> which, when non-null, replaces the default value for this field.
> Aside from API simplicity and consistency, this will also remove some 
> annoying parallel array sorting going on.
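
The override hook described in the issue summary could look roughly like the following.  This is a sketch on a hypothetical stand-in class, not the real UnifiedHighlighter; only the method name getMaxPassageCount comes from the description, and everything else is assumed for illustration.

```java
import java.util.Map;

/** Hypothetical stand-in for a highlighter; not the real UnifiedHighlighter API. */
class PassageCountSketch {
  private final int defaultMaxPassages;

  PassageCountSketch(int defaultMaxPassages) {
    this.defaultMaxPassages = defaultMaxPassages;
  }

  /** Override hook: return null to fall back to the default value. */
  protected Integer getMaxPassageCount(String field) {
    return null;
  }

  /** Resolves the effective passage count for one field. */
  final int resolveMaxPassages(String field) {
    Integer override = getMaxPassageCount(field);
    return override != null ? override : defaultMaxPassages;
  }

  /** Convenience factory that answers the hook from a map of per-field values. */
  static PassageCountSketch withOverrides(int defaultMax, Map<String, Integer> overrides) {
    return new PassageCountSketch(defaultMax) {
      @Override
      protected Integer getMaxPassageCount(String field) {
        return overrides.get(field); // null for fields without an override
      }
    };
  }
}
```

The caller passes a single int default, and per-field variation lives in the subclass, which is the by-field customization style the description refers to.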






[jira] [Commented] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-09 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15813306#comment-15813306
 ] 

Timothy M. Rodriguez commented on LUCENE-7620:
--

Me too!

> UnifiedHighlighter: add target character width BreakIterator wrapper
> 
>
> Key: LUCENE-7620
> URL: https://issues.apache.org/jira/browse/LUCENE-7620
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
> Fix For: 6.4
>
> Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, 
> LUCENE_7620_UH_LengthGoalBreakIterator.patch, 
> LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates 
> fragments (aka Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  
> It's useful in its own right and of course it helps users transition to the 
> UH.  I'd like to do it as a wrapper to another BreakIterator -- perhaps a 
> sentence one.  In this way you get back Passages that are a number of 
> sentences so they will look nice instead of breaking mid-way through a 
> sentence.  And you get some control by specifying a target number of 
> characters.  This BreakIterator wouldn't be a general purpose 
> java.text.BreakIterator since it would assume it's called in a manner exactly 
> as the UnifiedHighlighter uses it.  It would probably be compatible with the 
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your 
> BreakIterator config.
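
The idea in the description, sentence-aligned passages with a target character width, can be sketched with the JDK's own java.text.BreakIterator.  This is not the attached LengthGoalBreakIterator patch; the class name LengthGoalSketch and the helper passageEnd are assumptions made up for this example, and it only shows the "extend to the next sentence boundary at or after the goal" rule.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper; illustrates a length goal on top of a sentence BreakIterator. */
final class LengthGoalSketch {
  /**
   * Returns the end offset of a passage that starts at {@code start}: the first
   * sentence boundary at or after {@code start + targetLength}, or the end of
   * the text when the goal lies at or beyond it.
   */
  static int passageEnd(String text, int start, int targetLength) {
    if (start < 0 || start > text.length() || targetLength < 1) {
      throw new IllegalArgumentException("start must be within the text and targetLength >= 1");
    }
    int goal = start + targetLength;
    if (goal >= text.length()) {
      return text.length();
    }
    BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
    sentences.setText(text);
    // following(n) returns the first boundary strictly after n, so asking from
    // goal - 1 yields the first boundary at or after the goal.
    int end = sentences.following(goal - 1);
    return end == BreakIterator.DONE ? text.length() : end;
  }
}
```

Because the result always lands on a boundary reported by the wrapped iterator, a passage may overshoot the target width but never stops mid-sentence.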



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-06 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806094#comment-15806094
 ] 

Timothy M. Rodriguez edited comment on LUCENE-7620 at 1/6/17 11:09 PM:
---

Very useful!  I like that it decorates an underlying BreakIterator.  For the 
following method, does it make sense to return the baseIter if the followingIdx 
< startIndex?  Maybe throw an exception instead or just have an assert that 
it's less?

This is subjective, but I find it's more useful to break out the different 
tests with methods for each condition.  For example: breakAtGoal, 
breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom,  etc. Similar for 
the defaultSummary tests.  This helps when coming back to the test and helps 
tease apart if one piece of functionality is broken vs another.


was (Author: timothy055):
Very useful!  I like that it decorates an underlying BreakIterator.  For the 
following method, does it make sense to return the baseIter if the followingIdx 
< startIndex?  Maybe throw an exception instead or just have an assert that 
it's less?

This is subjective, but I find it's more useful to break out the different 
tests with methods for each condition.  For example: breakAtGoal, 
breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom,  etc. Similar for 
the defaultSummary tests.

> UnifiedHighlighter: add target character width BreakIterator wrapper
> 
>
> Key: LUCENE-7620
> URL: https://issues.apache.org/jira/browse/LUCENE-7620
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
> Fix For: 6.4
>
> Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, 
> LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates 
> fragments (aka Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  
> It's useful in its own right and of course it helps users transition to the 
> UH.  I'd like to do it as a wrapper to another BreakIterator -- perhaps a 
> sentence one.  In this way you get back Passages that are a number of 
> sentences so they will look nice instead of breaking mid-way through a 
> sentence.  And you get some control by specifying a target number of 
> characters.  This BreakIterator wouldn't be a general purpose 
> java.text.BreakIterator since it would assume it's called in a manner exactly 
> as the UnifiedHighlighter uses it.  It would probably be compatible with the 
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your 
> BreakIterator config.






[jira] [Commented] (LUCENE-7620) UnifiedHighlighter: add target character width BreakIterator wrapper

2017-01-06 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806094#comment-15806094
 ] 

Timothy M. Rodriguez commented on LUCENE-7620:
--

Very useful!  I like that it decorates an underlying BreakIterator.  For the 
following method, does it make sense to return the baseIter if the followingIdx 
< startIndex?  Maybe throw an exception instead or just have an assert that 
it's less?

This is subjective, but I find it's more useful to break out the different 
tests with methods for each condition.  For example: breakAtGoal, 
breakLessThanGoal, breakMoreThanGoal, breakGoalPlusRandom,  etc. Similar for 
the defaultSummary tests.

> UnifiedHighlighter: add target character width BreakIterator wrapper
> 
>
> Key: LUCENE-7620
> URL: https://issues.apache.org/jira/browse/LUCENE-7620
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
> Fix For: 6.4
>
> Attachments: LUCENE_7620_UH_LengthGoalBreakIterator.patch, 
> LUCENE_7620_UH_LengthGoalBreakIterator.patch
>
>
> The original Highlighter includes a {{SimpleFragmenter}} that delineates 
> fragments (aka Passages) by a character width.  The default is 100 characters.
> It would be great to support something similar for the UnifiedHighlighter.  
> It's useful in its own right and of course it helps users transition to the 
> UH.  I'd like to do it as a wrapper to another BreakIterator -- perhaps a 
> sentence one.  In this way you get back Passages that are a number of 
> sentences so they will look nice instead of breaking mid-way through a 
> sentence.  And you get some control by specifying a target number of 
> characters.  This BreakIterator wouldn't be a general purpose 
> java.text.BreakIterator since it would assume it's called in a manner exactly 
> as the UnifiedHighlighter uses it.  It would probably be compatible with the 
> PostingsHighlighter too.
> I don't propose doing this by default; besides, it's easy enough to pick your 
> BreakIterator config.






[jira] [Commented] (SOLR-8241) Evaluate W-TinyLfu cache

2017-01-05 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15801792#comment-15801792
 ] 

Timothy M. Rodriguez commented on SOLR-8241:


+1 for this issue.  Solr currently uses caffeine-1.0.1 in its distribution, 
which can cause conflicts if you create any extensions that intend to use the 
new library.

> Evaluate W-TinyLfu cache
> 
>
> Key: SOLR-8241
> URL: https://issues.apache.org/jira/browse/SOLR-8241
> Project: Solr
>  Issue Type: Wish
>  Components: search
>Reporter: Ben Manes
>Priority: Minor
> Attachments: SOLR-8241.patch, SOLR-8241.patch, SOLR-8241.patch, 
> proposal.patch
>
>
> SOLR-2906 introduced an LFU cache and in-progress SOLR-3393 makes it O(1). 
> The discussions seem to indicate that the higher hit rate (vs LRU) is offset 
> by the slower performance of the implementation. An original goal appeared to 
> be to introduce ARC, a patented algorithm that uses ghost entries to retain 
> history information.
> My analysis of Window TinyLfu indicates that it may be a better option. It 
> uses a frequency sketch to compactly estimate an entry's popularity. It uses 
> LRU to capture recency and operate in O(1) time. When using available 
> academic traces the policy provides a near optimal hit rate regardless of the 
> workload.
> I'm getting ready to release the policy in Caffeine, which Solr already has a 
> dependency on. But, the code is fairly straightforward and a port into Solr's 
> caches instead is a pragmatic alternative. More interesting is what the 
> impact would be in Solr's workloads and feedback on the policy's design.
> https://github.com/ben-manes/caffeine/wiki/Efficiency
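
To make the "frequency sketch" in the description concrete, here is a deliberately simplified count-min sketch with a frequency-based admission check.  It is not Caffeine's implementation and not the attached patch: the class FrequencySketchDemo, its hash mixing, and the admit rule are assumptions for illustration, and the sketch leaves out counter aging and the LRU regions that the described policy combines with the frequency estimate.

```java
/** Hypothetical, simplified count-min sketch; not Caffeine's FrequencySketch. */
final class FrequencySketchDemo {
  // One seed per hash row; the values are arbitrary odd-looking constants.
  private static final int[] SEEDS = {0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35, 0x27D4EB2F};

  private final int[][] counters;
  private final int mask;

  FrequencySketchDemo(int width) {
    if (width <= 0 || Integer.bitCount(width) != 1) {
      throw new IllegalArgumentException("width must be a positive power of two");
    }
    this.counters = new int[SEEDS.length][width];
    this.mask = width - 1;
  }

  /** Maps a key to one counter slot in the given row. */
  private int index(Object key, int row) {
    int h = (key.hashCode() ^ SEEDS[row]) * 0x01000193;
    h ^= h >>> 15;
    return h & mask;
  }

  /** Records one access of the key in every row. */
  void increment(Object key) {
    for (int row = 0; row < SEEDS.length; row++) {
      counters[row][index(key, row)]++;
    }
  }

  /** Estimated access count: the minimum across rows, never below the true count. */
  int estimate(Object key) {
    int min = Integer.MAX_VALUE;
    for (int row = 0; row < SEEDS.length; row++) {
      min = Math.min(min, counters[row][index(key, row)]);
    }
    return min;
  }

  /** Admit the candidate only if it looks more popular than the eviction victim. */
  boolean admit(Object candidate, Object victim) {
    return estimate(candidate) > estimate(victim);
  }
}
```

The point of the structure is that popularity is tracked in a fixed-size table rather than per entry, so history can be kept for keys that are not currently cached.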






[jira] [Comment Edited] (LUCENE-7578) UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API

2016-11-30 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709795#comment-15709795
 ] 

Timothy M. Rodriguez edited comment on LUCENE-7578 at 11/30/16 9:15 PM:


Some care would have to be taken with spans, especially with significant slop.  
It's arguably worse to have a single highlight across it.  But otherwise, this 
definitely is a desired improvement.


was (Author: timothy055):
Some care would have to be taken with spans, especially with significant slop.  
It's arguably worse to have a single highlight across it.

> UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API
> -
>
> Key: LUCENE-7578
> URL: https://issues.apache.org/jira/browse/LUCENE-7578
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>
> The PhraseHelper of the UnifiedHighlighter currently collects position-spans 
> per SpanQuery (and it knows which terms are in which SpanQuery), and then it 
> filters PostingsEnum based on that.  It's similar to how the original 
> Highlighter WSTE works.  The main problem with this approach is that it can 
> be inaccurate for some nested span queries -- LUCENE-2287, LUCENE-5455 (has 
> the clearest example), LUCENE-6796.  Non-nested SpanQueries (e.g. that which 
> is converted from a PhraseQuery or MultiPhraseQuery) are _not_ a problem.






[jira] [Commented] (LUCENE-7578) UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API

2016-11-30 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709795#comment-15709795
 ] 

Timothy M. Rodriguez commented on LUCENE-7578:
--

Some care would have to be taken with spans, especially with significant slop.  
It's arguably worse to have a single highlight across it.

> UnifiedHighlighter: Convert PhraseHelper to use SpanCollector API
> -
>
> Key: LUCENE-7578
> URL: https://issues.apache.org/jira/browse/LUCENE-7578
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>
> The PhraseHelper of the UnifiedHighlighter currently collects position-spans 
> per SpanQuery (and it knows which terms are in which SpanQuery), and then it 
> filters PostingsEnum based on that.  It's similar to how the original 
> Highlighter WSTE works.  The main problem with this approach is that it can 
> be inaccurate for some nested span queries -- LUCENE-2287, LUCENE-5455 (has 
> the clearest example), LUCENE-6796.  Non-nested SpanQueries (e.g. that which 
> is converted from a PhraseQuery or MultiPhraseQuery) are _not_ a problem.






[jira] [Commented] (LUCENE-7575) UnifiedHighlighter: add requireFieldMatch=false support

2016-11-30 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15709621#comment-15709621
 ] 

Timothy M. Rodriguez commented on LUCENE-7575:
--

Looks good to me too.  Some additional suggestions:

UnifiedHighlighter:
  * +1 on the suggestion to use HighlightFlags instead.

PhraseHelper:
  * It's clearer in my opinion to change the boolean branch to something like 
{code} if (!requireFieldMatch) {} else {} {code} instead of checking {code} 
requireFieldMatch == false {code}.  Even better would be swapping the branches 
so it's {code}if (requireFieldMatch) {} else {}{code}
  * Similar point for line 287 {code} if (requireFieldMatch && 
fieldName.equals(queryTerm.field()) == false) {} {code}

TestUnifiedHighlighter:
  * I think it'd be clearer to split the cases for term/phrase/multi-term 
queries into separate tests.  This makes it easier to chase bugs down the line 
if only one fails (and provides more information if all three fail).
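
For the PhraseHelper point, the preferred shape is the positive condition first with no comparison against false.  The snippet below is a hypothetical, self-contained illustration; FieldMatchSketch and accepts are invented names, and the logic is only assumed to mirror the check quoted from line 287.

```java
/** Hypothetical illustration of the suggested branch style; not PhraseHelper itself. */
final class FieldMatchSketch {
  /** Returns true when a query term on termField should be used for fieldName. */
  static boolean accepts(boolean requireFieldMatch, String fieldName, String termField) {
    if (requireFieldMatch) {
      // Strict mode: only terms from the highlighted field count.
      return fieldName.equals(termField);
    } else {
      // Loose mode: terms from any field count.
      return true;
    }
  }
}
```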

> UnifiedHighlighter: add requireFieldMatch=false support
> ---
>
> Key: LUCENE-7575
> URL: https://issues.apache.org/jira/browse/LUCENE-7575
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Assignee: David Smiley
> Attachments: LUCENE-7575.patch
>
>
> The UnifiedHighlighter (like the PostingsHighlighter) only supports 
> highlighting queries for the same fields that are being highlighted.  The 
> original Highlighter and FVH support loosening this, AKA 
> requireFieldMatch=false.






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-24 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15693717#comment-15693717
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


Haha, no problem.  It'll improve usability quite a bit to be able to 
dynamically invoke it per request (and the other highlighters).  I'm glad it 
landed with the initial Solr release of the unified highlighter.

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
> Attachments: SOLR-9708.patch
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-23 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691807#comment-15691807
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


Looks great!  Adding the other highlighters to hl.method really fleshed it out.  
Also in favor of the change from "default" to "original".  No further suggested 
changes other than a rename on the FASTVECTOR enum to FAST_VECTOR. Let me know 
if you need any help with the wiki in December.  Would be glad to contribute 
there as well.

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-21 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683730#comment-15683730
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


Added a normalizeParameters method that will set tag.pre or post if simple.pre 
or post are set.
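
A minimal sketch of that kind of normalization, using a plain Map in place of Solr's request parameters.  The class and method names are hypothetical, and the rule that an explicitly supplied hl.tag.* value wins over the hl.simple.* one is an assumption, not something confirmed by the patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for parameter normalization; not the code in the patch. */
final class HighlightParamsSketch {
  /** Returns a copy where hl.simple.pre/post fill in hl.tag.pre/post when those are absent. */
  static Map<String, String> normalize(Map<String, String> params) {
    Map<String, String> out = new HashMap<>(params);
    copyIfAbsent(out, "hl.simple.pre", "hl.tag.pre");
    copyIfAbsent(out, "hl.simple.post", "hl.tag.post");
    return out;
  }

  private static void copyIfAbsent(Map<String, String> params, String from, String to) {
    String value = params.get(from);
    if (value != null) {
      // Assumption: an explicitly supplied target parameter is left untouched.
      params.putIfAbsent(to, value);
    }
  }
}
```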

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-21 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15683698#comment-15683698
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


I've posted an initial commit that allows the user to override the configured 
highlighter based on the "hl.method" parameter.

Two things I want to highlight:

* The highlighter can no longer safely be statically determined using 
HighlightComponent.getHighlighter since a request parameter can override the 
pre-configured one.  I've marked this usage deprecated as it affects quite a 
few places outside of this change.  Is that okay?

* Use of an enum for collecting all the highlight methods, which gives a bit of 
extra type safety when switching over the values in the override.  I'm not sure 
if this is out of style and several static String fields are preferred (although 
I personally prefer the former).
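
For reference, the enum approach could look something like this.  The enum name, its constants, and the parameter values are illustrative assumptions rather than the contents of the commit; only the "hl.method" parameter name comes from the comment above.

```java
/** Hypothetical enum of highlight methods selectable through "hl.method". */
enum HighlightMethodSketch {
  UNIFIED("unified"),
  FAST_VECTOR("fastVector"),
  POSTINGS("postings"),
  ORIGINAL("original");

  private final String paramValue;

  HighlightMethodSketch(String paramValue) {
    this.paramValue = paramValue;
  }

  /** The string a request would send as the parameter value. */
  String paramValue() {
    return paramValue;
  }

  /**
   * Parses a request value, ignoring case and surrounding whitespace.  Returns
   * null for a missing or unknown value so the caller can fall back to the
   * pre-configured highlighter.
   */
  static HighlightMethodSketch parse(String raw) {
    if (raw == null) {
      return null;
    }
    String trimmed = raw.trim();
    for (HighlightMethodSketch method : values()) {
      if (method.paramValue.equalsIgnoreCase(trimmed)) {
        return method;
      }
    }
    return null;
  }
}
```

Compared with several static String constants, a switch over the parsed value lets the compiler and IDE flag a constant that was added but never handled.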

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-15 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668824#comment-15668824
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


I was suggesting an alternative to hl.tag.pre, but realized that one is already 
in use too. No sense adding a third, even though neither name is ideal IMO.

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-15 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668766#comment-15668766
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


I thought the suggestion was to use hl.tag.pre instead of hl.simple.pre?

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-15 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15668715#comment-15668715
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


I'm okay with hl.tag.pre/post, but it may not always be a tag.  Perhaps 
something like hl.pre.marker? or hl.pre.sigil?

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-11-14 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665317#comment-15665317
 ] 

Timothy M. Rodriguez commented on LUCENE-7526:
--

Added the proposed changes.  I'm on the fence about the refactor for 
MultiValueTokenStream; I'd much prefer to get rid of it completely if we could. 
 But for now, having some symmetry between the two impls seems worthwhile to me? 
 I'd like to punt on that one.

> Improvements to UnifiedHighlighter OffsetStrategies
> ---
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>Priority: Minor
> Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
> the memory index and then once consumed a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-14 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15665291#comment-15665291
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


Thanks for catching those things. I've fixed them and pushed to the pr.

Regarding hl.useUnifiedHighlighter, I'm actually very much in favor of that 
idea, but perhaps that logic would be better in the highlight component?  That 
way the actual highlighters would be more like the facet params that help tweak 
which algorithm gets used.  I agree that we really shouldn't have to "configure" 
the highlighters.  Perhaps that should be a separate issue though, more in line 
with the other changes mentioned?

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-13 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662087#comment-15662087
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


Let me know what you think.  If it looks good, I think we can commit it.

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-11-13 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662086#comment-15662086
 ] 

Timothy M. Rodriguez commented on SOLR-9708:


I've pushed tests for the configurable items in the UH as well as for support 
of multiple snippets.  In addition, highlighter-specific logic that was in the 
HighlightComponent has been pushed down into the DefaultSolrHighlighter (thanks 
[~dsmiley] for pointing that out).

> Expose UnifiedHighlighter in Solr
> -
>
> Key: SOLR-9708
> URL: https://issues.apache.org/jira/browse/SOLR-9708
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.4
>
>
> This ticket is for creating a Solr plugin that can utilize the new 
> UnifiedHighlighter which was initially committed in 
> https://issues.apache.org/jira/browse/LUCENE-7438






[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-11-12 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660330#comment-15660330
 ] 

Timothy M. Rodriguez commented on LUCENE-7526:
--

Other than that, I think this code is in good shape for committing.

> Improvements to UnifiedHighlighter OffsetStrategies
> ---
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>Priority: Minor
> Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
> the memory index and then once consumed a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
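
For illustration, the choice between the two analysis strategies described in 
the ticket could be sketched as below.  This is only a sketch under stated 
assumptions, not the code from the patch: the class, enum and parameter names 
are hypothetical, and the only rules taken from the description are that 
TokenStreamOffsetStrategy needs a query that distills down to terms and 
automata, and that it is opt-in because it can hurt passage relevancy.

```java
// Hypothetical sketch; none of these names are the real UnifiedHighlighter API.
final class AnalysisStrategyChooser {

    /** The two analysis-based offset strategies named in the ticket. */
    enum Strategy { MEMORY_INDEX, TOKEN_STREAM }

    /**
     * Picks an analysis strategy.
     *
     * @param onlyTermsAndAutomata true if the query distills down to plain
     *        terms and automata (assumed to be computed elsewhere)
     * @param preferPassageRelevancyOverSpeed true to keep the MemoryIndex path
     *        even when the cheaper TokenStream path would be possible
     */
    static Strategy choose(boolean onlyTermsAndAutomata,
                           boolean preferPassageRelevancyOverSpeed) {
        if (onlyTermsAndAutomata && !preferPassageRelevancyOverSpeed) {
            // Cheaper path: no MemoryIndex is built, possibly at some cost
            // to passage relevancy.
            return Strategy.TOKEN_STREAM;
        }
        // Default path: index the field value into a MemoryIndex and read
        // offsets from it.
        return Strategy.MEMORY_INDEX;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true));
        System.out.println(choose(true, false));
        System.out.println(choose(false, false));
    }
}
```

The point of the sketch is that the TokenStream path is taken only when both 
conditions hold; every other combination falls back to the MemoryIndex.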



[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-11-12 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15660328#comment-15660328
 ] 

Timothy M. Rodriguez commented on LUCENE-7526:
--

I've merged with the changes from LUCENE-7544 and also ran some benchmarks. 
(Thanks [~dsmiley] for the fix on LUCENE-7546!)

Original:

||Impl||Terms||Phrases||Wildcards||
|(search)|1.14|1.43|2.44|
|SH_A|7.36|7.49|16.37|
|UH_A|5.32|4.55|9.24|
|SH_V|4.12|4.42|8.47|
|FVH_V|3.46|2.98|7.13|
|UH_V|3.7|3.45|6.61|
|PH_P|3.76|3.45|9.6|
|UH_P|3.34|2.91|9.33|
|UH_PV|3.26|2.8|6.72|

With improvements from LUCENE-7526:

||Impl||Terms||Phrases||Wildcards||
|(search)|1.18|1.38|2.52|
|SH_A|7.98|7.53|16.62|
|UH_A|5.46|4.6|9.43|
|SH_V|4.13|4.42|8.26|
|FVH_V|3.45|3.05|6.93|
|UH_V|3.79|3.43|6.62|
|PH_P|3.82|3.47|9.4|
|UH_P|3.33|3.03|9.46|
|UH_PV|3.24|2.81|6.92|

If you disable the new option to prefer passage relevancy over speed you'll get 
the following for analysis:

||Impl||Terms||Phrases||Wildcards||
|(search)|1.1|1.43|2.44|
|UH_A|5.31|4.66|9.14|

I wasn't able to get very consistent times with the benchmarks, but it looks 
like the changes keep performance close to the original while simplifying the 
code and improving relevancy in the Analysis case (unless 
preferPassageRelevancyOverSpeed is disabled).  If that option is disabled, the 
timings line up pretty closely with the originals, providing a minor speed 
boost.  There should also be memory savings from avoiding re-creation of 
TokenStreams; that was difficult to measure, but it could prove beneficial 
under memory pressure.

I performed these benchmarks on a machine with the following configuration:

Processor: AMD Phenom II X4 960T 3.0GHz
Memory: 24GB DDR3
Disk: Crucial CT256MX SSD
OS: Windows 10
Java: Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

All versions of the benchmarks incorporated above included the changes from 
LUCENE-7544.
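
To make the comparison between the first two tables easier to read, the 
relative change per column can be computed with a few lines of code.  The 
sketch below is an illustration only: the class and method names are made up 
for this example, and the figures are the UH_A rows copied from the "Original" 
and "With improvements from LUCENE-7526" tables above.

```java
// Illustrative helper; the names are hypothetical, the figures come from the tables above.
final class BenchmarkDelta {

    /** Relative change from before to after, in percent; positive means the timing went up. */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] columns = {"Terms", "Phrases", "Wildcards"};
        // UH_A row of the "Original" table.
        double[] original = {5.32, 4.55, 9.24};
        // UH_A row of the "With improvements from LUCENE-7526" table.
        double[] patched = {5.46, 4.60, 9.43};

        for (int i = 0; i < columns.length; i++) {
            System.out.printf("UH_A %-9s %+.1f%%%n",
                    columns[i], percentChange(original[i], patched[i]));
        }
    }
}
```

For UH_A the differences come out at roughly 1 to 3 percent per column, which 
is small next to the run-to-run variation mentioned above.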

[~dsmiley] It looks like my older processor took significantly longer to 
highlight across the board than in your initial run for LUCENE-7438.  I'd be 
curious how this set of changes performs on your machine now.



[jira] [Commented] (LUCENE-7546) Rename uses of people.apache.org to home.apache.org

2016-11-08 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649031#comment-15649031
 ] 

Timothy M. Rodriguez commented on LUCENE-7546:
--

I like the idea of dist.apache.org as an alternative source.  That should 
definitely be more stable than web space tied to an individual account.

> Rename uses of people.apache.org to home.apache.org
> ---
>
> Key: LUCENE-7546
> URL: https://issues.apache.org/jira/browse/LUCENE-7546
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: David Smiley
>
> The people.apache.org server was replaced by a different server 
> home.apache.org officially last year, and it appears to have completed 
> sometime this year.  DNS for both points to the same machine but we should 
> reference home.apache.org now.  *Unfortunately, some data was large enough 
> that ASF Infra didn't automatically move it, leaving that up to the 
> individuals to do.  I think any data that hasn't been moved by now might be 
> gone.*
> Here's a useful reference to this: EMPIREDB-234   The second part of that 
> issue also informs us that RC artifacts don't belong on home.apache.org; 
> there is https://dist.apache.org/repos/dist/dev/ for that.  6.3 was 
> done the right way... yet I see references to using people.apache.org in the 
> build for RCs.



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-10-31 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623451#comment-15623451
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Not yet, we have an initial general implementation, but it's lacking tests.  
(We have a customized extension internally that does have tests.)  I've created 
a new ticket https://issues.apache.org/jira/browse/SOLR-9708 with a PR 
containing the initial impl so folks can follow or help the work towards 
finishing it up.  Thanks for asking though, hopefully this gets the ball 
rolling faster.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
> Fix For: 6.3
>
> Attachments: LUCENE-7438.patch, LUCENE_7438_UH_benchmark.patch, 
> LUCENE_7438_UH_benchmark.patch, LUCENE_7438_UH_small_changes.patch
>
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.



[jira] [Created] (SOLR-9708) Expose UnifiedHighlighter in Solr

2016-10-31 Thread Timothy M. Rodriguez (JIRA)
Timothy M. Rodriguez created SOLR-9708:
--

 Summary: Expose UnifiedHighlighter in Solr
 Key: SOLR-9708
 URL: https://issues.apache.org/jira/browse/SOLR-9708
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: highlighter
Reporter: Timothy M. Rodriguez
Priority: Minor


This ticket is for creating a Solr plugin that can utilize the new 
UnifiedHighlighter which was initially committed in 
https://issues.apache.org/jira/browse/LUCENE-7438



[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-10-28 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616294#comment-15616294
 ] 

Timothy M. Rodriguez commented on LUCENE-7526:
--

Thanks [~dsmiley] :).  I've just submitted the pull request.  You're right, this 
only removes an additional use of token streams.  In the case of the Analysis 
strategies, a TokenStream is still necessary, at least initially, to analyze the 
field.  I'm glad I got to work on this during the wonderful Boston Hackday 
event (https://github.com/flaxsearch/london-hackday-2016).  Thanks [~dsmiley] 
for some tips while there and [~mbraun688] for some initial feedback on the PR.


[jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-10-27 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612996#comment-15612996
 ] 

Timothy M. Rodriguez commented on LUCENE-7526:
--

Pull request forthcoming - I had some more merging work to do with master than 
I anticipated!

> Improvements to UnifiedHighlighter OffsetStrategies
> ---
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Timothy M. Rodriguez
>Priority: Minor
>  Labels: highlighter, unified-highlighter
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
> the memory index and then once consumed a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy



[jira] [Created] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-10-27 Thread Timothy M. Rodriguez (JIRA)
Timothy M. Rodriguez created LUCENE-7526:


 Summary: Improvements to UnifiedHighlighter OffsetStrategies
 Key: LUCENE-7526
 URL: https://issues.apache.org/jira/browse/LUCENE-7526
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Timothy M. Rodriguez
Priority: Minor


This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by 
reducing reliance on creating or re-creating TokenStreams.

The primary changes are as follows:

* AnalysisOffsetStrategy - split into two offset strategies
  * MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
MemoryIndex for producing Offsets
  * TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
MemoryIndex.  Can only be used if the query distills down to terms and automata.

* TokenStream removal 
  * MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
the memory index and then once consumed a new one was generated by uninverting 
the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq 
queries) involved.  Now this is avoided, which should save memory and avoid a 
second pass over the data.
  * TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
generating a TokenStream if automata are involved.
  * PostingsWithTermVectorsOffsetStrategy - similar refactoring

* CompositePostingsEnum - aggregates several underlying PostingsEnums for 
wildcard/mtq queries.  This should improve relevancy by providing unified 
metrics for a wildcard across all its term matches

* Added a HighlightFlag for enabling the newly separated 
TokenStreamOffsetStrategy since it can adversely affect passage relevancy
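
The idea behind CompositePostingsEnum can be sketched without any Lucene 
classes.  The code below is a simplified illustration, not the implementation 
from the patch: it assumes each term matched by a wildcard (for example the 
terms matched by a hypothetical purchas*) contributes a sorted array of start 
offsets, merges them into one ordered stream with a priority queue, and 
reports the summed frequency.  All class and method names are invented for the 
example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Simplified stand-in for the idea of a composite enum; not the real Lucene class.
final class CompositeOffsets {

    /** Cursor over the sorted start offsets of one matched term. */
    private static final class Cursor {
        final int[] offsets;
        int pos = 0;

        Cursor(int[] offsets) {
            this.offsets = offsets;
        }

        int current() {
            return offsets[pos];
        }
    }

    /** Summed frequency: the wildcard is treated as a single term. */
    static int totalFreq(List<int[]> perTermOffsets) {
        int freq = 0;
        for (int[] offsets : perTermOffsets) {
            freq += offsets.length;
        }
        return freq;
    }

    /** Merges the per-term sorted offsets into one sorted list. */
    static List<Integer> merge(List<int[]> perTermOffsets) {
        // The queue always exposes the cursor with the smallest current offset.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                (a, b) -> Integer.compare(a.current(), b.current()));
        for (int[] offsets : perTermOffsets) {
            if (offsets.length > 0) {
                queue.add(new Cursor(offsets));
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            merged.add(top.current());
            top.pos++;
            if (top.pos < top.offsets.length) {
                queue.add(top); // re-insert with its next offset as the key
            }
        }
        return merged;
    }
}
```

Treating the wildcard as one term with a summed frequency is what lets a 
passage scorer see unified metrics instead of one small count per matched term.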



[jira] [Updated] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies

2016-10-27 Thread Timothy M. Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy M. Rodriguez updated LUCENE-7526:
-
Description: 
This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by 
reducing reliance on creating or re-creating TokenStreams.

The primary changes are as follows:

* AnalysisOffsetStrategy - split into two offset strategies
  ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
MemoryIndex for producing Offsets
  ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
MemoryIndex.  Can only be used if the query distills down to terms and automata.

* TokenStream removal 
  ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
the memory index and then once consumed a new one was generated by uninverting 
the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq 
queries) involved.  Now this is avoided, which should save memory and avoid a 
second pass over the data.
  ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
generating a TokenStream if automata are involved.
  ** PostingsWithTermVectorsOffsetStrategy - similar refactoring

* CompositePostingsEnum - aggregates several underlying PostingsEnums for 
wildcard/mtq queries.  This should improve relevancy by providing unified 
metrics for a wildcard across all its term matches

* Added a HighlightFlag for enabling the newly separated 
TokenStreamOffsetStrategy since it can adversely affect passage relevancy

  was:
This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by 
reducing reliance on creating or re-creating TokenStreams.

The primary changes are as follows:

* AnalysisOffsetStrategy - split into two offset strategies
  * MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
MemoryIndex for producing Offsets
  * TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
MemoryIndex.  Can only be used if the query distills down to terms and automata.

* TokenStream removal 
  * MemoryIndexOffsetStrategy - previously a TokenStream was created to fill 
the memory index and then once consumed a new one was generated by uninverting 
the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq 
queries) involved.  Now this is avoided, which should save memory and avoid a 
second pass over the data.
  * TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
generating a TokenStream if automata are involved.
  * PostingsWithTermVectorsOffsetStrategy - similar refactoring

* CompositePostingsEnum - aggregates several underlying PostingsEnums for 
wildcard/mtq queries.  This should improve relevancy by providing unified 
metrics for a wildcard across all its term matches

* Added a HighlightFlag for enabling the newly separated 
TokenStreamOffsetStrategy since it can adversely affect passage relevancy




[jira] [Commented] (SOLR-9542) Kerberos delegation tokens requires missing Jackson library

2016-09-30 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537247#comment-15537247
 ] 

Timothy M. Rodriguez commented on SOLR-9542:


I'm not sure it makes sense to introduce a Jackson dependency here, though I'm 
conflicted on how big of an issue this is.  It's a really old version of 
Jackson, since it depends on the org.codehaus version.  On the other hand, for 
that same reason it's probably less likely to conflict.
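
A quick way to confirm whether the old Jackson classes are visible is to probe 
for the class named in the stack trace below.  This is a minimal sketch with 
hypothetical class and method names; it only checks the class loader that 
loaded the probe itself, so inside Solr the result depends on where the code 
runs.

```java
// Diagnostic sketch; ClasspathCheck and isPresent are names made up for this example.
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's loader. */
    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class reported missing in the NoClassDefFoundError quoted in the issue.
        String legacyJackson = "org.codehaus.jackson.map.ObjectMapper";
        System.out.println(legacyJackson + " present: " + isPresent(legacyJackson));
    }
}
```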

> Kerberos delegation tokens requires missing Jackson library
> ---
>
> Key: SOLR-9542
> URL: https://issues.apache.org/jira/browse/SOLR-9542
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Ishan Chattopadhyaya
> Attachments: SOLR-9542.patch
>
>
> GET, RENEW or CANCEL operations for the delegation tokens support require 
> the Solr server to have old jackson added as a dependency.
> Steps to reproduce the problem:
> 1) Configure Solr to use delegation tokens
> 2) Start Solr
> 3) Use a SolrJ application to get a delegation token.
> The server throws the following:
> {code}
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
> at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationHandler.managementOperation(DelegationTokenAuthenticationHandler.java:279)
> at org.apache.solr.security.KerberosPlugin$RequestContinuesRecorderAuthenticationHandler.managementOperation(KerberosPlugin.java:566)
> at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:514)
> at org.apache.solr.security.DelegationTokenKerberosFilter.doFilter(DelegationTokenKerberosFilter.java:123)
> at org.apache.solr.security.KerberosPlugin.doAuthenticate(KerberosPlugin.java:265)
> at org.apache.solr.servlet.SolrDispatchFilter.authenticateRequest(SolrDispatchFilter.java:318)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:518)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> at java.lang.Thread.run(Thread.java:745)
> {code}



[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-30 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537117#comment-15537117
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

After further consideration, it seems best to keep some of the classes common 
to the Postings and Unified highlighters separate.  If we were to use the 
same classes, they'd ideally move to a common sub-package that both could 
share, and this would introduce unneeded change and hurt potential 
compatibility for any users of those classes.  Keeping them separate also 
allows for a possible improvement to the method highlightFieldsAsObjects, which 
internally creates a Map that is promptly thrown away again in the highlight 
methods.  I briefly investigated changing this to return the internal 
Object[][] array and avoid the extra Map allocation, but this creates some 
awkwardness since the input fields are sorted before the Object[][] array is 
filled, which would make the API somewhat of a trap for callers.  This 
undesired behavior is likely why the map is being created.  One way to fix this 
is to generify PassageFormatter over its output type, which would allow for a 
PassageFormatter<String> in the case of the DefaultPassageFormatter.  However, 
this is a rather involved change that could ultimately result in the 
UnifiedHighlighter itself having a generic type, and it was not clear that 
muddying the waters with that right now was a good idea.  Keeping these classes 
separate will allow for an attempt at that in the future.
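
The ordering trap mentioned above can be shown with a small stand-alone 
example.  This is a sketch with hypothetical names and placeholder snippet 
strings, not the highlighter's actual code: the only behavior taken from the 
comment is that field names are sorted before the result rows are filled, so 
row i no longer corresponds to the i-th field the caller passed in, whereas a 
Map keyed by field name hides that detail.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; these methods are not part of the highlighter API.
final class SortedFieldsTrap {

    /** Mimics an internal routine that sorts field names before filling the rows. */
    static String[][] highlightSorted(String[] fields) {
        String[] sorted = fields.clone();
        Arrays.sort(sorted);
        String[][] rows = new String[sorted.length][];
        for (int i = 0; i < sorted.length; i++) {
            // Placeholder content standing in for real highlighted passages.
            rows[i] = new String[] {"snippet for " + sorted[i]};
        }
        return rows;
    }

    /** Keyed view of the same rows, so callers never depend on row order. */
    static Map<String, String[]> highlightAsMap(String[] fields) {
        String[] sorted = fields.clone();
        Arrays.sort(sorted);
        String[][] rows = highlightSorted(fields);
        Map<String, String[]> byField = new LinkedHashMap<>();
        for (int i = 0; i < sorted.length; i++) {
            byField.put(sorted[i], rows[i]);
        }
        return byField;
    }

    public static void main(String[] args) {
        String[] fields = {"title", "body"};
        // A caller indexing by input position would read the "body" row here.
        System.out.println(highlightSorted(fields)[0][0]);
        // The keyed lookup returns the row for the requested field.
        System.out.println(highlightAsMap(fields).get("title")[0]);
    }
}
```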

In the meantime, I've also pushed a commit to reduce the visibility of 
MultiTermHighlighting to package-private.  As it stands, I think this patch 
is ready.


[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15474731#comment-15474731
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Actually, I think passage relevancy might be something we'd look into in more 
detail down the line.  Some of the things in LUCENE-4909 could definitely be 
useful. :)  I see merit in keeping things separate to allow for flexibility.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> analysis (a TokenStream). Lucene’s existing highlighters are mostly 
> demarcated along offset source lines, whereas here it is unified -- hence 
> this proposed name. In this highlighter, the offset source strategy is 
> separated from the core highlighting functionality. The UnifiedHighlighter 
> further improves on the PostingsHighlighter’s design by supporting accurate 
> phrase highlighting using an approach similar to the standard highlighter’s 
> WeightedSpanTermExtractor. The next major improvement is a hybrid offset 
> source strategy that utilizes postings and “light” term vectors (i.e. just the 
> terms) for highlighting multi-term queries (wildcards) without resorting to 
> analysis. Phrase highlighting and wildcard highlighting can both be disabled 
> if you’d rather highlight a little faster albeit not as accurately reflecting 
> the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the 
> other highlighters and the results were exciting! It’s tempting to share 
> those results but it’s definitely due for another benchmark, so we’ll work on 
> that. Performance was the main motivator for creating the UnifiedHighlighter, 
> as the standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
> requirements) wasn’t fast enough, even with term vectors along with several 
> improvements we contributed back, and even after we forked it to highlight in 
> multiple threads.






[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-08 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15473896#comment-15473896
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

I'm not a fan of forking classes with a Uxyz naming scheme.  I think it'd be 
better to make the existing classes re-usable or keep the current naming 
scheme.  That being said, if we make the existing classes re-usable, it might 
be better to plan on moving them into some common package later on so it's 
clearer that they are re-used.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley






[jira] [Comment Edited] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470904#comment-15470904
 ] 

Timothy M. Rodriguez edited comment on LUCENE-7438 at 9/7/16 7:31 PM:
--

Pull request here: https://github.com/apache/lucene-solr/pull/79

I'd also like to specially acknowledge [~dsmiley] who has worked with us 
closely.  He did the lion's share of the work represented here. (Including the 
genesis of the idea for unifying the disparate highlighters.)


was (Author: timothy055):
Pull request here: https://github.com/apache/lucene-solr/pull/79

I'd also like to specially acknowledge [~dsmiley] who has worked with us 
closely.  He did a very significant share of the work represented here. 
(Including the genesis of the idea for unifying the disparate highlighters.)

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley






[jira] [Comment Edited] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470904#comment-15470904
 ] 

Timothy M. Rodriguez edited comment on LUCENE-7438 at 9/7/16 3:49 PM:
--

Pull request here: https://github.com/apache/lucene-solr/pull/79

I'd also like to specially acknowledge [~dsmiley] who has worked with us 
closely.  He did a very significant share of the work represented here. 
(Including the genesis of the idea for unifying the disparate highlighters.)


was (Author: timothy055):
Pull request here: https://github.com/apache/lucene-solr/pull/79

I'd also like to specially acknowledge [~dsmiley] who has worked with us 
closely.  He did a very significant share of the work represented here.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley






[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470904#comment-15470904
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Pull request here: https://github.com/apache/lucene-solr/pull/79

I'd also like to specially acknowledge [~dsmiley] who has worked with us 
closely.  He did a very significant share of the work represented here.

> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>Assignee: David Smiley






[jira] [Commented] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470848#comment-15470848
 ] 

Timothy M. Rodriguez commented on LUCENE-7438:
--

Some additional information:

h2. Missing features & possible future improvements:
Despite the offset source flexibility and accuracy options of this highlighter, 
it continues to be the case that some highlighters have unique features.  The 
following features are in the standard Highlighter (and possibly 
FastVectorHighlighter) but are not in the UnifiedHighlighter (and thus not 
PostingsHighlighter either since UH is derived from PH):
* Being able to disable “requireFieldMatch” to thus highlight a query 
insensitive to whatever fields are mentioned in the query.
* Using boosts in the query to weight passages.
* Regex-based passage delineation, though I’m unsure if anyone cares given the 
existing BreakIterator options available.
Aside from addressing the feature gaps listed above, there are a couple of known 
things that would be nice to add:
* The phrase highlighting (implemented by PhraseHelper) could be made more 
accurate, and probably faster too, by using techniques in Alan’s Luwak system 
that uses the Lucene SpanCollector API introduced in Lucene 5.3. It wasn’t done 
this way to begin with because this highlighter was developed originally for 
Lucene 4.10.
* Wildcard queries usually use TokenStreamFromTermVector, which uninverts the 
terms out of a Terms index.  Instead, we now think it would be better to create 
a PostingsEnum for each matching term. This would bring about some 
simplifications and efficiencies, and can lead to better passage relevancy. A 
bonus would be aggregating terms matching the same automaton into a merged 
PostingsEnum that has a freq() based on the sum of the underlying matching 
terms.
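
To illustrate the last point, here is a minimal sketch of the merged-frequency idea. It is plain Java and deliberately does not use Lucene's PostingsEnum; the class MergedWildcardOffsets, its merge method, and the use of a java.util.regex.Pattern in place of an automaton are all hypothetical names and simplifications for illustration only. Each term maps to its {start, end} offsets in one document, and the merged view reports a freq equal to the sum over all matching terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Simplified stand-in for a merged postings view; NOT Lucene's PostingsEnum. */
final class MergedWildcardOffsets {
    /** Sum of the per-term frequencies of every term the pattern matched. */
    final int freq;
    /** {startOffset, endOffset} pairs from all matching terms, ordered by start. */
    final List<int[]> offsets;

    private MergedWildcardOffsets(List<int[]> offsets) {
        this.freq = offsets.size();
        this.offsets = offsets;
    }

    /**
     * Merges the offsets of every term matched by the pattern.
     * The Pattern is an assumption standing in for a compiled automaton.
     */
    static MergedWildcardOffsets merge(Map<String, List<int[]>> offsetsByTerm, Pattern wildcard) {
        List<int[]> merged = new ArrayList<>();
        for (Map.Entry<String, List<int[]>> entry : offsetsByTerm.entrySet()) {
            if (wildcard.matcher(entry.getKey()).matches()) {
                merged.addAll(entry.getValue());
            }
        }
        // Keep occurrences in document order so a passage scorer sees them sequentially.
        merged.sort((a, b) -> Integer.compare(a[0], b[0]));
        return new MergedWildcardOffsets(merged);
    }
}
```

With this shape, a query like purchas* that matches both "purchase" and "purchasing" contributes one combined frequency to passage scoring rather than one small frequency per expanded term.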

h2. Changes from the PostingsHighlighter 
* The UH is more stateful
** Holds the IndexSearcher instead of asking most methods to pass it through.
** Options now have simple setters, and the per-field getters return these. 
This means the common case of a setting being non-specific to a field doesn’t 
require subclassing.
* Multi-valued field handling is improved to ensure that a passage will never 
span across values, plus it honors the positionIncrementGap for an analyzed 
offset source. See MultiValueTokenStream and SplittingBreakIterator.
* The PH caches all content to be highlighted for all docs and then highlights 
it all.  The UH puts a limit on this, which led to a batching approach.  However, 
if all fields use an Analyzer, or if more than one uses term vectors, then 
highlighting instead happens one doc at a time, since the up-front content 
caching is not helpful.
* No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one 
doc to the next. Dropping this really simplified some code; the re-use didn’t 
seem worth it.
* MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it 
to guard against ramifications of exceptions being thrown during highlighting 
(e.g. a BreakIterator bug or TokenStream bug). Nasty to debug!
* (from standard Highlighter) TokenStreamFromTermVector: optimizations to 
uninvert filtered (thus sparse) Terms.
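
As a rough illustration of the multi-valued point above (a passage never spans across values), here is a small sketch using only java.text.BreakIterator. It is not the SplittingBreakIterator from the patch; the class ValueBoundedPassages, the passages method, and the choice of separator character are assumptions made for this example. Values are assumed to be concatenated with a single separator character, and sentence boundaries are computed within each value separately.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: sentence passages that never cross a value separator. */
final class ValueBoundedPassages {
    /** Returns {start, end} offsets into content; no passage contains the separator. */
    static List<int[]> passages(String content, char separator) {
        List<int[]> result = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        int valueStart = 0;
        while (valueStart <= content.length()) {
            int sep = content.indexOf(separator, valueStart);
            int valueEnd = (sep < 0) ? content.length() : sep;
            // Run the sentence iterator over one value only, then shift offsets back.
            sentences.setText(content.substring(valueStart, valueEnd));
            int start = sentences.first();
            for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
                result.add(new int[] {valueStart + start, valueStart + end});
                start = end;
            }
            valueStart = valueEnd + 1; // skip the separator itself
        }
        return result;
    }
}
```

Because each value is segmented on its own, the separator always falls between two passages, which is the property the list item describes.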

h2. Non-Core Dependencies
* MemoryIndex: For Analyzer based highlighting when phrases need to be 
highlighted accurately.
* Standard Highlighter things:
** TokenStreamFromTermVector: For most multi-term queries. The UH actually has 
its own derived copy that has been optimized to handle filtered (thus sparse) 
Terms. With further work, we could switch to a different approach and remove it 
(as indicated earlier).  For as long as it stays, it’s also possible to replace 
the existing one with this if we want to do that.
** WeightedSpanTermExtractor: For highlighting phrases accurately, to re-use 
its SpanQuery conversion and rewrite-detecting abilities.  Perhaps these parts 
of WSTE could move to general SpanQuery utilities.
** TermVectorLeafReader: When highlighting offsets from term vectors.
* PostingsHighlighter things:
** Technically nothing; however, it has multiple copies of some things that have 
not been modified: Passage, PassageScorer, PassageFormatter, 
DefaultPassageFormatter.
** Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: 
WholeBreakIterator, CustomSeparatorBreakIterator.  Maybe they should move to a 
utils package that isn’t in any of these highlighters?


> UnifiedHighlighter
> --
>
> Key: LUCENE-7438
> URL: https://issues.apache.org/jira/browse/LUCENE-7438
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 6.2
>Reporter: Timothy M. Rodriguez
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is 
> able to highlight using offsets in either postings, term vectors, or from 
> 

[jira] [Created] (LUCENE-7438) UnifiedHighlighter

2016-09-07 Thread Timothy M. Rodriguez (JIRA)
Timothy M. Rodriguez created LUCENE-7438:


 Summary: UnifiedHighlighter
 Key: LUCENE-7438
 URL: https://issues.apache.org/jira/browse/LUCENE-7438
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 6.2
Reporter: Timothy M. Rodriguez


The UnifiedHighlighter is an evolution of the PostingsHighlighter that is able 
to highlight using offsets in either postings, term vectors, or from analysis 
(a TokenStream). Lucene’s existing highlighters are mostly demarcated along 
offset source lines, whereas here it is unified -- hence this proposed name. In 
this highlighter, the offset source strategy is separated from the core 
highlighting functionality. The UnifiedHighlighter further improves on the 
PostingsHighlighter’s design by supporting accurate phrase highlighting using 
an approach similar to the standard highlighter’s WeightedSpanTermExtractor. 
The next major improvement is a hybrid offset source strategy that utilizes 
postings and “light” term vectors (i.e. just the terms) for highlighting 
multi-term queries (wildcards) without resorting to analysis. Phrase 
highlighting and wildcard highlighting can both be disabled if you’d rather 
highlight a little faster albeit not as accurately reflecting the query.
We’ve benchmarked an earlier version of this highlighter comparing it to the 
other highlighters and the results were exciting! It’s tempting to share those 
results but it’s definitely due for another benchmark, so we’ll work on that. 
Performance was the main motivator for creating the UnifiedHighlighter, as the 
standard Highlighter (the only one meeting Bloomberg Law’s accuracy 
requirements) wasn’t fast enough, even with term vectors along with several 
improvements we contributed back, and even after we forked it to highlight in 
multiple threads.
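
The offset source choice described above could be sketched roughly as follows. This is a plain-Java illustration, not code from the patch: the enum OffsetSourceChoice, the class OffsetSourceSelector, and the four boolean inputs are hypothetical names, and the decision order is an assumption inferred from the description (postings first, "light" term vectors to support wildcards, analysis as the fallback).

```java
/** Hypothetical names; the actual patch may model this differently. */
enum OffsetSourceChoice { POSTINGS, POSTINGS_WITH_LIGHT_TERM_VECTORS, TERM_VECTORS, ANALYSIS }

final class OffsetSourceSelector {
    /**
     * Picks where highlight offsets come from for one field and query.
     * Assumed inputs: whether postings store offsets, whether term vectors
     * exist at all, whether they store offsets, and whether the query
     * contains multi-term (wildcard) clauses.
     */
    static OffsetSourceChoice choose(boolean postingsHaveOffsets,
                                     boolean hasTermVectors,
                                     boolean termVectorsHaveOffsets,
                                     boolean queryHasWildcards) {
        if (postingsHaveOffsets) {
            if (!queryHasWildcards) {
                return OffsetSourceChoice.POSTINGS;
            }
            // Wildcards need the document's terms; "light" term vectors supply
            // them so that re-analysis can be avoided.
            return hasTermVectors
                ? OffsetSourceChoice.POSTINGS_WITH_LIGHT_TERM_VECTORS
                : OffsetSourceChoice.ANALYSIS;
        }
        if (hasTermVectors && termVectorsHaveOffsets) {
            return OffsetSourceChoice.TERM_VECTORS;
        }
        // Nothing indexed carries offsets: re-analyze the stored text.
        return OffsetSourceChoice.ANALYSIS;
    }
}
```

Keeping this decision in one place is one way to read "the offset source strategy is separated from the core highlighting functionality": the passage-building code does not need to know which source was chosen.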


