Thanks David! That's almost exactly what I ended up doing. I don't mind casting Object to my own type; you can always make it a covariant override in your subclass (which you have to do to access those expert-level methods anyway).
I still kind of think startOffset/endOffset and other related methods could be made public to allow tinkering with them in FieldHighlighter#highlightOffsetsEnums (otherwise this method is protected for overriding, but useless in practice). There is another API problem I found too. If you wish to override FieldHighlighter.getSummaryPassagesNoHighlight you can't return anything sensible because Passage is final, contains only package-private fields and addMatch is package-private too. So you can't create a "custom" passage. I can file an issue and provide a patch if these changes are not against the design of the unified highlighter? Dawid On Wed, Jan 11, 2017 at 2:24 PM, David Smiley <[email protected]> wrote: > Hi Dawid, > > You could write a trivial PassageFormatter that simply returns the Passage > list instead of doing formatting. Passages contain offsets. And yes, > WholeBreakIterator if you don't need passage fragmentation. Unless I'm > missing some aspect of your requirements, this doesn't involve any internal > highlighter customizing. Perhaps Javadocs could be improved to make this > more clear... and perhaps this Passage-returning PassageFormatter could be > included to clarify how it's done. I recall doing or seeing this recently > months ago but I'm not sure. > > One ugly aspect of the API (shared with it's PostingsHighlighter lineage) > related to this discussion is that the PassageFormatter is declared to > return Object. It's kinda hard to rectify it to be typed, perhaps with > generics, while also not spilling lots of generics to other places (the UH > itself) just because of this. Perhaps UH.highlightFieldsAsObjects() could > be modified to take a Class to thus provide a type for the output... and > maybe the PassageFormatter could declare not only with generics but with a > method what types of results it produces. I'm curious what you think. > > ~ David > > > On Wed, Jan 11, 2017 at 6:02 AM Dawid Weiss <[email protected]> wrote: >> >> To follow-up: I hacked into the offsets by passing WholeBreakIterator >> and a custom PassageFormatter that just returns the matches from the >> singleton resulting passage. This is suboptimal though, as there's >> still some complex logic going on in highlightOffsetsEnums that could >> be avoided. >> >> Dawid >> >> On Wed, Jan 11, 2017 at 11:34 AM, Dawid Weiss <[email protected]> >> wrote: >> > Can any of the folks who contributed to UnifiedHighlighter (David?) >> > clarify my thinking here? >> > >> > I have a requirement to extract (for a set of search results) a list >> > of exact "hit" ranges (field offsets, with support for multi-term >> > queries and span queries). Obviously, I'm only talking about queries >> > that relate to field content somehow, but this has always been quite >> > problematic and required the use of multiple helper classes >> > (WeightedSpanTermExtractor, MultiTermHighlighting, etc.) and pretty >> > hairy logic. >> > >> > So I turned to look at UnifiedHighlighter for help. >> > >> > Seems like the right way (?) to do it would be to override (and abuse) >> > UnifiedHighlighter's getFieldHighlighter method and return a field >> > highlighter with an override of: >> > >> > protected Passage[] highlightOffsetsEnums(List<OffsetsEnum> >> > offsetsEnums) throws IOException { >> > >> > so that I can capture and return a separate Passage for each >> > OffsetsEnum (I have my own code to deal with overlaps and merging, so >> > I can skip this entirely). Then, with a custom no-op PassageFormatter >> > I could simply get a list of those offsets. >> > >> > The problem with this approach is that there is currently no way to >> > access offsets in OffsetsEnum -- everything is protected (so >> > subclassable), but OffsetsEnum are closed to package-private scope. >> > Namely these two: >> > >> > int startOffset() throws IOException { >> > return postingsEnum.startOffset(); >> > } >> > >> > int endOffset() throws IOException { >> > return postingsEnum.endOffset(); >> > } >> > >> > Should these two be protected to allow such customizations (I agree >> > it's *very* low-level, but I have a practical use case where this >> > would be useful). >> > >> > Am I on the right track here? >> > >> > Separately from that, I think it'd be nice to have some sort of >> > generic utility that, for a given document (or a set of documents) >> > would return such hit ranges... UnifiedHighlighter seems >> > >> > Dawid > > -- > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker > LinkedIn: http://linkedin.com/in/davidwsmiley | Book: > http://www.solrenterprisesearchserver.com --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
