Dawid, I'm guessing what you're seeing is from browsing the 6.3 code. The extensibility has been improved and committed for 6.4; see CHANGES.txt and LUCENE-7559, which made the change. In particular, all Passage methods are now public.

I agree that OffsetsEnum methods should be public so that someone could override FieldHighlighter#highlightOffsetsEnums usefully. This is an oversight; good catch! We should further enhance TestUnifiedHighlighterExtensibility to help us check for this. I'll file an issue. Come to think of it... one could argue LUCENE-7559 isn't really done, as its scope should have included the OffsetsEnum methods.

*Jim:* can I change some visibility there to get this into 6.4 as part of the same issue? Very low risk, of course. If not, no big deal.

~ David

On Wed, Jan 11, 2017 at 8:37 AM Dawid Weiss <[email protected]> wrote:

> Thanks David!
>
> That's almost exactly what I ended up doing. I don't mind casting
> Object to my own type; you can always make it a covariant override in
> your subclass (which you have to do to access those expert-level
> methods anyway).
>
> I still kind of think startOffset/endOffset and other related methods
> could be made public to allow tinkering with them in
> FieldHighlighter#highlightOffsetsEnums (otherwise this method is
> protected for overriding, but useless in practice).
>
> There is another API problem I found too. If you wish to override
> FieldHighlighter.getSummaryPassagesNoHighlight you can't return
> anything sensible because Passage is final, contains only
> package-private fields and addMatch is package-private too. So you
> can't create a "custom" passage.
>
> I can file an issue and provide a patch if these changes are not
> against the design of the unified highlighter?
>
> Dawid
>
> On Wed, Jan 11, 2017 at 2:24 PM, David Smiley <[email protected]>
> wrote:
> > Hi Dawid,
> >
> > You could write a trivial PassageFormatter that simply returns the
> > Passage list instead of doing formatting. Passages contain offsets.
> > And yes, WholeBreakIterator if you don't need passage fragmentation.
> > Unless I'm missing some aspect of your requirements, this doesn't
> > involve any internal highlighter customizing. Perhaps Javadocs could
> > be improved to make this more clear... and perhaps this
> > Passage-returning PassageFormatter could be included to clarify how
> > it's done. I recall doing or seeing this in recent months, but I'm
> > not sure.
> >
> > One ugly aspect of the API (shared with its PostingsHighlighter
> > lineage) related to this discussion is that the PassageFormatter is
> > declared to return Object. It's kinda hard to rectify it to be typed,
> > perhaps with generics, while also not spilling lots of generics to
> > other places (the UH itself) just because of this. Perhaps
> > UH.highlightFieldsAsObjects() could be modified to take a Class to
> > thus provide a type for the output... and maybe the PassageFormatter
> > could declare not only with generics but with a method what types of
> > results it produces. I'm curious what you think.
> >
> > ~ David
> >
> > On Wed, Jan 11, 2017 at 6:02 AM Dawid Weiss <[email protected]>
> > wrote:
> >>
> >> To follow up: I hacked into the offsets by passing WholeBreakIterator
> >> and a custom PassageFormatter that just returns the matches from the
> >> singleton resulting passage. This is suboptimal though, as there's
> >> still some complex logic going on in highlightOffsetsEnums that could
> >> be avoided.
> >>
> >> Dawid
> >>
> >> On Wed, Jan 11, 2017 at 11:34 AM, Dawid Weiss <[email protected]>
> >> wrote:
> >> > Can any of the folks who contributed to UnifiedHighlighter (David?)
> >> > clarify my thinking here?
> >> >
> >> > I have a requirement to extract (for a set of search results) a list
> >> > of exact "hit" ranges (field offsets, with support for multi-term
> >> > queries and span queries). Obviously, I'm only talking about queries
> >> > that relate to field content somehow, but this has always been quite
> >> > problematic and required the use of multiple helper classes
> >> > (WeightedSpanTermExtractor, MultiTermHighlighting, etc.) and pretty
> >> > hairy logic.
> >> >
> >> > So I turned to look at UnifiedHighlighter for help.
> >> >
> >> > Seems like the right way (?) to do it would be to override (and abuse)
> >> > UnifiedHighlighter's getFieldHighlighter method and return a field
> >> > highlighter with an override of:
> >> >
> >> > protected Passage[] highlightOffsetsEnums(List<OffsetsEnum>
> >> > offsetsEnums) throws IOException {
> >> >
> >> > so that I can capture and return a separate Passage for each
> >> > OffsetsEnum (I have my own code to deal with overlaps and merging, so
> >> > I can skip this entirely). Then, with a custom no-op PassageFormatter
> >> > I could simply get a list of those offsets.
> >> >
> >> > The problem with this approach is that there is currently no way to
> >> > access offsets in OffsetsEnum -- everything is protected (so
> >> > subclassable), but OffsetsEnum's methods are closed to
> >> > package-private scope. Namely these two:
> >> >
> >> > int startOffset() throws IOException {
> >> >   return postingsEnum.startOffset();
> >> > }
> >> >
> >> > int endOffset() throws IOException {
> >> >   return postingsEnum.endOffset();
> >> > }
> >> >
> >> > Should these two be protected to allow such customizations? (I agree
> >> > it's *very* low-level, but I have a practical use case where this
> >> > would be useful.)
> >> >
> >> > Am I on the right track here?
> >> >
> >> > Separately from that, I think it'd be nice to have some sort of
> >> > generic utility that, for a given document (or a set of documents)
> >> > would return such hit ranges... UnifiedHighlighter seems
> >> >
> >> > Dawid
> >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
>
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
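
For anyone finding this thread later: the "trivial PassageFormatter that simply returns the Passage list" idea, combined with the covariant-override trick Dawid mentions, might look roughly like the sketch below. This is not the actual Lucene code. To keep it compilable on its own, `SketchPassage` and `SketchPassageFormatter` are hypothetical stand-ins for `Passage` and `PassageFormatter` from `org.apache.lucene.search.uhighlight`, modeling only the match-offset accessors and the `Object`-returning `format` method; `HitRangeFormatter` and `HitRangeDemo` are likewise made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Passage; only match offsets are modeled.
final class SketchPassage {
    private final int[] matchStarts;
    private final int[] matchEnds;

    SketchPassage(int[] matchStarts, int[] matchEnds) {
        if (matchStarts.length != matchEnds.length) {
            throw new IllegalArgumentException("starts and ends must have equal length");
        }
        this.matchStarts = matchStarts.clone();
        this.matchEnds = matchEnds.clone();
    }

    int getNumMatches() { return matchStarts.length; }
    int[] getMatchStarts() { return matchStarts; }
    int[] getMatchEnds() { return matchEnds; }
}

// Hypothetical stand-in for Lucene's PassageFormatter, which is declared to return Object.
abstract class SketchPassageFormatter {
    abstract Object format(SketchPassage[] passages, String content);
}

// "No-op" formatter: instead of building highlighted text, it hands back the
// raw [start, end) offsets of every match in every passage.
final class HitRangeFormatter extends SketchPassageFormatter {
    // Covariant override: narrowing Object to List<int[]> spares callers a cast.
    @Override
    List<int[]> format(SketchPassage[] passages, String content) {
        List<int[]> ranges = new ArrayList<>();
        for (SketchPassage passage : passages) {
            int[] starts = passage.getMatchStarts();
            int[] ends = passage.getMatchEnds();
            for (int i = 0; i < passage.getNumMatches(); i++) {
                ranges.add(new int[] {starts[i], ends[i]});
            }
        }
        return ranges;
    }
}

final class HitRangeDemo {
    public static void main(String[] args) {
        String content = "the quick brown fox";
        // One passage spanning the whole field, as a whole-field break iterator would yield.
        SketchPassage passage = new SketchPassage(new int[] {4, 16}, new int[] {9, 19});
        List<int[]> ranges = new HitRangeFormatter().format(new SketchPassage[] {passage}, content);
        for (int[] range : ranges) {
            System.out.println(range[0] + "-" + range[1] + ": " + content.substring(range[0], range[1]));
        }
    }
}
```

Against real Lucene you would presumably extend the actual `PassageFormatter`, read the offsets through `Passage`'s public accessors (6.4 and later, per LUCENE-7559), and hand the formatter to the highlighter along with a whole-field break iterator; check the Javadocs for your version, since this sketch does not exercise those classes.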
