[
https://issues.apache.org/jira/browse/LUCENE-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-2508:
---------------------------------
Fix Version/s: 4.8 (was: 4.7)
> Consolidate Highlighter implementations and a major refactor of the
> non-termvector highlighter
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-2508
> URL: https://issues.apache.org/jira/browse/LUCENE-2508
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/highlighter
> Environment: irrelevant
> Reporter: Edward Drapkin
> Priority: Minor
> Labels: highlight, search
> Fix For: 4.8
>
> Attachments: LUCENE-2508.patch
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Originally, I had planned to create a contrib module to allow people to
> highlight multiple documents in parallel, but after talking to Uwe in IRC
> about it, I realized that it was pretty useless. However, I was already
> sitting on an iterative highlighting algorithm that was much faster (my tests
> show 20% - 40%) and more accurate and, based on that same IRC conversation, I
> decided to not let all the work that I had done go to waste and try to
> contribute it back again. Uwe had mentioned that "More like this" detects
> term vectors when called and uses the term vector implementation when
> possible, if I recall correctly, so I decided to do the same.
> The patch that I've attached is my first stab at this. It's not nearly
> complete, and full disclosure dictates that I say that it's not fully
> documented and no unit tests have been written yet. I wanted to go ahead
> and open an issue to get some feedback on the approach that I've taken;
> the fact that the issue exists will also be a proverbial kick in the pants
> to continue working on it.
> In short, what I've changed:
> * Completely rewritten the non-tv highlighter to be faster and cleaner.
> There is some small loss in functionality for now, namely the loss of the
> GradientHighlighter (I just haven't done this yet) and the lack of exposure
> of TermFragments and their scores (I can expose this if it is deemed
> necessary, this is one of the things I'd like feedback on).
> * Moved org.apache.lucene.search.vectorhighlight and
> org.apache.lucene.search.highlight to a single package with a unified
> interface, search.highlight (with two sub-packages:
> search.highlight.termvector and search.highlight.iterative, respectively).
> * Unified the highlighted term formatting into a single interface,
> highlighter/Formatter, which both highlighters now use.
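
A rough sketch of what a single shared formatter could look like. Only the name Formatter comes from the description above. The format(String, float) signature, TagFormatter and FormatterDemo are assumptions made for illustration and are not taken from the attached patch.

```java
/** Hypothetical shape of the shared formatter; the signature in the patch may differ. */
interface Formatter {
    /** Wraps one matched term; score is an assumed per-term weight. */
    String format(String term, float score);
}

/** Wraps each matched term in a fixed pair of tags and ignores the score. */
final class TagFormatter implements Formatter {
    private final String preTag;
    private final String postTag;

    TagFormatter(String preTag, String postTag) {
        this.preTag = preTag;
        this.postTag = postTag;
    }

    @Override
    public String format(String term, float score) {
        return preTag + term + postTag;
    }
}

final class FormatterDemo {
    /** Toy stand-in for either highlighter: marks up whole-word matches of term. */
    static String highlight(String text, String term, Formatter formatter) {
        StringBuilder out = new StringBuilder();
        for (String token : text.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // Both highlighters would delegate the markup decision to the same Formatter.
            out.append(token.equalsIgnoreCase(term) ? formatter.format(token, 1.0f) : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Formatter bold = new TagFormatter("<b>", "</b>");
        System.out.println(highlight("Lucene makes search fast", "search", bold));
    }
}
```

The point of the design is that tag generation lives in one place, so swapping the markup does not require touching either highlighter.
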
> What I need to do before I personally would consider this finished:
> * Finish documentation, most specifically on TermVectorHighlighter. I
> haven't done this yet as I expect things to change quite a bit before
> they're finalized and I really hate writing documentation that goes to waste,
> but I do intend to complete this bullet :)
> * "Flesh out" the API of search.highlight.Highlighter as it's very barebones
> right now
> * Continue removing and consolidating duplicate functionality, like I've done
> with the highlighted word tag generation.
> What I think I need feedback on, before I can proceed:
> * FastTermVectorHighlighter and the iterative highlighters need completely
> different sets of information in order to work. The approach I've taken is
> to expose a vectorHighlight method and an iterativeHighlight method in the
> unified interface, as well as a single highlight method that takes all the
> information needed for either of them. I'm unsure if this is the best way
> to do this.
> * The naming of things; I'm not sure if this is a big issue, or even an issue
> at all, but I'd like to not break any conventions that may exist that I'm
> unaware of.
> * How big of a deal is exposing the particular score of a segment from the
> highlighting interface and does this need to be extended into the term vector
> highlighting as well?
> * There are a lot of methods in the tv implementation that are marked
> deprecated; since this release will almost certainly break backwards
> compatibility anyway, are these safe to remove?
> * Any other input anyone else may have :)
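
To make the first question above concrete, here is a minimal sketch of a unified interface with the three entry points. The method names vectorHighlight, iterativeHighlight and highlight come from the description. The parameter lists, TermVectorData, DemoHighlighter and the null check used to pick a path are all assumptions for illustration, not the patch's actual API.

```java
/** Hypothetical stand-in for stored term-vector data of one field. */
final class TermVectorData {
    final String source;

    TermVectorData(String source) {
        this.source = source;
    }
}

/** Hypothetical unified interface; only the method names follow the description. */
interface Highlighter {
    String vectorHighlight(TermVectorData vectors, String term);

    String iterativeHighlight(String text, String term);

    /** Accepts the inputs of both paths and prefers term vectors when present. */
    default String highlight(String text, TermVectorData vectors, String term) {
        return vectors != null
                ? vectorHighlight(vectors, term)
                : iterativeHighlight(text, term);
    }
}

/** Toy implementation that prefixes its output with the path that was taken. */
final class DemoHighlighter implements Highlighter {
    @Override
    public String vectorHighlight(TermVectorData vectors, String term) {
        return "tv:" + mark(vectors.source, term);
    }

    @Override
    public String iterativeHighlight(String text, String term) {
        return "iter:" + mark(text, term);
    }

    // Naive substring markup; a real highlighter would work on analyzed tokens.
    private static String mark(String text, String term) {
        return text.replace(term, "<b>" + term + "</b>");
    }
}

final class UnifiedDemo {
    public static void main(String[] args) {
        Highlighter h = new DemoHighlighter();
        System.out.println(h.highlight("fast search", null, "search"));
        System.out.println(h.highlight("fast search", new TermVectorData("fast search"), "search"));
    }
}
```

One trade-off this makes visible: the combined highlight method has to carry every argument either path might need, which is the cost of hiding the choice of implementation from the caller.
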
> I'm going to continue to work on the things that I can, at least unless
> someone tells me I'm wasting my time, and I look forward to hearing your
> feedback! :)
> As a sidenote, because it does seem rather random that I would arbitrarily
> rewrite a working algorithm in the non-tv highlighter: I did it originally
> because I wanted to parallelize the highlighting (which was a failed
> experiment) and simply to see if I could make the algorithm faster, as I find
> that sort of thing particularly fun :)
> As a second sidenote, if anyone would like an explanation of the algorithm
> for the highlighting I devised, and why I feel that it's more accurate, I'd
> be happy to provide them with one (and benchmarks as well).
> Thanks,
> Eddie
--
This message was sent by Atlassian JIRA
(v6.2#6252)