[ 
https://issues.apache.org/jira/browse/LUCENE-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-2508:
---------------------------------

    Fix Version/s:     (was: 4.7)
                   4.8

> Consolidate Highlighter implementations and a major refactor of the 
> non-termvector highlighter
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2508
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2508
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/highlighter
>         Environment: irrelevant
>            Reporter: Edward Drapkin
>            Priority: Minor
>              Labels: highlight, search
>             Fix For: 4.8
>
>         Attachments: LUCENE-2508.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Originally, I had planned to create a contrib module to allow people to 
> highlight multiple documents in parallel, but after talking to Uwe in IRC 
> about it, I realized that it was pretty useless.  However, I was already 
> sitting on an iterative highlighting algorithm that was much faster (20% to 
> 40% in my tests) and more accurate and, based on that same IRC conversation, 
> I decided not to let all the work that I had done go to waste and to try to 
> contribute it back.  Uwe had mentioned that "More like this" detects 
> term vectors when called and uses the term vector implementation when 
> possible, if I recall correctly, so I decided to do the same.
> The patch that I've attached is my first stab at this.  It's not nearly 
> complete, and full disclosure dictates that I say that it's not fully 
> documented and no unit tests have been written.  I wanted to go ahead 
> and open an issue to get some feedback on the approach that I've taken, and 
> because having an open issue will be a proverbial kick in my pants to 
> continue working on it.
> In short, what I've changed:
> * Completely rewritten the non-tv highlighter to be faster and cleaner.  
> There is some small loss in functionality for now, namely the loss of the 
> GradientHighlighter (I just haven't done this yet) and the lack of exposure 
> of TermFragments and their scores (I can expose this if it is deemed 
> necessary, this is one of the things I'd like feedback on). 
> * Moved org.apache.lucene.search.vectorhighlight and 
> org.apache.lucene.search.highlight to a single package with a unified 
> interface, search.highlight (with two sub-packages: 
> search.highlight.termvector and search.highlight.iterative, respectively).
> * Unified the highlighted term formatting into a single interface, 
> highlighter/Formatter, which both highlighters now use.  
> What I need to do before I personally would consider this finished:
> * Finish documentation, most specifically on TermVectorHighlighter.  I 
> haven't done this yet as I expect things to change quite a bit before 
> they're finalized and I really hate writing documentation that goes to waste, 
> but I do intend to complete this bullet :)
> * "Flesh out" the API of search.highlight.Highlighter as it's very barebones 
> right now
> * Continue removing and consolidating duplicate functionality, like I've done 
> with the highlighted word tag generation.
> What I think I need feedback on, before I can proceed:
> * FastTermVectorHighlighter and the iterative highlighters need completely 
> different sets of information in order to work.  The approach I've taken is 
> to expose a vectorHighlight method and an iterativeHighlight method in the 
> unified interface, as well as a single highlight method that takes 
> all the information needed for either of them.  I'm unsure if this is the 
> best way to do this.
> * The naming of things; I'm not sure if this is a big issue, or even an issue 
> at all, but I'd like to not break any conventions that may exist that I'm 
> unaware of.
> * How big of a deal is exposing the particular score of a segment from the 
> highlighting interface and does this need to be extended into the term vector 
> highlighting as well?
> * There are a lot of methods in the tv implementation that are marked 
> deprecated; since this release will almost definitely break backwards 
> compatibility anyway, are these safe to remove?
> * Any other input anyone else may have :)
> I'm going to continue to work on the things that I can work on, unless 
> someone tells me I'm wasting my time, and I look forward to hearing your 
> feedback! :)
> As a sidenote, because it may seem rather random that I would arbitrarily 
> rewrite a working algorithm in the non-tv highlighter: I did it originally 
> because I wanted to parallelize the highlighting (which was a failed 
> experiment) and simply to see if I could make the algorithm faster, as I find 
> that sort of thing particularly fun :)
> As a second sidenote, if anyone would like an explanation of the algorithm 
> for the highlighting I devised, and why I feel that it's more accurate, I'd 
> be happy to provide them with one (and benchmarks as well).
> Thanks,
> Eddie
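
For readers trying to picture the unified interface question raised above,
here is a minimal sketch of one possible shape.  It is not taken from the
attached patch.  Only the names Formatter, Highlighter, vectorHighlight,
iterativeHighlight and highlight come from the issue text; every signature,
the SketchHighlighter class, and the use of plain [start, end) offsets in
place of real term vectors are assumptions made for illustration, and no
Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the shared Formatter: wraps one matched term. */
interface Formatter {
    String format(String term);
}

/** Hypothetical unified interface; method names follow the issue, signatures are guesses. */
interface Highlighter {
    /** Uses precomputed [start, end) offsets, roughly as a term-vector highlighter would. */
    String vectorHighlight(String text, List<int[]> offsets, Formatter formatter);

    /** Re-scans the text for the term, roughly as a non-term-vector highlighter would. */
    String iterativeHighlight(String text, String term, Formatter formatter);

    /** Takes the inputs for both paths and picks one of them. */
    String highlight(String text, String term, List<int[]> offsets, Formatter formatter);
}

/** Toy implementation for illustration only; not the algorithm from the patch. */
final class SketchHighlighter implements Highlighter {

    @Override
    public String vectorHighlight(String text, List<int[]> offsets, Formatter formatter) {
        // Offsets are assumed to be sorted, non-overlapping and within bounds.
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] span : offsets) {
            out.append(text, last, span[0]);
            out.append(formatter.format(text.substring(span[0], span[1])));
            last = span[1];
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    @Override
    public String iterativeHighlight(String text, String term, Formatter formatter) {
        // Naive case-insensitive substring scan; a real highlighter would work on analyzed tokens.
        List<int[]> offsets = new ArrayList<>();
        if (!term.isEmpty()) {
            int i = 0;
            while (i + term.length() <= text.length()) {
                if (text.regionMatches(true, i, term, 0, term.length())) {
                    offsets.add(new int[] {i, i + term.length()});
                    i += term.length();
                } else {
                    i++;
                }
            }
        }
        return vectorHighlight(text, offsets, formatter);
    }

    @Override
    public String highlight(String text, String term, List<int[]> offsets, Formatter formatter) {
        // Prefer stored offsets when the caller has them; otherwise fall back to scanning.
        return offsets != null
                ? vectorHighlight(text, offsets, formatter)
                : iterativeHighlight(text, term, formatter);
    }
}
```

In this sketch a caller that has stored offsets passes them to highlight,
and a caller without them passes null and gets the scanning path.  Whether
a single method with optional inputs is preferable to two explicit methods
is exactly the design question the issue asks for feedback on.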



--
This message was sent by Atlassian JIRA
(v6.2#6252)

