This will be of no immediate help, but in the next iteration of LUCENE-5317, which I'll post in a few weeks (if I can find the time), I'll have an option to pull concordance windows from character offsets which can be stored at index time (so you wouldn't have to re-analyze). The current version of the non-committed patch relies on re-analysis.
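To make the offset idea concrete, here is a minimal sketch of pulling a concordance window once the start and end character offsets of a hit are already known, so no re-analysis is needed. This is not the LUCENE-5317 patch: the `ConcordanceSketch` class, the `Hit` holder and the `window` method are hypothetical names invented for illustration, and how the offsets get stored and read back from the index is assumed, not shown.

```java
public class ConcordanceSketch {

    /** Hypothetical holder for one hit's character offsets, as recorded at index time. */
    static final class Hit {
        final int start; // inclusive
        final int end;   // exclusive

        Hit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds "left[match]right" with at most `context` characters on each side.
     * The window is clamped to the bounds of the text.
     */
    static String window(CharSequence text, Hit hit, int context) {
        int from = Math.max(0, hit.start - context);
        int to = Math.min(text.length(), hit.end + context);
        StringBuilder sb = new StringBuilder();
        sb.append(text.subSequence(from, hit.start));
        sb.append('[').append(text.subSequence(hit.start, hit.end)).append(']');
        sb.append(text.subSequence(hit.end, to));
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        // Offsets 16..19 cover "fox"; in practice they would come from the index.
        System.out.println(window(text, new Hit(16, 19), 6));
    }
}
```

Because `window` takes a `CharSequence`, the caller is free to back it with something other than a fully materialised `String`.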
The basic strategy in LUCENE-5317 is to convert every query to a SpanQuery and then run getSpans on an index. This won't meet your needs for back-compat, and it also suffers from the "relying on a string" sin you mention.

You mention the first point, but may also be interested in the second: 1) depending on the highlighter and the settings, make sure that you are able to highlight variants (fuzzy, wildcard, etc.) if you want to, and 2) be sure that you are able to highlight phrases (as opposed to terms that are phrasal pieces that aren't actually in a phrase). It was a surprise to me that both weren't default and handled by all highlighters when I first came to Lucene, but they make complete sense to me now.

On your point about text being very large: is there a way to break your text into smaller documents and still meet your users' expectations (breaking books into chapters, etc.)? In the highlighting/concordance realm, I've found that Lucene is still fast enough for my needs on large texts, but that it is far faster on lots of small docs than on fewer large docs.

Best,

Tim

-----Original Message-----
From: Trejkaz [mailto:trej...@trypticon.org]
Sent: Tuesday, February 04, 2014 1:20 AM
To: Lucene Users Mailing List
Subject: Highlighting text, do I seriously have to reimplement this from scratch?

Hi all.

I'm trying to find a precise and reasonably efficient way to highlight all occurrences of terms in the query, only highlighting fields which match the corresponding fields used in the query. This seems like it would be a fairly common requirement in applications.

We have an existing implementation, but it works by re-reading the entire text back through the analyser. This is slow for large text, and sometimes we analyse the same text twice - and both variants could well be in the query. So I'm looking for a shortcut.

Perhaps due to the name, Lucene's highlighter module got my attention, so I tried using that.
The prototype I wrote *did* produce acceptable results for the highlighting itself, but when it came time to think about integrating into the real application, there didn't seem to be a single part of the highlighter API designed to allow for that.

So I guess I will be forced to categorise lucene-highlighter as a "toy", or perhaps as a fairly complete example of how to do highlighting, and it might be useful for that at least.

What's wrong with the API?

Issue #1 - The API forces me to pass in a String.

Just because the highlighter wants some character data, I have to pass String. Text can be very large and I would rather not have to wait for the entire text to read into memory before I can pass it off to the highlighter. String is a final class, so any API which requires it for feeding in something like character data is committing a massive sin, in my opinion. If your text is in a database, you will have to retrieve *all* of the text before you can use *any* of it for highlighting.

Had the API accepted something like Reader, CharBuffer or even CharSequence, there would be no problem. We could make an alternative implementation which reads directly from whatever storage it's in.

I notice that PostingsHighlighter has improved on this, by removing the need for the text entirely. That's awesome, actually. We can't use it. We're stuck on version 3.6.2 as we are expected to be able to open indexes created in 2.x. Plus, all our existing indexes lack the required level of indexing to use it, and reindexing is not yet an option. (Even if we get lucky enough to update to Lucene 4, I will probably have to write a codec to read Lucene 2 indexes...)

Issue #2 - The API returns all results as String.

To actually integrate a highlighter, the absolute offsets are the bare minimum requirement to highlight the text:

http://docs.oracle.com/javase/7/docs/api/javax/swing/text/Highlighter.html

But the highlighter API only returns results as String.
Even if there were enough information in the string (and I don't think there is!), getting the results back as String is what I call the "pseudo-API anti-pattern." I shouldn't have to parse values out of a string which the API I'm calling just formatted into it. In this particular instance, it would have been nice to have a way to programmatically get the offset of the highlights in each fragment.

As for our own requirements, the bit about computing the fragments is completely unnecessary. We have a piece of view-time logic which figures out the fragments based on where the highlights are vertically. This works better than using text proximity, because using text proximity causes the number of highlighted lines to visibly shuffle, whereas showing consistently the same number of lines above and below produces an effect similar to resizing a text editor window, which people should already be used to.

For getting the highlights themselves, is there any faster way than reading the whole text every time you want to run it?

TX
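As a rough illustration of the API shape asked for under Issue #1 and Issue #2, the sketch below takes its text from a `Reader` and returns absolute character offsets instead of a formatted String. It is not part of lucene-highlighter: `OffsetHighlighterSketch`, `Range` and `highlight` are hypothetical names, and the tokenisation (split on anything that is not a letter or digit, then lower-case) is a naive stand-in for a real analyser, so it handles neither phrases nor fuzzy or wildcard variants.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class OffsetHighlighterSketch {

    /** Absolute character range of one highlight: start inclusive, end exclusive. */
    public static final class Range {
        public final int start;
        public final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Streams the text one character at a time and records where query terms occur. */
    public static List<Range> highlight(Reader text, Set<String> lowerCaseTerms)
            throws IOException {
        List<Range> ranges = new ArrayList<Range>();
        StringBuilder token = new StringBuilder();
        int pos = 0;        // absolute offset of the character being read
        int tokenStart = 0; // absolute offset where the current token began
        int c;
        while ((c = text.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                if (token.length() == 0) {
                    tokenStart = pos;
                }
                token.append(Character.toLowerCase((char) c));
            } else {
                flush(token, tokenStart, lowerCaseTerms, ranges);
            }
            pos++;
        }
        flush(token, tokenStart, lowerCaseTerms, ranges); // last token, if any
        return ranges;
    }

    /** Emits a range if the finished token is a query term, then resets the buffer. */
    private static void flush(StringBuilder token, int tokenStart,
                              Set<String> terms, List<Range> out) {
        if (token.length() > 0 && terms.contains(token.toString())) {
            out.add(new Range(tokenStart, tokenStart + token.length()));
        }
        token.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        Reader text = new StringReader("Lucene is fast; lucene highlights.");
        for (Range r : highlight(text, Collections.singleton("lucene"))) {
            System.out.println(r.start + "-" + r.end);
        }
    }
}
```

Each `Range` could then be handed to something like `javax.swing.text.Highlighter.addHighlight(int, int, HighlightPainter)`, with fragment selection left to the view layer. Note that this sketch still reads the whole text once per call; it addresses only the input and output types, not the cost of re-reading.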