RE: Highlighting text, do I seriously have to reimplement this from scratch?

Allison, Timothy B. Tue, 04 Feb 2014 05:04:05 -0800

This will be of no immediate help, but in the next iteration of LUCENE-5317, 
which I'll post in a few weeks (if I can find the time), I'll have an option to 
pull concordance windows from character offsets which can be stored at index 
time (so you wouldn't have to re-analyze).  The current version of the 
non-committed patch relies on re-analysis.
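
A minimal sketch of the idea of pulling a concordance window from stored character offsets, with no re-analysis. It is plain Java and is not the LUCENE-5317 code; the class names, the `int[][]` offset layout, and the `context` parameter are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** One hit, split into left context, matched text, and right context. */
final class Window {
    final String left;
    final String match;
    final String right;

    Window(String left, String match, String right) {
        this.left = left;
        this.match = match;
        this.right = right;
    }
}

final class ConcordanceSketch {
    /**
     * Builds one window per hit.
     *
     * offsets: {start, end} character offsets (end exclusive), assumed to
     *          have been stored at index time and to lie within the text.
     * context: number of characters to keep on each side of the hit.
     */
    static List<Window> windows(CharSequence text, int[][] offsets, int context) {
        List<Window> out = new ArrayList<Window>();
        for (int[] o : offsets) {
            int start = o[0];
            int end = o[1];
            // Clamp the window to the bounds of the text.
            int from = Math.max(0, start - context);
            int to = Math.min(text.length(), end + context);
            out.add(new Window(
                    text.subSequence(from, start).toString(),
                    text.subSequence(start, end).toString(),
                    text.subSequence(end, to).toString()));
        }
        return out;
    }
}
```

Because only `subSequence` is called around each hit, the cost depends on the number of hits and the window size rather than on the length of the text.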


The basic strategy in LUCENE-5317 is to convert every query to a SpanQuery and 
then run getSpans on an index. 

This won't meet your needs for back-compat, and it also suffers from the 
"relying on a string" sin you mention.

Two things to check; you mention the first, but may also be interested in the 
second: 1) depending on the highlighter and the settings, make sure that you 
are able to highlight variants (fuzzy, wildcard, etc.) if you want to, and 2) 
be sure that you are able to highlight phrases (as opposed to terms that are 
pieces of a phrase query but don't actually occur in a phrase in the text).  
It was a surprise to me when I first came to Lucene that neither was the 
default or handled by all highlighters, but that makes complete sense to me 
now.

On your point about text being very large: is there a way to break your text 
into smaller documents and still meet your users' expectations (breaking books 
into chapters, etc.)?  In the highlighting/concordance realm, I've found that 
Lucene is still fast enough for my needs on large texts, but that it is far 
faster on lots of small docs than on fewer large docs.

Best

    Tim

-----Original Message-----
From: Trejkaz [mailto:trej...@trypticon.org] 
Sent: Tuesday, February 04, 2014 1:20 AM
To: Lucene Users Mailing List
Subject: Highlighting text, do I seriously have to reimplement this from 
scratch?

Hi all.

I'm trying to find a precise and reasonably efficient way to highlight
all occurrences of terms in the query, only highlighting fields which
match the corresponding fields used in the query. This seems like it
would be a fairly common requirement in applications. We have an
existing implementation, but it works by re-reading the entire text
back through the analyser. This is slow for large text, and sometimes
we analyse the same text twice - and both variants could well be in
the query. So I'm looking for a shortcut.

Perhaps due to the name, Lucene's highlighter module got my attention,
so I tried using that. The prototype I wrote *did* produce acceptable
results for the highlighting itself, but when it came time to think
about integrating into the real application, there didn't seem to be a
single part of the highlighter API designed to allow for that.

So I guess I will be forced to categorise lucene-highlighter as a
"toy", or perhaps as a fairly complete example of how to do
highlighting, and it might be useful for that at least.

What's wrong with the API?


Issue #1 - The API forces me to pass in a String.

Just because the highlighter wants some character data, I have to pass
String. Text can be very large and I would rather not have to wait for
the entire text to read into memory before I can pass it off to the
highlighter.

String is a final class, so any API which requires it for feeding in
something like character data is committing a massive sin, in my
opinion. If your text is in a database, you will have to retrieve
*all* of the text before you can use *any* of it for highlighting.

Had the API accepted something like Reader, CharBuffer or even
CharSequence, there would be no problem. We could make an alternative
implementation which reads directly from whatever storage it's in.
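
A minimal sketch of what a CharSequence-based API would make possible: a sequence that fetches fixed-size chunks from storage only when they are touched. `ChunkLoader` and `LazyText` are hypothetical names, not part of Lucene, and the chunk cache is deliberately unbounded to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical storage hook: returns chunk number chunkIndex of the text. */
interface ChunkLoader {
    String loadChunk(int chunkIndex);
}

/** A CharSequence that loads chunks lazily instead of holding one String. */
final class LazyText implements CharSequence {
    private final ChunkLoader loader;
    private final int chunkSize;
    private final int start; // inclusive offset into the full text
    private final int end;   // exclusive offset into the full text
    private final Map<Integer, String> cache; // shared with subsequences

    LazyText(ChunkLoader loader, int chunkSize, int totalLength) {
        this(loader, chunkSize, 0, totalLength, new HashMap<Integer, String>());
    }

    private LazyText(ChunkLoader loader, int chunkSize, int start, int end,
                     Map<Integer, String> cache) {
        this.loader = loader;
        this.chunkSize = chunkSize;
        this.start = start;
        this.end = end;
        this.cache = cache;
    }

    public int length() {
        return end - start;
    }

    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int abs = start + index;
        int chunk = abs / chunkSize;
        String data = cache.get(chunk);
        if (data == null) {
            // First touch of this chunk: fetch it from storage.
            data = loader.loadChunk(chunk);
            cache.put(chunk, data);
        }
        return data.charAt(abs % chunkSize);
    }

    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        // A view over the same storage; nothing is copied or loaded here.
        return new LazyText(loader, chunkSize, start + from, start + to, cache);
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }

    /** Number of chunks fetched so far (for illustration only). */
    int chunksLoaded() {
        return cache.size();
    }
}
```

With an API that accepted CharSequence, reading a small range of such an object would load only the chunks that cover that range.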

I notice that PostingsHighlighter has improved on this, by removing
the need for the text entirely. That's awesome, actually, but we can't
use it. We're stuck on version 3.6.2 as we are expected to be able to open
indexes created in 2.x. Plus, all our existing indexes lack the
required level of indexing to use it, and reindexing is not yet an
option. (Even if we get lucky enough to update to Lucene 4, I will
probably have to write a codec to read Lucene 2 indexes...)


Issue #2 - The API returns all results as String.

To actually integrate a highlighter, the absolute offsets are the bare
minimum requirement to highlight the text:

    http://docs.oracle.com/javase/7/docs/api/javax/swing/text/Highlighter.html

But the highlighter API only returns results as String.

Even if there were enough information in the string (and I don't think
there is!), getting the results back as String is what I call the
"pseudo-API anti-pattern." I shouldn't have to parse values out of a
string which the API I'm calling just formatted into it. In this
particular instance, it would have been nice to have a way to
programmatically get the offset of the highlights in each fragment.
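
A minimal sketch of the result shape described here: highlights returned as absolute offsets rather than as markup inside a String. `HighlightSpan` and `OffsetHighlights` are hypothetical names, and the naive literal matching is only a stand-in for real query and analyser logic.

```java
import java.util.ArrayList;
import java.util.List;

/** A highlight as data: absolute character offsets, end exclusive. */
final class HighlightSpan {
    final int start;
    final int end;

    HighlightSpan(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class OffsetHighlights {
    /**
     * Returns the offsets of every non-overlapping literal occurrence of
     * term in text. Case-sensitive; no analysis is applied.
     */
    static List<HighlightSpan> find(CharSequence text, String term) {
        List<HighlightSpan> spans = new ArrayList<HighlightSpan>();
        if (term.isEmpty()) {
            return spans;
        }
        int last = text.length() - term.length();
        for (int i = 0; i <= last; i++) {
            int j = 0;
            while (j < term.length() && text.charAt(i + j) == term.charAt(j)) {
                j++;
            }
            if (j == term.length()) {
                spans.add(new HighlightSpan(i, i + j));
                i += j - 1; // continue after the match
            }
        }
        return spans;
    }
}
```

The caller can hand each `start`/`end` pair straight to a view component that takes offsets, with no parsing of formatted output.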


As for our own requirements, the bit about computing the fragments is
completely unnecessary. We have a piece of view-time logic which
figures out the fragments based on where the highlights are
vertically. This works better than using text proximity, because using
text proximity causes the number of highlighted lines to visibly
shuffle, whereas showing consistently the same number of lines above
and below produces an effect similar to resizing a text editor window,
which people should already be used to.
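
A minimal sketch of line-based context selection, assuming "vertically" means counting lines of text separated by '\n'. The actual view-time logic is not shown in this message, so `LineContext` and its parameters are hypothetical.

```java
final class LineContext {
    /**
     * Returns {firstLine, lastLine} (0-based, inclusive) to display around a
     * highlight, keeping contextLines lines above and below it.
     *
     * Assumes 0 <= highlightStart < highlightEnd <= text.length().
     */
    static int[] visibleLines(CharSequence text, int highlightStart,
                              int highlightEnd, int contextLines) {
        int startLine = 0;  // line containing the first highlighted char
        int endLine = 0;    // line containing the last highlighted char
        int totalLines = 1;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                if (i < highlightStart) {
                    startLine++;
                }
                if (i < highlightEnd - 1) {
                    endLine++;
                }
                totalLines++;
            }
        }
        // Clamp to the first and last line of the text.
        int first = Math.max(0, startLine - contextLines);
        int last = Math.min(totalLines - 1, endLine + contextLines);
        return new int[] {first, last};
    }
}
```

Away from the start and end of the text, every highlight gets the same number of lines above and below it, which is the property described above.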

For getting the highlights themselves, is there any faster way than
reading the whole text every time you want to run it?

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
