RE: Highlighting query results, my method is too crude, but how to improve it?

Trevor Nicholls Tue, 21 Feb 2023 09:07:02 -0800

Thank you Dawid, very useful.

cheers
T


-----Original Message-----
From: Dawid Weiss <[email protected]> 
Sent: Tuesday, February 21, 2023 7:17 PM
To: [email protected]
Subject: Re: Highlighting query results, my method is too crude, but how to 
improve it?

You can use two different queries - the query is just used as a source of 
information on what to highlight (it can even be completely different and 
unrelated to the query that retrieved the documents).
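
As a rough illustration of the point above, here is a minimal sketch that highlights the same text once with the full query and once with the text-only query. It assumes Lucene 9.x with the core, queryparser and highlighter modules on the classpath; the class name, the sample sentence and the "[" / "]" markers are made up for the example and are not from the original code.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

// Hypothetical class name; assumes Lucene 9.x (core, queryparser, highlighter).
class TwoQueryHighlightSketch {

  // The text-only part of the query from the original post.
  static final String TEXT_QUERY =
      "(title:graph AND title:axis) OR (text:graph AND text:axis)";

  // The same query narrowed by category and volume; this one selects documents.
  static final String FULL_QUERY =
      "(category:note AND volume:extra) AND (" + TEXT_QUERY + ")";

  /** Wraps each term of highlightQuery found in text in [ ]; null if none match. */
  static String highlight(Analyzer analyzer, String highlightQuery, String text) {
    try {
      Query query = new QueryParser("text", analyzer).parse(highlightQuery);
      Highlighter highlighter =
          new Highlighter(new SimpleHTMLFormatter("[", "]"), new QueryScorer(query));
      // Treat the whole text as a single fragment to keep the output simple.
      highlighter.setTextFragmenter(new NullFragmenter());
      return highlighter.getBestFragment(analyzer, "text", text);
    } catch (Exception e) {
      // Rethrown unchecked only to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Analyzer analyzer = new StandardAnalyzer();
    String text = "A note on labelling each graph axis, with extra examples";

    // The full query also marks the filter terms "note" and "extra".
    System.out.println(highlight(analyzer, FULL_QUERY, text));
    // The text-only query marks just "graph" and "axis".
    System.out.println(highlight(analyzer, TEXT_QUERY, text));
  }
}
```

In a search method like the one quoted below, this would mean passing the full query to searcher.search() and a query parsed from the text-only string to QueryScorer.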

Separately, the unified highlighter is great, but you may also try the Matches 
API. I found it to be a much better source of information for getting accurate 
highlight ranges for more complex queries (a mix of term, intervals, spans, 
etc.). This test class uses a highlighter implementation that leverages this API:

https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L241-L269

If things work for you then no rush to switch over - it's yet another option to 
use.

Dawid

On Mon, Feb 20, 2023 at 4:21 PM Trevor Nicholls <[email protected]> wrote:

> Well I don't know; I suppose that's part of my question.
>
> It's not immediately obvious to me that the "query" in these two lines:
>
>   Highlighter highlighter = new Highlighter( htmlFormatter,
>       new QueryScorer( query ));
>   TopDocs results = searcher.search( query, max );
>
> has to be the same. Maybe you can use the full query in 
> searcher.search() to select your documents, and use the text-only 
> query in Highlighter(htmlFormatter, new QueryScorer(query)) to find 
> the matching terms only in the selected records' text fields.
>
> Can you? If so, this is not as difficult as I thought. But I might be 
> missing something.
>
> cheers
> T
>
> -----Original Message-----
> From: Mikhail Khludnev <[email protected]>
> Sent: Tuesday, February 21, 2023 3:22 AM
> To: [email protected]
> Subject: Re: Highlighting query results, my method is too crude, but 
> how to improve it?
>
> Hello,
> Maybe I'm missing the point, but can you highlight a different query 
> from the one you search with?
>
> On Mon, Feb 20, 2023 at 5:07 PM Trevor Nicholls <[email protected]> wrote:
>
> > Sorry, I apologize for this being a bit long and for explaining the 
> > problem at the very bottom after all the background, rather than 
> > starting with it at the top. I thought it was easier to explain it 
> > this way, so please bear with me!
> >
> >
> >
> > So I've indexed a library of technical documentation, and the index 
> > has stored several fields per document: category, volume, title, 
> > text, etc.
> > Title and text are tokenised and stored, all other fields are just 
> > indexed.
> >
> >
> >
> > When searching the index I am using the standard queryparser, and a 
> > typical query might look like
> >
> >
> >
> >     "(title:graph AND title:axis) OR (text:graph AND text:axis)"
> >
> >
> >
> > Because indexing includes synonym matching, I need the search to 
> > identify matched terms in the content, e.g. in the above "graph" and 
> > "chart" are synonyms, and "axis" and "axes" are as well.
> >
> >
> >
> > So my search method executes the query to get a set of matching 
> > documents, and uses the highlighter methods to identify the matches 
> > in the content:
> >
> >
> >
> >   private void doSearch( IndexReader reader, IndexSearcher searcher,
> >       Query query, int max, FileWriter writer, FileWriter matchlist )
> >       throws IOException {
> >
> >     // hlPre="\001"; hlPost="\002": control characters delimit each match
> >     SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter( hlPre, hlPost );
> >     Highlighter highlighter = new Highlighter( htmlFormatter,
> >         new QueryScorer( query ));
> >
> >     TopDocs results = searcher.search( query, max );
> >     ScoreDoc[] hits = results.scoreDocs;
> >     int numTotalHits = Math.toIntExact( results.totalHits.value );
> >
> >     HashSet<String> matchedWords = new HashSet<String>();
> >     int start = 0;
> >     int end = Math.min( numTotalHits, max );
> >
> >     for (int i = start; i < end; i++) {
> >       Document doc = searcher.doc( hits[i].doc );
> >       String text = doc.get( "text" );
> >       try {
> >         TokenStream tokens = TokenSources.getTokenStream( "text",
> >             null, text, analyzer, -1 );
> >         TextFragment[] frag = highlighter.getBestTextFragments(
> >             tokens, text, true, 100 );
> >         for ( int j = 0; j < frag.length; j++) {
> >           // keep only fragments that contain at least one match
> >           if (( frag[j] != null ) && ( frag[j].getScore() > 0 )) {
> >             addMatchedTerms( matchedWords, frag[j].toString() );
> >           }
> >         }
> >       } catch ( IOException | InvalidTokenOffsetsException e ) {
> >         // handling elided in the original post
> >       }
> >       writer.write( doc.get("id") + "\n" );
> >     }
> >
> >     for ( String word : matchedWords ) {
> >       matchlist.write( word + "\n" );
> >     }
> >   }
> >
> >
> >
> > There's more of course but that's the guts of it; I haven't shown 
> > the analyzer or the method which extracts the delimited words from 
> > the fragment and adds them to the matchedWords hashset.
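
The extraction method is not shown in the post, so the following is only a guess at what such a helper might look like, not the poster's actual code. It assumes the "\001" and "\002" delimiters mentioned in the code comment, and it lower-cases each word so that "Graph" and "graph" collapse into one entry, which is an assumption as well. The class name is hypothetical.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the helper that the post mentions but does not show.
class MatchedTermsSketch {

  // Delimiters as given in the post: hlPre="\001", hlPost="\002".
  static final char HL_PRE = '\001';
  static final char HL_POST = '\002';

  /** Adds every substring enclosed in HL_PRE ... HL_POST to matchedWords. */
  static void addMatchedTerms(Set<String> matchedWords, String fragment) {
    int from = 0;
    while (true) {
      int open = fragment.indexOf(HL_PRE, from);
      if (open < 0) {
        break; // no further highlighted terms
      }
      int close = fragment.indexOf(HL_POST, open + 1);
      if (close < 0) {
        break; // unterminated marker: ignore the rest of the fragment
      }
      // Lower-casing is an assumption, made so case variants are stored once.
      matchedWords.add(fragment.substring(open + 1, close).toLowerCase(Locale.ROOT));
      from = close + 1;
    }
  }

  public static void main(String[] args) {
    // A TreeSet is used here only to get a stable print order.
    Set<String> words = new TreeSet<>();
    addMatchedTerms(words, "Label each \001axis\002 of the \001Chart\002");
    System.out.println(words); // prints [axis, chart]
  }
}
```

Because the set de-duplicates, calling the helper once per fragment across all hits leaves one entry per distinct matched word.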
> >
> >
> >
> > In the simple example shown this works fine, and the matched words 
> > include graph and axis and any other synonyms found in the selected 
> > documents.
> >
> >
> >
> > The problem occurs when I use the query to filter the search by 
> > category or by volume. I'm doing this by adding extra conditions to 
> > the query, e.g.
> >
> >
> >
> >     "(category:note AND volume:extra) AND ((title:graph AND
> > title:axis) OR (text:graph AND text:axis))"
> >
> >
> >
> > When we do this the search correctly returns only documents in the 
> > selected category/volume, but unfortunately the 
> > highlighter.getBestTextFragments() method marks all the occurrences 
> > of "note" and "extra" in the content too.
> > This we don't want.
> >
> > I can't see how to separate that part of the query out in the 
> > highlighter methods, and I wonder what best practice would be here.
> > I'm probably being naive in using a single query for the whole job. 
> > Do I need to run a query for category/volume, and then a subquery on 
> > text and title, and just use the subquery in the highlighter? If 
> > that's the approach, is there a nice simple explanation somewhere 
> > you could point me to? Because I'm a simple user who has never done 
> > anything beyond using the simple QueryParser for everything.
> >
> >
> >
> > cheers
> >
> > T
> >
> >
> >
> >
> >
> >
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


