Re: Highlighting query results, my method is too crude, but how to improve it?

Mikhail Khludnev Mon, 20 Feb 2023 06:23:03 -0800

Hello,
Maybe I'm missing something, but can you highlight with a different query
than the one you search with?
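
A rough sketch of what I mean, assuming Lucene 8 or 9 with the queryparser
and highlighter modules on the classpath. The class and method names
(SplitQuerySketch, buildSearchQuery, highlightContentOnly) are made up for
illustration, and StandardAnalyzer only stands in for your synonym-aware
analyzer. The category/volume part is used for searching only; the
highlighter is built from the title/text part alone:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SplitQuerySketch {

    /** Query passed to searcher.search(): the filter restricts, the content part matches and scores. */
    static Query buildSearchQuery(Query filterQuery, Query contentQuery) {
        return new BooleanQuery.Builder()
            .add(filterQuery, BooleanClause.Occur.FILTER)  // must match, does not affect the score
            .add(contentQuery, BooleanClause.Occur.MUST)
            .build();
    }

    /** Highlights using the content query only, so filter terms are never marked. */
    static String highlightContentOnly(Analyzer analyzer, Query contentQuery, String text)
            throws IOException, InvalidTokenOffsetsException {
        Highlighter highlighter = new Highlighter(
            new SimpleHTMLFormatter("\001", "\002"),  // same markers as hlPre/hlPost
            new QueryScorer(contentQuery));
        return highlighter.getBestFragment(analyzer, "text", text);
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();  // assumption: replace with your own analyzer
        QueryParser parser = new QueryParser("text", analyzer);

        // Parse the two parts separately instead of as one query string.
        Query filterQuery = parser.parse("category:note AND volume:extra");
        Query contentQuery = parser.parse(
            "(title:graph AND title:axis) OR (text:graph AND text:axis)");

        Query searchQuery = buildSearchQuery(filterQuery, contentQuery);
        System.out.println("search with:    " + searchQuery);
        System.out.println("highlight with: " + contentQuery);
        System.out.println(highlightContentOnly(
            analyzer, contentQuery, "An extra note: the graph has one axis."));
    }
}
```

In your doSearch() that would mean passing two queries in: searchQuery goes
to searcher.search(), and contentQuery goes to new QueryScorer().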


On Mon, Feb 20, 2023 at 5:07 PM Trevor Nicholls <tre...@castingthevoid.com>
wrote:

> Apologies for the length of this message, and for putting the actual
> problem at the very bottom after all the background rather than at the
> top. I thought it was easier to explain this way, so please bear with me!
>
>
>
> So I've indexed a library of technical documentation, and the index has
> several fields per document: category, volume, title, text, etc.
> Title and text are tokenised and stored; all other fields are just indexed.
>
>
>
> When searching the index I am using the standard queryparser, and a typical
> query might look like
>
>
>
>     "(title:graph AND title:axis) OR (text:graph AND text:axis)"
>
>
>
> Because indexing includes synonym matching, I need the search to identify
> the terms that actually matched in the content. In the example above,
> "graph" and "chart" are synonyms, as are "axis" and "axes".
>
>
>
> So my search method executes the query to get a set of matching documents,
> and uses the highlighter methods to identify the matches in the content:
>
>
>
>   private void doSearch( IndexReader reader, IndexSearcher searcher, Query query, int max, FileWriter writer, FileWriter matchlist ) throws IOException {
>
>
>
>     // hlPre="\001"; hlPost="\002";
>     SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter( hlPre, hlPost );
>
>     Highlighter highlighter = new Highlighter( htmlFormatter, new QueryScorer( query ));
>
>
>
>     TopDocs results = searcher.search( query, max );
>
>     ScoreDoc[] hits = results.scoreDocs;
>
>     int numTotalHits = Math.toIntExact( results.totalHits.value );
>
>
>
>     HashSet<String> matchedWords = new HashSet<String>();
>
>     int start = 0;
>
>     int end = Math.min( numTotalHits, max );
>
>
>
>     for (int i = start; i < end; i++) {
>
>       Document doc = searcher.doc( hits[i].doc );
>
>       String text = doc.get( "text" );
>
>       try {
>
>         TokenStream tokens = TokenSources.getTokenStream( "text", null, text, analyzer, -1 );
>
>         TextFragment[] frag = highlighter.getBestTextFragments( tokens, text, true, 100 );
>
>         for ( int j = 0; j < frag.length; j++) {
>
>           if (( frag[j] != null ) && ( frag[j].getScore() > 0 )) {
>
>             addMatchedTerms( matchedWords, frag[j].toString() );
>
>           }
>
>         }
>
>       } catch ( IOException | InvalidTokenOffsetsException e ) {
>
>       }
>
>       writer.write( doc.get("id") + "\n" );
>
>     }
>
>     for ( String word : matchedWords ) {
>
>       matchlist.write( word + "\n" );
>
>     }
>
>   }
>
>
>
> There's more of course but that's the guts of it; I haven't shown the
> analyzer or the method which extracts the delimited words from the fragment
> and adds them to the matchedWords hashset.
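
For reference, a minimal sketch of what such an extraction helper could look
like, assuming the markers are the \001 and \002 characters from
hlPre/hlPost. The original method was not shown, so this is only a guess at
its shape; the class name MatchedTerms is made up for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class MatchedTerms {

    // Assumption: same marker characters as hlPre="\001" and hlPost="\002".
    static final char HL_PRE = '\001';
    static final char HL_POST = '\002';

    /** Adds every substring enclosed by HL_PRE ... HL_POST in the fragment to matchedWords. */
    static void addMatchedTerms(Set<String> matchedWords, String fragment) {
        int from = 0;
        while (true) {
            int open = fragment.indexOf(HL_PRE, from);
            if (open < 0) {
                break;                      // no more highlighted terms
            }
            int close = fragment.indexOf(HL_POST, open + 1);
            if (close < 0) {
                break;                      // unbalanced marker: ignore the tail
            }
            matchedWords.add(fragment.substring(open + 1, close));
            from = close + 1;               // continue after the closing marker
        }
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>();
        addMatchedTerms(words, "The \001chart\002 has two \001axes\002.");
        for (String word : words) {
            System.out.println(word);
        }
    }
}
```

Because the words go into a set, a term highlighted many times across the
fragments is recorded once.
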
>
>
>
> In the simple example shown this works fine, and the matched words include
> graph and axis and any other synonyms found in the selected documents.
>
>
>
> The problem occurs when I use the query to filter the search by category or
> by volume. I'm doing this by adding extra conditions to the query, e.g.
>
>
>
>     "(category:note AND volume:extra) AND ((title:graph AND title:axis) OR (text:graph AND text:axis))"
>
>
>
> When we do this the search correctly returns only documents in the selected
> category/volume, but unfortunately the highlighter.getBestTextFragments()
> method marks all the occurrences of "note" and "extra" in the content too.
> This we don't want.
>
> I can't see how to separate that part of the query out in the highlighter
> methods, and I wonder what best practice would be here. I'm probably being
> naive in using a single query for the whole job. Do I need to run a query
> for category/volume, then a subquery on text and title, and use only the
> subquery in the highlighter? If that's the approach, is there a nice simple
> explanation somewhere you could point me to? I'm a simple user who has
> never done anything beyond using the simple QueryParser for everything.
>
>
>
> cheers
>
> T
>
>
>
>
>
>
>
>

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
