Well I don't know; I suppose that's part of my question. It's not immediately obvious to me that the "query" in these two lines:

    Highlighter highlighter = new Highlighter( htmlFormatter, new QueryScorer( query ));
    TopDocs results = searcher.search( query, max );

has to be the same. Maybe you can use the full query in
searcher.search() to select your documents, and use the text-only query
in Highlighter(htmlFormatter, new QueryScorer(query)) to find only the
matching terms in the selected records' text fields.

Can you? If so, this is not as difficult as I thought. But I might be
missing something.

cheers
T

-----Original Message-----
From: Mikhail Khludnev <m...@apache.org>
Sent: Tuesday, February 21, 2023 3:22 AM
To: java-user@lucene.apache.org
Subject: Re: Highlighting query results, my method is too crude, but how to improve it?

Hello,

Maybe I'm missing some point. But can you highlight a different query
from the one you search with?

On Mon, Feb 20, 2023 at 5:07 PM Trevor Nicholls <tre...@castingthevoid.com> wrote:

> Sorry, I apologize for this being a bit long and for explaining the
> problem at the very bottom after all the background, rather than
> starting with it at the top. I thought it was easier to explain like
> this; please bear with me!
>
> So I've indexed a library of technical documentation, and the index
> has stored several fields per document: category, volume, title, text,
> etc. Title and text are tokenised and stored; all other fields are
> just indexed.
>
> When searching the index I am using the standard queryparser, and a
> typical query might look like
>
>   "(title:graph AND title:axis) OR (text:graph AND text:axis)"
>
> Because indexing includes synonym matching, I need the search to
> identify matched terms in the content, e.g. in the above "graph" and
> "chart" are synonyms, and "axis" and "axes" are as well.
>
> So my search method executes the query to get a set of matching
> documents, and uses the highlighter methods to identify the matches in
> the content:
>
> private void doSearch( IndexReader reader, IndexSearcher searcher,
>     Query query, int max, FileWriter writer, FileWriter matchlist )
>     throws IOException {
>
>   // hlPre="\001"; hlPost="\002";
>   SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter( hlPre, hlPost );
>   Highlighter highlighter = new Highlighter( htmlFormatter, new QueryScorer( query ));
>
>   TopDocs results = searcher.search( query, max );
>   ScoreDoc[] hits = results.scoreDocs;
>   int numTotalHits = Math.toIntExact( results.totalHits.value );
>
>   HashSet<String> matchedWords = new HashSet<String>();
>   int start = 0;
>   int end = Math.min( numTotalHits, max );
>
>   for (int i = start; i < end; i++) {
>     Document doc = searcher.doc( hits[i].doc );
>     String text = doc.get( "text" );
>     try {
>       TokenStream tokens = TokenSources.getTokenStream( "text", null, text, analyzer, -1 );
>       TextFragment[] frag = highlighter.getBestTextFragments( tokens, text, true, 100 );
>       for ( int j = 0; j < frag.length; j++) {
>         if (( frag[j] != null ) && ( frag[j].getScore() > 0 )) {
>           addMatchedTerms( matchedWords, frag[j].toString() );
>         }
>       }
>     } catch ( Exception e ) {
>       // exception handling not shown in the original post
>     }
>     writer.write( doc.get("id") + "\n" );
>   }
>   for ( String word : matchedWords ) {
>     matchlist.write( word + "\n" );
>   }
> }
>
> There's more of course but that's the guts of it; I haven't shown the
> analyzer or the method which extracts the delimited words from the
> fragment and adds them to the matchedWords hashset.
>
> In the simple example shown this works fine, and the matched words
> include graph and axis and any other synonyms found in the selected
> documents.
>
> The problem occurs when I use the query to filter the search by
> category or by volume. I'm doing this by adding extra conditions to
> the query, e.g.
>
>   "(category:note AND volume:extra) AND ((title:graph AND
>   title:axis) OR (text:graph AND text:axis))"
>
> When we do this the search correctly returns only documents in the
> selected category/volume, but unfortunately the
> highlighter.getBestTextFragments() method marks all the occurrences of
> "note" and "extra" in the content too. This we don't want.
>
> I can't see how to separate that part of the query out in the
> highlighter methods, and I wonder what best practice would be here.
> I'm probably being naive in using a single query for the whole job. Do
> I need to run a query for category/volume, and then a subquery on text
> and title, and just use the subquery in the highlighter? If that's the
> approach, is there a nice simple explanation somewhere you could point
> me to? Because I'm a simple user who has never done anything beyond
> using the simple QueryParser for everything.
>
> cheers
> T

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
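
On the two-query idea raised at the top of this message, here is a
minimal sketch at the query-string level. It assumes the
category/volume conditions and the title/text conditions are available
as two separate strings before parsing, and that the highlighter can be
given a different query from the one used to search, which is exactly
the open question in this thread. The class SplitQuery and its method
combine are hypothetical names for illustration, not part of Lucene;
the Lucene calls appear only in comments and have not been run here.

```java
// Hypothetical helper (not part of Lucene): keeps the filter conditions
// out of the string that is later handed to the highlighter.
class SplitQuery {

    // Builds the full search string from a filter part and a content part.
    // If there is no filter, the content query is used unchanged.
    static String combine(String filter, String content) {
        if (filter == null || filter.trim().isEmpty()) {
            return content;
        }
        return "(" + filter.trim() + ") AND (" + content + ")";
    }

    public static void main(String[] args) {
        String content = "(title:graph AND title:axis) OR (text:graph AND text:axis)";
        String filter = "category:note AND volume:extra";

        // Full query: selects the documents.
        String searchString = combine(filter, content);
        // Content-only query: the only terms we want marked in the text.
        String highlightString = content;

        System.out.println("search:    " + searchString);
        System.out.println("highlight: " + highlightString);

        // Assumed usage with the objects from the original doSearch method:
        //   Query searchQuery    = parser.parse( searchString );
        //   Query highlightQuery = parser.parse( highlightString );
        //   TopDocs results = searcher.search( searchQuery, max );
        //   Highlighter highlighter =
        //       new Highlighter( htmlFormatter, new QueryScorer( highlightQuery ));
    }
}
```

With the example strings from the original post, combine() reproduces
the filtered query quoted there, while the highlighter would only ever
see the title/text conditions, so "note" and "extra" would not be among
the terms it scores.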