I apologize for this being a bit long, and for explaining the problem at the very bottom after all the background rather than starting with it at the top. I thought it was easier to explain this way, so please bear with me!
So I've indexed a library of technical documentation, and the index stores several fields per document: category, volume, title, text, etc. Title and text are tokenised and stored; all other fields are just indexed. When searching the index I am using the standard QueryParser, and a typical query might look like "(title:graph AND title:axis) OR (text:graph AND text:axis)".

Because indexing includes synonym matching, I need the search to identify matched terms in the content, e.g. in the above "graph" and "chart" are synonyms, and so are "axis" and "axes". So my search method executes the query to get a set of matching documents, and uses the highlighter methods to identify the matches in the content:

    private void doSearch( IndexReader reader, IndexSearcher searcher, Query query, int max,
                           FileWriter writer, FileWriter matchlist ) throws IOException {
        // hlPre = "\001"; hlPost = "\002";
        SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter( hlPre, hlPost );
        Highlighter highlighter = new Highlighter( htmlFormatter, new QueryScorer( query ));

        TopDocs results = searcher.search( query, max );
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = Math.toIntExact( results.totalHits.value );
        HashSet<String> matchedWords = new HashSet<String>();

        int start = 0;
        int end = Math.min( numTotalHits, max );
        for ( int i = start; i < end; i++ ) {
            Document doc = searcher.doc( hits[i].doc );
            String text = doc.get( "text" );
            try {
                TokenStream tokens = TokenSources.getTokenStream( "text", null, text, analyzer, -1 );
                TextFragment[] frag = highlighter.getBestTextFragments( tokens, text, true, 100 );
                for ( int j = 0; j < frag.length; j++ ) {
                    if (( frag[j] != null ) && ( frag[j].getScore() > 0 )) {
                        addMatchedTerms( matchedWords, frag[j].toString() );
                    }
                }
            } catch ( Exception e ) {
                // error handling not shown
            }
            writer.write( doc.get("id") + "\n" );
        }
        for ( String word : matchedWords ) {
            matchlist.write( word + "\n" );
        }
    }

There's more of course but that's the guts of it; I haven't shown the analyzer or the method which extracts the
delimited words from the fragment and adds them to the matchedWords hashset. In the simple example shown this works fine, and the matched words include graph and axis and any other synonyms found in the selected documents.

The problem occurs when I use the query to filter the search by category or by volume. I'm doing this by adding extra conditions to the query, e.g. "(category:note AND volume:extra) AND ((title:graph AND title:axis) OR (text:graph AND text:axis))". When I do this the search correctly returns only documents in the selected category/volume, but unfortunately the highlighter.getBestTextFragments() method marks all the occurrences of "note" and "extra" in the content too, which I don't want.

I can't see how to separate that part of the query out in the highlighter methods, and I wonder what best practice would be here. I'm probably being naive in using a single query for the whole job. Do I need to run a query for category/volume, and then a subquery on text and title, and just use the subquery in the highlighter? If that's the approach, is there a nice simple explanation somewhere you could point me to? I'm a simple user who has never done anything beyond using the simple QueryParser for everything.

cheers
T
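
P.S. For anyone wondering what a helper like addMatchedTerms might look like: the sketch below is a hypothetical illustration only, not the actual method from my project. It assumes the "\001" and "\002" delimiters used for hlPre and hlPost above, and the class name MatchedTerms is made up for the example. It simply collects whatever text sits between each pair of delimiters in a highlighted fragment.

```java
import java.util.HashSet;
import java.util.Set;

final class MatchedTerms {
    // Assumed delimiters, matching hlPre and hlPost in doSearch.
    static final String HL_PRE = "\001";
    static final String HL_POST = "\002";

    // Adds every term found between HL_PRE and HL_POST in the fragment to the set.
    static void addMatchedTerms(Set<String> matchedWords, String fragment) {
        int from = 0;
        while (true) {
            int open = fragment.indexOf(HL_PRE, from);
            if (open < 0) {
                break; // no more highlighted terms
            }
            int close = fragment.indexOf(HL_POST, open + 1);
            if (close < 0) {
                break; // unterminated marker: ignore the rest
            }
            String term = fragment.substring(open + 1, close).trim();
            if (!term.isEmpty()) {
                matchedWords.add(term);
            }
            from = close + 1;
        }
    }

    public static void main(String[] args) {
        Set<String> matched = new HashSet<>();
        addMatchedTerms(matched, "The \001chart\002 has two \001axes\002.");
        // prints true
        System.out.println(matched.contains("chart") && matched.contains("axes"));
    }
}
```

Since the result is a set, a term that is highlighted many times across the selected documents ends up in the match list only once.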