Hello,

Maybe I'm missing the point, but can you highlight with a different query than the one you search with?
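A minimal sketch of what I mean, assuming the filter part and the content part of the query text are available separately before parsing. The class SplitQuery and its two methods are hypothetical names for illustration, not Lucene API; the sketch only shows which query text goes where.

```java
// Hypothetical helper (not part of Lucene): keeps the filter and content
// parts of the query text apart so each is used where it belongs.
final class SplitQuery {

    // Query text used for searching: filter AND content.
    static String searchQuery(String filter, String content) {
        return "(" + filter + ") AND (" + content + ")";
    }

    // Query text used for highlighting: the content part only.
    static String highlightQuery(String filter, String content) {
        return content;
    }

    public static void main(String[] args) {
        String filter = "category:note AND volume:extra";
        String content = "(title:graph AND title:axis) OR (text:graph AND text:axis)";

        // Parse this text with the QueryParser and pass the result to searcher.search(...).
        System.out.println(searchQuery(filter, content));

        // Parse this text with the same QueryParser and pass the result to new QueryScorer(...).
        System.out.println(highlightQuery(filter, content));
    }
}
```

In doSearch() that would mean taking two Query arguments, one for searcher.search() and one for new QueryScorer(). It may also be worth checking the QueryScorer constructor that takes a field name in addition to the query, which should limit highlighting to terms queried against that field.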
On Mon, Feb 20, 2023 at 5:07 PM Trevor Nicholls <tre...@castingthevoid.com> wrote:

> Sorry, I apologize for this being a bit long and for explaining the problem
> at the very bottom after all the background, rather than starting with it at
> the top. I thought it was easier to explain like this, please bear with me!
>
> So I've indexed a library of technical documentation, and the index has
> stored several fields per document: category, volume, title, text, etc.
> Title and text are tokenised and stored, all other fields are just indexed.
>
> When searching the index I am using the standard queryparser, and a typical
> query might look like
>
> "(title:graph AND title:axis) OR (text:graph AND text:axis)"
>
> Because indexing includes synonym matching, I need the search to identify
> matched terms in the content, e.g. in the above "graph" and "chart" are
> synonyms, and "axis" and "axes" are as well.
>
> So my search method executes the query to get a set of matching documents,
> and uses the highlighter methods to identify the matches in the content:
>
> private void doSearch( IndexReader reader, IndexSearcher searcher, Query query,
>                        int max, FileWriter writer, FileWriter matchlist ) {
>
>   SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter( hlPre, hlPost ); // hlPre="\001"; hlPost="\002";
>   Highlighter highlighter = new Highlighter( htmlFormatter, new QueryScorer( query ));
>
>   TopDocs results = searcher.search( query, max );
>   ScoreDoc[] hits = results.scoreDocs;
>   int numTotalHits = Math.toIntExact( results.totalHits.value );
>
>   HashSet<String> matchedWords = new HashSet<String>();
>   int start = 0;
>   int end = Math.min( numTotalHits, max );
>
>   for ( int i = start; i < end; i++ ) {
>     Document doc = searcher.doc( hits[i].doc );
>     String text = doc.get( "text" );
>     try {
>       TokenStream tokens = TokenSources.getTokenStream( "text", null, text, analyzer, -1 );
>       TextFragment[] frag = highlighter.getBestTextFragments( tokens, text, true, 100 );
>       for ( int j = 0; j < frag.length; j++ ) {
>         if (( frag[j] != null ) && ( frag[j].getScore() > 0 )) {
>           addMatchedTerms( matchedWords, frag[j].toString() );
>         }
>       }
>     } catch ... {
>     }
>     writer.write( doc.get("id") + "\n" );
>   }
>   for ( String word : matchedWords ) {
>     matchlist.write( word.toString() + "\n" );
>   }
> }
>
> There's more of course but that's the guts of it; I haven't shown the
> analyzer or the method which extracts the delimited words from the fragment
> and adds them to the matchedWords hashset.
>
> In the simple example shown this works fine, and the matched words include
> graph and axis and any other synonyms found in the selected documents.
>
> The problem occurs when I use the query to filter the search by category or
> by volume. I'm doing this by adding extra conditions to the query, e.g.
>
> "(category:note AND volume:extra) AND ((title:graph AND title:axis) OR
> (text:graph AND text:axis))"
>
> When we do this the search correctly returns only documents in the selected
> category/volume, but unfortunately the highlighter.getBestTextFragments()
> method marks all the occurrences of "note" and "extra" in the content too.
> This we don't want.
>
> I can't see how to separate that part of the query out in the highlighter
> methods, and I wonder what best practice would be here. I'm probably being
> naive in using a single query for the whole job. Do I need to run a query
> for category/volume, and then a subquery on text and title, and just use the
> subquery in the highlighter? If that's the approach, is there a nice simple
> explanation somewhere you could point me to? Because I'm a simple user who
> has never done anything beyond using the simple QueryParser for everything.
>
> cheers
> T

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!