Unfortunately it looks like my mailer has decided to monkey with the patterns, sorry about that. They should read:

    Pattern alphanum = Pattern.compile("(\\w+)");  // matches one or more 'word' characters
    Pattern nonalpha = Pattern.compile("(\\W)");   // matches any single 'non-word' character
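In case it helps, the patterns can be sanity-checked outside Lucene with plain java.util.regex; this throwaway snippet (the class name is just for illustration) prints what the two expressions capture from "total[value]":

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PatternCheck {
        public static void main(String[] args) {
            // The same two patterns the analyzer uses.
            Pattern alphanum = Pattern.compile("(\\w+)");  // runs of word characters
            Pattern nonalpha = Pattern.compile("(\\W)");   // single non-word characters
            String input = "total[value]";
            for (Pattern p : new Pattern[] { alphanum, nonalpha }) {
                Matcher m = p.matcher(input);
                while (m.find()) {
                    System.out.println(p.pattern() + " -> " + m.group(1));
                }
            }
        }
    }

Run against "total[value]" it prints total and value for the first pattern, and [ and ] for the second, which is the split the analyzer is meant to produce.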
I forgot to include in my already too long message that I haven't been able to add my custom analyzer to Luke to test the search side; I can add my jar file, and Luke says "custom analyzer built", but it doesn't offer my analyzer as an option for use in parsing the query string.

cheers
T

-----Original Message-----
From: Trevor Nicholls <tre...@castingthevoid.com>
Sent: Tuesday, 22 June 2021 08:10
To: java-user@lucene.apache.org
Subject: Bewildered by my search results, can anyone explain where I might be going wrong?

Sorry in advance for writing a small novel.

Background: I am indexing and searching technical reference documents, so the standard language analyzers aren't appropriate. For example, the content needs to be indexed so that a search for total matches total value, total[value], and total(value), but a search for total[ only matches the second of these.

As a first step I wrote a custom analyzer which uses PatternCaptureGroupTokenFilter to split the token stream into word-character sequences (total, value) and single non-word characters: (, ), [, ].

    class TechAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            WhitespaceTokenizer src = new WhitespaceTokenizer();
            TokenStream result = new LowerCaseFilter(src);
            Pattern alphanum = Pattern.compile("(\\w+)");
            Pattern nonalpha = Pattern.compile("(\\W)");
            result = new PatternCaptureGroupTokenFilter(result, false, alphanum, nonalpha);
            return new TokenStreamComponents(src, result);
        }
    }

I have tested this analyzer on diverse input files to verify that:

    "total value"  produces 2 tokens: total, value
    "total(value)" produces 4 tokens: total, (, value, )
    "total[value]" also produces 4 tokens: total, [, value, ]

So this is the analyzer used to build the index:

    ..
    TechAnalyzer analyzer = new TechAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    ..

and it surely does build it. I can use Luke to inspect the terms in the index and see that (, ), [, ], total and value are all present as separate terms. So as far as I can tell, the indexing is happening as per requirement.

Now for searching, which is where it is going wrong. If I want to search for the single word total, everything is fine. The problem is if I want to search for "total[".

    String queryStr = "total[";
    Query q = new QueryParser("text", new TechAnalyzer()).parse(queryStr);
    ..

This matches far too many documents, because the query is being treated as a synonym which matches either total or [. To confirm this, if I output q.toString() I see "Synonym([ total)".

If instead I modify the input query so that it searches for a phrase ("total [") then it appears to be looking for consecutive terms (q.toString() is "total [") and it comes back with four results. All four matched documents do indeed have "total[" in them; the trouble is that there are sixteen other documents that should match as well, and it is not obvious to me why they aren't being selected.

Using Luke again to find the two tokens total and [ in the documents, I see the following for the first match:

    For "total":
        Position  Offsets  Payload
        19        56 56

    For "[":
        Position  Offsets  Payload
        20        56 57

The actual string "total[" is in the document twice, if I inspect it myself.

For a document which Lucene does not match, but which it should, I can see in Luke:

    For "total":
        Position  Offsets  Payload
        78 80

    For "[":
        Position  Offsets  Payload
        78 80

Again, if I inspect this document by hand, it contains "total[" twice.
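For reference, the same positions and offsets can also be read without Luke. This is a rough sketch of how I would dump the postings for one term programmatically (the index path and the "[" term are placeholders for my actual setup; startOffset()/endOffset() return -1 if offsets were not indexed with the field):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class DumpPostings {
        public static void main(String[] args) throws Exception {
            try (IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                for (LeafReaderContext leaf : reader.leaves()) {
                    Terms terms = leaf.reader().terms("text");
                    if (terms == null) continue;
                    TermsEnum te = terms.iterator();
                    // Look up a single term, e.g. the "[" token.
                    if (!te.seekExact(new BytesRef("["))) continue;
                    PostingsEnum pe = te.postings(null, PostingsEnum.ALL);
                    int doc;
                    while ((doc = pe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                        // One line per occurrence: position plus offsets.
                        for (int i = 0; i < pe.freq(); i++) {
                            int pos = pe.nextPosition();
                            System.out.println("doc=" + (leaf.docBase + doc)
                                    + " pos=" + pos
                                    + " offsets=" + pe.startOffset() + "-" + pe.endOffset());
                        }
                    }
                }
            }
        }
    }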
I don't know if the empty offsets and payloads indicate a problem, and I don't know if the duplicated positions in the first example are a problem either (although that document is selected correctly!). What I do know is that there must be something wrong somewhere, because despite sending the correct token stream to the indexer, querying the data is performing worse than dumping the documents as text and grepping them.

Any pointers gratefully received.

cheers
T

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org