Unfortunately it looks like my mailer has decided to monkey with the patterns, sorry about that. They should read:

    Pattern alphanum = Pattern.compile("(\\w+)");  // matches one or more 'word' characters
    Pattern nonalpha = Pattern.compile("(\\W)");   // matches any single 'non-word' character
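In case it helps, the patterns can be sanity-checked outside Lucene with plain java.util.regex; this throwaway snippet (the class name is just for illustration) prints what the two expressions capture from "total[value]":

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PatternCheck {
        public static void main(String[] args) {
            // The same two patterns the analyzer uses.
            Pattern alphanum = Pattern.compile("(\\w+)");  // runs of word characters
            Pattern nonalpha = Pattern.compile("(\\W)");   // single non-word characters
            String input = "total[value]";
            for (Pattern p : new Pattern[] { alphanum, nonalpha }) {
                Matcher m = p.matcher(input);
                while (m.find()) {
                    System.out.println(p.pattern() + " -> " + m.group(1));
                }
            }
        }
    }

Run against "total[value]" it prints total and value for the first pattern, and [ and ] for the second, which is the split the analyzer is meant to produce.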
I forgot to include in my already too long message that I haven't been able to add my custom analyzer to Luke to test the search side; I can add my jar file, and Luke says "custom analyzer built", but it doesn't offer my analyzer as an option for use in parsing the query string.

cheers
T

-----Original Message-----
From: Trevor Nicholls <tre...@castingthevoid.com>
Sent: Tuesday, 22 June 2021 08:10
To: java-user@lucene.apache.org
Subject: Bewildered by my search results, can anyone explain where I might be going wrong?

Sorry in advance for writing a small novel.

Background: I am indexing and searching technical reference documents, so the standard language analyzers aren't appropriate. For example, the content needs to be indexed so that a search for total matches total value, total[value], and total(value), but a search for total[ only matches the second of these.

As a first step I wrote a custom analyzer which uses PatternCaptureGroupTokenFilter to split the token stream into word-character sequences (total, value) and single non-word characters: (, ), [, ].

    class TechAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            WhitespaceTokenizer src = new WhitespaceTokenizer();
            TokenStream result = new LowerCaseFilter(src);
            Pattern alphanum = Pattern.compile("(\\w+)");
            Pattern nonalpha = Pattern.compile("(\\W)");
            result = new PatternCaptureGroupTokenFilter(result, false, alphanum, nonalpha);
            return new TokenStreamComponents(src, result);
        }
    }

I have tested this analyzer on diverse input files to verify that:

    "total value"  produces 2 tokens: total, value
    "total(value)" produces 4 tokens: total, (, value, )
    "total[value]" also produces 4 tokens: total, [, value, ]

So this is the analyzer used to build the index:

    ..
    TechAnalyzer analyzer = new TechAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    ..

and it surely does build it. I can use Luke to inspect the terms in the index and see that (, ), [, ], total and value are all present as separate terms. So as far as I can tell, the indexing is happening as per requirement.

Now for searching, which is where it is going wrong. If I want to search for the single word total, everything is fine. The problem is if I want to search for "total[".

    String queryStr = "total[";
    Query q = new QueryParser("text", new TechAnalyzer()).parse(queryStr);
    ..

This matches far too many documents, because the query is being treated as a synonym which matches either total or [. To confirm this, if I output q.toString() I see "Synonym([ total)".

If instead I modify the input query so that it searches for a phrase ("total [") then it appears to be looking for consecutive terms (q.toString() is "total [") and it comes back with four results. All four matched documents do indeed have "total[" in them; the trouble is that there are sixteen other documents that should match as well, and it is not obvious to me why they aren't being selected.

Using Luke again to find the two tokens total and [ in the documents, I see the following for the first match:

    For "total":
        Position  Offsets  Payload
        19        56 56

    For "[":
        Position  Offsets  Payload
        20        56 57

The actual string "total[" is in the document twice, if I inspect it myself.

For a document which Lucene does not match, but which it should, I can see in Luke:

    For "total":
        Position  Offsets  Payload
        78 80

    For "[":
        Position  Offsets  Payload
        78 80

Again, if I inspect this document by hand, it contains "total[" twice.
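For reference, the same positions and offsets can also be read without Luke. This is a rough sketch of how I would dump the postings for one term programmatically (the index path and the "[" term are placeholders for my actual setup; startOffset()/endOffset() return -1 if offsets were not indexed with the field):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class DumpPostings {
        public static void main(String[] args) throws Exception {
            try (IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")))) {
                for (LeafReaderContext leaf : reader.leaves()) {
                    Terms terms = leaf.reader().terms("text");
                    if (terms == null) continue;
                    TermsEnum te = terms.iterator();
                    // Look up a single term, e.g. the "[" token.
                    if (!te.seekExact(new BytesRef("["))) continue;
                    PostingsEnum pe = te.postings(null, PostingsEnum.ALL);
                    int doc;
                    while ((doc = pe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                        // One line per occurrence: position plus offsets.
                        for (int i = 0; i < pe.freq(); i++) {
                            int pos = pe.nextPosition();
                            System.out.println("doc=" + (leaf.docBase + doc)
                                    + " pos=" + pos
                                    + " offsets=" + pe.startOffset() + "-" + pe.endOffset());
                        }
                    }
                }
            }
        }
    }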
I don't know if the empty offsets and payloads indicate a problem, and I don't know if the duplicated positions in the first example are a problem either (although that document is selected correctly!). What I do know is that there must be something wrong somewhere, because despite sending the correct token stream to the indexer, querying the data is performing worse than dumping the documents as text and grepping them.

Any pointers gratefully received.

cheers
T

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org