Sorry in advance for writing a small novel.
Background: I am indexing and searching technical reference documents, so the standard language analyzers aren't appropriate. For example, the content needs to be indexed so that a search for total matches total value, total[value], and total(value), but a search for total[ only matches the second of these.

As a first step I wrote a custom analyzer which uses PatternCaptureGroupTokenFilter to split the token stream into word character sequences (total, value) and single non-word characters: (, ), [, ].

    class TechAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldname) {
            WhitespaceTokenizer src = new WhitespaceTokenizer();
            TokenStream result = new LowerCaseFilter(src);
            Pattern alphanum = Pattern.compile("(\\w+)");
            Pattern nonalpha = Pattern.compile("(\\W)");
            result = new PatternCaptureGroupTokenFilter(result, false, alphanum, nonalpha);
            return new TokenStreamComponents(src, result);
        }
    }

I have tested this analyzer on diverse input files to verify that:

    "total value" produces 2 tokens: total, value
    "total(value)" produces 4 tokens: total, (, value, )
    "total[value]" also produces 4 tokens: total, [, value, ]

So this is the analyzer used to build the index:

    ..
    TechAnalyzer analyzer = new TechAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    ..

and it does build the index. I can use Luke to inspect the terms in the index and see that (, ), [, ], total and value are all present as separate terms. So as far as I can tell, indexing is happening as required.

Now for searching, which is where it goes wrong. If I search for the single word total, everything is fine. The problem is when I search for "total[".

    String queryStr = "total[";
    Query q = new QueryParser("text", new TechAnalyzer()).parse(queryStr);
    ..

This matches far too many documents because the query is being treated as a synonym which matches either total or [.
To confirm this, if I output q.toString() I see "Synonym([ total)". If instead I modify the input query so that it searches for a phrase ("total [") then it appears to be looking for consecutive terms (q.toString() is "total [") and it comes back with four results. All four matched documents do indeed have "total[" in them; the trouble is that there are sixteen other documents that should match as well, and it is not obvious to me why they aren't being selected.

Using Luke again to find the two tokens total and [ in the documents, I see the following for the first match:

    For "total"
    Position  Offsets  Payload
    19
    56
    56

    For "["
    Position  Offsets  Payload
    20
    56
    57

The actual string "total[" is in the document twice, if I inspect it myself.

For a document which Lucene does not match, but which it should, I can see in Luke:

    "total"
    Position  Offsets  Payload
    78
    80

    "["
    Position  Offsets  Payload
    78
    80

Again, if I inspect this document by hand, it contains "total[" twice.

I don't know if the empty offsets and payloads indicate a problem, and I don't know if the duplicated positions in the first example are a problem either (although that document is selected correctly!). What I do know is that there must be something wrong somewhere, because despite sending the correct token stream to the indexer, querying the data is performing worse than dumping the documents as text and grepping them.

Any pointers gratefully received.

cheers
T
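
For what it's worth, the Luke numbers quoted above can be checked against a very simplified model of exact-phrase matching. This is only a sketch in plain Java, not Lucene code: the class and method names are made up for illustration, and it assumes the numbers Luke shows are term positions and that a phrase with no slop needs its second term exactly one position after its first. Under those assumptions the matched document has total and [ at adjacent positions, while the unmatched one has both terms at the same positions, which may be worth comparing with the position increments the analyzer emits.

```java
import java.util.List;

/**
 * Simplified model of exact-phrase matching over term positions.
 * Hypothetical helper for illustration only; it does not call Lucene.
 */
final class PhrasePositionSketch {

    /** Returns true if some position p of the first term has the second term at p + 1. */
    static boolean hasAdjacent(List<Integer> first, List<Integer> second) {
        for (int p : first) {
            if (second.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Values copied from the Luke output in the post (assumed to be positions).
        List<Integer> matchedTotal = List.of(19, 56, 56);
        List<Integer> matchedBracket = List.of(20, 56, 57);
        List<Integer> missedTotal = List.of(78, 80);
        List<Integer> missedBracket = List.of(78, 80);

        // The document that the phrase query found: total at 19, [ at 20.
        System.out.println(hasAdjacent(matchedTotal, matchedBracket)); // prints "true"

        // The document that was missed: both terms sit at 78 and 80.
        System.out.println(hasAdjacent(missedTotal, missedBracket));   // prints "false"
    }
}
```

If the model is roughly right, the sixteen missing documents would be the ones where total and [ came out of the same whitespace-delimited token and ended up sharing a position.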