Hi,

this is a general problem with using Analyzers in combination with QueryParser. The query is parsed *before* the terms are tokenized: QueryParser uses a JavaCC grammar to parse the query, and that involves some query-parser-specific tokenization. Once the query parser has analyzed the syntax, it sends each syntactic part through the analyzer on its own (unfortunately, for English text, those parts are just single whitespace-separated tokens). Your CharFilter therefore never sees the whole query string, only one piece at a time.
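To make the ordering concrete, here is a small plain-Java sketch of the effect, run on the three query strings from the quoted message below. It does not use Lucene at all: `analyze` and `parseThenAnalyze` are hypothetical stand-ins, the whitespace split is a simplification of what the real JavaCC grammar does, and the regex split only roughly approximates the standard tokenizer for ASCII input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java simulation of "parse first, analyze second"; no Lucene classes are used.
class ParseThenAnalyzeSketch {

    // Same pattern as the PatternReplaceCharFilter in the quoted message.
    static final Pattern FILM = Pattern.compile("(movie|film|picture).*");

    // Stand-in for the analyzer: apply the character-level replacement to the
    // whole input first, then split into alphanumeric tokens (assumption: this
    // is close enough to the standard tokenizer for these ASCII examples).
    static List<String> analyze(String text) {
        String filtered = FILM.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the query parser: split on whitespace first (a
    // simplification of the real grammar), then analyze each chunk separately.
    static List<String> parseThenAnalyze(String query) {
        List<String> terms = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            terms.addAll(analyze(chunk));
        }
        return terms;
    }

    public static void main(String[] args) {
        String[] inputs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};
        for (String s : inputs) {
            System.out.println("\"" + s + "\" | analyzer alone: " + analyze(s)
                    + " | parse, then analyze: " + parseThenAnalyze(s));
        }
    }
}
```

With the analyzer alone, the pattern eats everything from "film" to the end of the string. With the split done first, "fiction" arrives as its own chunk and survives, which is consistent with the `+name:avatar +name:fiction` queries reported in the original message.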
You have 2 possibilities:

- Move the pattern replacement into a TokenFilter. This is more likely to help with query parsing, where the initial tokenization is done by the parser. For your example a StopFilter would be a good fit (it removes tokens that appear in a given list).

- In many cases people use query parsing when it is not applicable. If your users only enter terms and you don't need any syntax, then query parsing is the wrong thing to do. What you need instead is a simplified analysis process that just creates a query out of the tokens emitted by the Analyzer. Lucene has the QueryBuilder class for that. QueryBuilder takes an Analyzer, and you can pass in a string that gets tokenized and converted into a query. You have the option to create simple term queries in a BooleanQuery or, alternatively, to parse them as a phrase. If you use this component, the whole analyzer is applied to the input string and the Analyzer's output is used to build the query, without any syntax.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Bahaa Eldesouky [mailto:bahaab...@gmail.com]
> Sent: Thursday, April 28, 2016 11:54 AM
> To: java-user@lucene.apache.org
> Subject: QueryParser with CustomAnalyzer wrongly uses PatternReplaceCharFilter
>
> I am using org.apache.lucene.queryparser.classic.QueryParser in lucene
> 6.0.0 to parse queries using a CustomAnalyzer as shown below:
>
> public static void testFilmAnalyzer() throws IOException, ParseException {
>     CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
>             .addCharFilter("patternreplace",
>                     "pattern", "(movie|film|picture).*",
>                     "replacement", "")
>             .withTokenizer("standard")
>             .build();
>
>     QueryParser qp = new QueryParser("name", nameAnalyzer);
>     qp.setDefaultOperator(QueryParser.Operator.AND);
>     String[] strs = {"avatar film fiction", "avatar-film fiction",
>             "avatar-film-fiction"};
>
>     for (String str : strs) {
>         System.out.println("Analyzing \"" + str + "\":");
>         showTokens(str, nameAnalyzer);
>         Query q = qp.parse(str);
>         System.out.println("Parsed query of \"" + str + "\":");
>         System.out.println(q + "\n");
>     }
> }
>
> private static void showTokens(String text, Analyzer analyzer) throws IOException {
>     StringReader reader = new StringReader(text);
>     TokenStream stream = analyzer.tokenStream("name", reader);
>     CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>     stream.reset();
>     while (stream.incrementToken()) {
>         System.out.print("[" + term.toString() + "]");
>     }
>     stream.close();
>     System.out.println();
> }
>
> I get the following output, when I invoke testFilmAnalyzer():
>
> Analyzing "avatar film fiction":
> [avatar]
> Parsed query of "avatar film fiction":
> +name:avatar +name:fiction
>
> Analyzing "avatar-film fiction":
> [avatar]
> Parsed query of "avatar-film fiction":
> +name:avatar +name:fiction
>
> Analyzing "avatar-film-fiction":
> [avatar]
> Parsed query of "avatar-film-fiction":
> name:avatar
>
> It seems like the analyzer uses the PatternReplaceCharFilter in its correct
> intended order (i.e. before tokenization), while the QueryParser does so
> afterwards. Does anyone have an explanation for that? Isn't that a bug?