Hi,

this is a general problem when combining Analyzers with QueryParser.
Query parsing is done *before* the text is analyzed: QueryParser uses a
JavaCC grammar to parse the query, and this involves some tokenization of
its own (it splits on whitespace). Once the query parser has analyzed the
syntax, it sends each syntactic part through the analyzer separately
(unfortunately, for English text, each part is a single token only), so
your CharFilter never sees the whole query string.
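
To make the ordering concrete, here is a minimal plain-Java sketch. It is
not Lucene code: analyze() and parseThenAnalyze() are hypothetical stand-ins
that only imitate "char filter, then tokenize" and "split on whitespace,
then analyze each part". The regex is the one from your CustomAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class AnalysisOrderSketch {
    // Same pattern as the PatternReplaceCharFilter in the quoted question.
    private static final Pattern FILTER =
            Pattern.compile("(movie|film|picture).*");

    // Stand-in for the analyzer (assumption: a rough imitation only):
    // apply the char filter first, then split on non-alphanumeric characters.
    public static List<String> analyze(String text) {
        String filtered = FILTER.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for QueryParser: split on whitespace first,
    // then run each part through the analyzer on its own.
    public static List<String> parseThenAnalyze(String query) {
        List<String> tokens = new ArrayList<>();
        for (String part : query.trim().split("\\s+")) {
            tokens.addAll(analyze(part));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "avatar film fiction";
        System.out.println(analyze(input));          // prints [avatar]
        System.out.println(parseThenAnalyze(input)); // prints [avatar, fiction]
    }
}
```

In the second call the char filter only ever sees "avatar", "film" and
"fiction" one at a time, so "fiction" survives. That matches the
"+name:avatar +name:fiction" you observed.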

You have 2 possibilities:

- Move the pattern replacement into a TokenFilter. This is more likely to help
with query parsing, where the initial tokenization is done by the parser. For
your example, a StopFilter would be a good fit (it removes the tokens that
appear in a given list).
- In many cases people use query parsing when it is not applicable. If your
users only enter terms and you don't need any query syntax, then query parsing
is the wrong tool. What you need instead is a simplified analysis process that
just creates a query out of the tokens emitted by the Analyzer. Lucene has the
QueryBuilder class for that. QueryBuilder takes an Analyzer, and you can pass
in a string that gets tokenized and converted into a query. You have the option
to create simple term queries in a BooleanQuery or, alternatively, to build a
phrase query from them. If you use this component, the whole analyzer is
applied to the input string and the Analyzer's output is used to build the
query, without any syntax.
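
A minimal sketch of the second option, assuming Lucene 6.x with lucene-core
and lucene-analyzers-common on the classpath. The class name
QueryBuilderSketch and the helper buildQuery() are hypothetical; the analyzer
is the one from your mail, unchanged.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.QueryBuilder;

public class QueryBuilderSketch {

    // Hypothetical helper: runs the whole input string through the analyzer
    // and builds a query from the emitted tokens, with no query syntax.
    public static Query buildQuery(String text) throws IOException {
        Analyzer nameAnalyzer = CustomAnalyzer.builder()
                .addCharFilter("patternreplace",
                        "pattern", "(movie|film|picture).*",
                        "replacement", "")
                .withTokenizer("standard")
                .build();

        QueryBuilder builder = new QueryBuilder(nameAnalyzer);

        // Occur.MUST corresponds to the AND default operator in the
        // original code. The result may be null if the analyzer emits
        // no tokens at all, so callers should check for that.
        return builder.createBooleanQuery("name", text,
                BooleanClause.Occur.MUST);
    }

    public static void main(String[] args) throws IOException {
        String[] inputs = {"avatar film fiction", "avatar-film fiction",
                "avatar-film-fiction"};
        for (String input : inputs) {
            System.out.println(input + " -> " + buildQuery(input));
        }
    }
}
```

Because the char filter now sees the complete input, the query should be
built only from the tokens that showTokens() printed for the same string.
For a phrase instead of separate terms, createPhraseQuery("name", text) is
the corresponding call.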

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Bahaa Eldesouky [mailto:bahaab...@gmail.com]
> Sent: Thursday, April 28, 2016 11:54 AM
> To: java-user@lucene.apache.org
> Subject: QueryParser with CustomAnalyzer wrongly uses
> PatternReplaceCharFilter
> 
>  I am using org.apache.lucene.queryparser.classic.QueryParser in lucene
> 6.0.0 to parse queries using a CustomAnalyzer as shown below:
> 
> public static void testFilmAnalyzer() throws IOException, ParseException {
>     CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
>             .addCharFilter("patternreplace",
>                     "pattern", "(movie|film|picture).*",
>                     "replacement", "")
>             .withTokenizer("standard")
>             .build();
> 
>     QueryParser qp = new QueryParser("name", nameAnalyzer);
>     qp.setDefaultOperator(QueryParser.Operator.AND);
>     String[] strs = {"avatar film fiction", "avatar-film fiction",
> "avatar-film-fiction"};
> 
>     for (String str : strs) {
>         System.out.println("Analyzing \"" + str + "\":");
>         showTokens(str, nameAnalyzer);
>         Query q = qp.parse(str);
>         System.out.println("Parsed query of \"" + str + "\":");
>         System.out.println(q + "\n");
>     }
> }
> 
> private static void showTokens(String text, Analyzer analyzer)
>         throws IOException {
>     StringReader reader = new StringReader(text);
>     TokenStream stream = analyzer.tokenStream("name", reader);
>     CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
>     stream.reset();
>     while (stream.incrementToken()) {
>         System.out.print("[" + term.toString() + "]");
>     }
>     stream.close();
>     System.out.println();
> }
> 
> 
> 
> 
> I get the following output, when I invoke testFilmAnalyzer():
> 
> Analyzing "avatar film fiction":
> [avatar]
> Parsed query of "avatar film fiction":
> +name:avatar +name:fiction
> 
> Analyzing "avatar-film fiction":
> [avatar]
> Parsed query of "avatar-film fiction":
> +name:avatar +name:fiction
> 
> Analyzing "avatar-film-fiction":
> [avatar]
> Parsed query of "avatar-film-fiction":
> name:avatar
> 
> 
> It seems like the analyzer uses the PatternReplaceCharFilter in its correct
> intended order (i.e. before tokenization), while the QueryParser does so
> afterwards. Does anyone have an explanation for that? Isn't that a bug?

