RE: KeywordAnalyzer still getting tokenized on spaces

Milind Tue, 09 Sep 2014 05:25:36 -0700

I simplified the program to show this.  I actually use a multiterm query
parser and a join query across 2 Lucene Indexes. It's already complicated.


I can understand the logic of parsing the query first (I need that in fact
because I'm using different analyzers for different fields), but I don't
understand why it would need to parse within the field before using the
analyzer.
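
To make the ordering concrete, here is a toy sketch in plain Java of the behaviour described in the replies quoted below. It is not Lucene code: `keywordAnalyze`, `parseThenAnalyze` and `analyzeOnly` are made-up stand-ins, and the whitespace split only approximates what the real query parser's syntax pass does.

```java
import java.util.ArrayList;
import java.util.List;

class ParseThenAnalyzeSketch {

    // Hypothetical stand-in for KeywordAnalyzer: the whole input is one token.
    static List<String> keywordAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }

    // Hypothetical stand-in for a query parser: split on whitespace first,
    // then hand each chunk to the analyzer separately.
    static List<String> parseThenAnalyze(String queryText) {
        List<String> terms = new ArrayList<>();
        for (String chunk : queryText.trim().split("\\s+")) {
            terms.addAll(keywordAnalyze(chunk));
        }
        return terms;
    }

    // Hypothetical stand-in for the syntax-free route: no parsing step,
    // the whole string goes through the analyzer once.
    static List<String> analyzeOnly(String queryText) {
        return keywordAnalyze(queryText);
    }

    public static void main(String[] args) {
        String q = "1023 4567 8765";
        System.out.println(parseThenAnalyze(q)); // [1023, 4567, 8765]
        System.out.println(analyzeOnly(q));      // [1023 4567 8765]
    }
}
```

The analyzer never sees the spaces in the first path, so even a keyword-style analyzer cannot keep the value together.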

I don't know if I can use my own query builder because there are multiple
fields and I'd have to now understand the query syntax to identify the
fields and the search texts in the field - essentially doing everything
that Lucene parsers are doing.

Thanks for your input. Let me see how I can work around this. Thanks. I may
be back with a lot more questions.
On Sep 9, 2014 3:52 AM, "Uwe Schindler" <u...@thetaphi.de> wrote:

> Hi,
>
> the QueryParser does not analyze the whole query text with the analyzer.
> It first parses the query syntax and then only passes those parts through
> the analyzer, which are considered as "tokens" by the query parser. If you
> want such an analyzer to be respected by the query parser, you may need
> another one with a simplified syntax (e.g. SimpleQueryParser).
>
> Ideally, if you want to just pass a text through an analyzer, you should
> not use a query parser (because there is nothing to parse, just to
> analyze). So approach #2 is the right one. To make it easier, Lucene
> contains the following class:
>
>
> http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/util/QueryBuilder.html
>
> This one uses no syntax and just passes the string through the Analyzer to
> create the query:
>
> So solution #2 looks like:
>
> Query currQuery = new QueryBuilder(theAnalyzer)
>     .createBooleanQuery("sn", currQueryStr, BooleanClause.Occur.MUST);
>
> In your case this would return a Boolean query with one clause, but that
> gets rewritten by the query execution, so it's identical to a single term
> query. This approach is like Elasticsearch's "matchQuery" and is in most
> cases the approach you should use, if you don't need "syntax".
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: atawfik [mailto:contact.txl...@gmail.com]
> > Sent: Tuesday, September 09, 2014 9:37 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: KeywordAnalyzer still getting tokenized on spaces
> >
> > The result of QueryParser is confusing. The problem is that you assume the
> > query parser uses the analyzer to parse your query. However, that is not the
> > case. The query parser first parses the query string, then applies the
> > analyzer.
> >
> > In other words, the query parser will split the query string using spaces.
> > So, you will get three terms: 1023, 4567 and 8765. In fact, you can see that in
> > the output of the second query; you have three boolean clauses instead of
> > one. After parsing the query, the query parser applies the analyzer.
> >
> > To fix that, you have two solutions:
> >
> > 1- Use a term query directly, without the query parser. In this case,
> > you will not apply the analyzer.
> >      Query currQuery = new TermQuery(new Term("sn", currQueryStr));
> > 2- Analyze the query, then create the term query:
> >      TokenStream ts = theAnalyzer.tokenStream("sn", new StringReader(currQueryStr));
> >      ts.reset();
> >      ts.incrementToken();  // advance to the first token
> >      CharTermAttribute ca = ts.getAttribute(CharTermAttribute.class);
> >      String query = ca.toString();
> >      ts.end();             // finish the stream before closing it
> >      ts.close();
> >      Query currQuery = new TermQuery(new Term("sn", query));
> >      System.out.println(currQuery.getClass() + ", " + currQuery);
> >
> > I am not aware of any method that uses QueryParser to achieve that. Maybe
> > someone here can correct me.
> >
> > Regards
> > Ameer
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/KeywordAnalyzer-still-getting-tokenized-on-spaces-tp4157537p4157560.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
