Re: Lucene and Chinese language

Robert Muir Thu, 01 Jul 2010 04:52:33 -0700

You can write your own analyzer, or wrap your existing one at query time,
something like this:


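// Use the wrapper only for parsing queries; keep indexing with the plain analyzer.
QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname",
    new PositionHackAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30)));

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.position.PositionFilter; // from contrib/analyzers

public class PositionHackAnalyzerWrapper extends Analyzer {
  private final Analyzer wrapped;

  public PositionHackAnalyzerWrapper(Analyzer wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Analyze as usual, then put every token after the first at the same position.
    TokenStream ts = wrapped.tokenStream(fieldName, reader);
    return new PositionFilter(ts);
  }
}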
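
For what it's worth, here is a rough, Lucene-free model of why the wrapper changes the kind of query you get. It is only a sketch: QueryShapeSketch, positionFilter and queryShape are made-up names, the decision logic is a simplified paraphrase of how the 3.0 queryparser appears to choose a query type from token position increments, and it assumes the analyzer emits one token per Chinese character with a position increment of 1.

```java
/**
 * Simplified model (NOT Lucene code) of how the query type depends on
 * the position increments of the tokens produced for one query term.
 * All names here are hypothetical and exist only for illustration.
 */
class QueryShapeSketch {

  /** Models PositionFilter's default: every token after the first gets increment 0. */
  static int[] positionFilter(int[] increments) {
    int[] out = increments.clone();
    for (int i = 1; i < out.length; i++) {
      out[i] = 0;
    }
    return out;
  }

  /** Returns the name of the query type the parser would build (assumed logic). */
  static String queryShape(int[] increments) {
    if (increments.length == 0) {
      return "none";
    }
    if (increments.length == 1) {
      return "TermQuery";
    }
    int positionCount = 0;
    boolean severalTokensAtSamePosition = false;
    for (int inc : increments) {
      if (inc != 0) {
        positionCount += inc;
      } else {
        severalTokensAtSamePosition = true;
      }
    }
    if (!severalTokensAtSamePosition) {
      return "PhraseQuery"; // tokens at consecutive positions, no whitespace needed
    }
    // All tokens stacked on one position are treated like synonyms (OR'ed together).
    return positionCount == 1 ? "BooleanQuery(SHOULD)" : "MultiPhraseQuery";
  }

  public static void main(String[] args) {
    int[] sixCharacters = {1, 1, 1, 1, 1, 1}; // e.g. a six-character query, one token each
    System.out.println(queryShape(sixCharacters));                 // prints PhraseQuery
    System.out.println(queryShape(positionFilter(sixCharacters))); // prints BooleanQuery(SHOULD)
  }
}
```

In this model the unfiltered tokens end up as a phrase query, while the filtered ones all share a position and are OR'ed together, which is the "synonyms" effect described in the quoted message below.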

2010/7/1 Kolhoff, Jacqueline - ENCOWAY <kolh...@encoway.de>

> How can I add this PositionFilter? I can't see anything in the API. I use
> lucene version 3.0.1, this is my query parser:
>
> QueryParser queryParser = new QueryParser(Version.LUCENE_30, "myfieldname"
> , new StandardAnalyzer(Version.LUCENE_30));
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Thursday, 1 July 2010 12:34
> To: java-user@lucene.apache.org
> Subject: Re: Lucene and Chinese language
>
> This is a bug in the queryparser. (
> https://issues.apache.org/jira/browse/LUCENE-2458)
>
> The problem has nothing to do with your choice of analyzer; it has to do
> with how the query is formed.
>
> Currently the queryparser uses a convoluted algorithm involving whitespace
> (and not just the double quote operator as you would expect) to form phrase
> queries. So a query like yours, which contains no whitespace, always
> becomes a phrase query.
>
> The only workaround for reasonably good results consists of two steps:
> 1. At query time (only!) add a
> org.apache.lucene.analysis.position.PositionFilter (from contrib/analyzers)
> to your analyzer. Don't do this at index time, just at query time!
> 2. This makes all terms in the query "synonyms" of each other, which
> bypasses the problem but screws up scoring, so you might also want to
> extend QueryParser in a custom way:
>
> @Override
> protected BooleanQuery newBooleanQuery(boolean disableCoord) {
>   // Intentionally ignore disableCoord: the PositionFilter hack makes the
>   // parser ask for coord() to be disabled, so always keep it enabled.
>   return new BooleanQuery(false);
> }
>
> 2010/7/1 Kolhoff, Jacqueline - ENCOWAY <kolh...@encoway.de>
>
> >
> > Hi!
> >
> > We are using lucene in our project to search through information objects
> > which works fine. For indexing we use the StandardAnalyzer.
> > Now, we have to support the Chinese language. I found out that the Chinese
> > words and letters are correctly saved in the index but the query to search
> > for them does not work. Example: in English language the query is “text”
> > which we parse to “*text*”. If we search for Chinese words / phrases like
> > “佛山东方书城” the query is “*佛山东方书城*” but there are no search results. If the
> > query places blanks between the single letters / symbols like this
> > “*佛 山 东 方 书 城*” we are getting results. Does the StandardAnalyzer interpret
> > each Chinese letter as one word? What are best practices for this case? Shall
> > we use another analyzer (Chinese analyzer)? Or is it better to replace the
> > query parser in this case?
> >
> > Regards,
> > Jacqueline.
> >
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com
