Re: Add flag to CJKBigramFilter to also output unigrams (Single character Han queries)

Tom Burton-West Fri, 03 Aug 2012 15:59:38 -0700

Thanks Robert,

Opened: LUCENE-4286 <https://issues.apache.org/jira/browse/LUCENE-4286>


Tom

On Fri, Aug 3, 2012 at 6:22 PM, Robert Muir <rcm...@gmail.com> wrote:

> Tom, please open an issue for this.
>
> On Fri, Aug 3, 2012 at 6:19 PM, Tom Burton-West <tburt...@umich.edu> wrote:
> > Hello all,
> >
> > About 10% of our queries that contain Han characters are single-character
> > queries. It looks like the CJKBigramFilter only outputs single characters
> > when there are no adjacent bigrammable characters in the input. This means
> > we have to create a separate field to index Han unigrams in order to address
> > single-character queries, and then write application code to search that
> > separate field if we detect a single-character Han query. This is rather
> > kludgey. As an alternative approach to dealing with single-character Han
> > queries, would it be possible to add an optional flag to the
> > CJKBigramFilter to tell it to also output unigrams?
> >
> > That way, on indexing we could set the flag so that both unigrams and
> > bigrams would be indexed. On querying we would not set the flag, so that
> > the current logic, which outputs bigrams unless there is a single Han
> > character (in which case that gets output), would take care of queries
> > containing a single Han unigram.
> >
> > This is somewhat analogous to the flags in LUCENE-1370 for the
> > ShingleFilter.
> >
> > If this makes sense I'll open a JIRA issue.
> >
> > Tom Burton-West
>
>
>
> --
> lucidimagination.com
>
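
For illustration, the behaviour proposed in the quoted message can be sketched in plain Java without Lucene. This is a minimal sketch under stated assumptions, not the CJKBigramFilter implementation: the class HanBigramSketch, the method tokens, and the parameter outputUnigrams are hypothetical names, and the input is assumed to be a single run of Han characters (the three-character string 学生证 and the one-character query 学, written as Unicode escapes so the file compiles under any source encoding).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed flag; it does not use or mirror Lucene's API.
final class HanBigramSketch {

    // Assumes `han` is one uninterrupted run of Han characters.
    static List<String> tokens(String han, boolean outputUnigrams) {
        int[] cps = han.codePoints().toArray();
        List<String> out = new ArrayList<>();

        // Behaviour described in the thread: a lone character is
        // emitted as a unigram regardless of the flag.
        if (cps.length == 1) {
            out.add(new String(cps, 0, 1));
            return out;
        }

        for (int i = 0; i < cps.length; i++) {
            if (outputUnigrams) {
                out.add(new String(cps, i, 1)); // proposed extra unigram
            }
            if (i + 1 < cps.length) {
                out.add(new String(cps, i, 2)); // overlapping bigram
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "\u5B66\u751F\u8BC1";   // three Han characters
        String query = "\u5B66";             // single-character query

        List<String> indexedWithFlag = tokens(doc, true);
        List<String> indexedWithoutFlag = tokens(doc, false);
        List<String> queryTokens = tokens(query, false);

        // The single-character query only matches when unigrams were indexed.
        System.out.println(indexedWithFlag.containsAll(queryTokens));    // true
        System.out.println(indexedWithoutFlag.containsAll(queryTokens)); // false
    }
}
```

With the flag set at index time, the three-character input yields five tokens (each unigram plus the two overlapping bigrams), so the single-character query finds a match in the same field. Without the flag only the two bigrams are indexed, which is why a separate unigram field is needed today.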
