Hi Cedric,

Although I have no idea how the Chinese language works, I went a different route to overcome language-specific problems: instead of a language-specific segmentation, we now use statistical segmentation with bigrams.

For example, given your sentence XYZABCDEF, suppose the segmentation is

    XY YZ ZA AB BC CD DE EF

A SpanNearQuery of (XY, BC, DE, EF) with a distance of 10 should then match this document.
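The query side of this is roughly the following (an untested sketch against Lucene's span query API; the field name "text" is just a placeholder):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class BigramSpanExample {
        public static SpanNearQuery buildQuery() {
            // The query string is segmented into the same overlapping
            // bigrams that were indexed: XY, BC, DE, EF.
            SpanQuery[] bigrams = new SpanQuery[] {
                new SpanTermQuery(new Term("text", "XY")),
                new SpanTermQuery(new Term("text", "BC")),
                new SpanTermQuery(new Term("text", "DE")),
                new SpanTermQuery(new Term("text", "EF"))
            };
            // slop = 10, inOrder = true: the bigrams must occur in order,
            // within a total distance of 10 positions.
            return new SpanNearQuery(bigrams, 10, true);
        }
    }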
I am not sure whether this works in your case, because we index product information and product descriptions, which are not language-friendly anyway because of the abbreviations.

Regards,
Uwe Goetzke

-----Original Message-----
From: Cedric Ho [mailto:[EMAIL PROTECTED]]
Sent: Saturday, November 10, 2007 02:28
To: java-user@lucene.apache.org
Subject: Re: Chinese Segmentation with Phrase Query

On Nov 10, 2007 2:08 AM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
> Hi Cedric,
>
> On 11/08/2007, Cedric Ho wrote:
> > a sentence containing the characters ABC may be segmented into either
> > AB, C or A, BC.
> [snip]
> > In these cases we would like to index both segmentations:
> >
> >   AB offset (0,1) position 0    A  offset (0,0) position 0
> >   C  offset (2,2) position 1    BC offset (1,2) position 1
> >
> > Now the problem is that when someone searches with a PhraseQuery (AC),
> > it will find this line ABC, because it matches A (position 0) and C
> > (position 1).
> >
> > Are there any ways to search for an exact match using the offset
> > information instead of the position information?
>
> Since you are writing the tokenizer (the Lucene term for the module that
> performs the segmentation), you yourself can substitute the beginning
> offset for the position. But I think that without the end offset, it
> won't get you what you want.
>
> For example, if your above example were indexed with beginning offsets as
> positions, a phrase query for "AB, C" would fail to match -- even though
> it should match -- because the segments' beginning offsets (0 and 2) are
> not contiguous.
>
> The new Payloads feature could provide the basis for storing the beginning
> and ending offsets required to determine contiguity when matching phrases,
> but you would have to write the matching and scoring for this
> representation, and that may not be the quickest route available to you.
>
> Solution #1: Create multiple fields, one for each full alternative
> segmentation, and then query against all of them.
>
> Solution #2: Store the alternative segmentations in the same field, but
> instead of interleaving the segments' positions, as in your example, make
> the position ranges of the alternatives non-contiguous. Recasting your
> example:
>
>   Alternative #1    Alternative #2     Alternative #3
>   --------------    --------------     --------------
>   AB position 0     A  position 100    A position 200
>   C  position 1     BC position 101    B position 201
>                                        C position 202
>
> There is a problem with both of the above-described solutions: in my
> limited experience with Chinese segmentation, substantially less than half
> of the text has alternative segmentations. As a result, the segments on
> which all of the alternatives agree (call them "uncontested segments")
> will have higher term frequencies than those segments which differ among
> the alternatives ("contested segments"). This means that document scores
> will be influenced by the variable density of the contested segments they
> contain.
>
> However, if you were to use my above-described Solution #1 along with a
> DisjunctionMaxQuery[1] as a wrapper around one query per alternative
> segmentation field, the term frequency problem would no longer be an
> issue. From the API doc for DisjunctionMaxQuery:
>
>   A query that generates the union of documents produced by its
>   subqueries, and that scores each document with the maximum score for
>   that document as produced by any subquery, plus a tie breaking
>   increment for any additional matching subqueries. This is useful when
>   searching for a word in multiple fields with different boost factors
>   (so that the fields cannot be combined equivalently into a single
>   search field). We want the primary score to be the one associated
>   with the highest boost, not the sum of the field scores (as
>   BooleanQuery would give).
>
> Unlike the use case mentioned above, where each field is boosted
> differently, you probably don't have any information about the relative
> probabilities of the alternative segmentations, so you'll want to use the
> same boost for each sub-query.
>
> Steve
>
> [1] <http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html>
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
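For concreteness, Steve's Solution #1 combined with a DisjunctionMaxQuery might look roughly like the untested sketch below, with one field per alternative segmentation (the field names "text_seg1" and "text_seg2" are hypothetical):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.DisjunctionMaxQuery;
    import org.apache.lucene.search.PhraseQuery;

    public class SegmentationDismaxExample {
        // Builds one PhraseQuery per segmentation field and wraps them in
        // a DisjunctionMaxQuery, so only the best-matching alternative
        // determines the score.
        public static DisjunctionMaxQuery buildQuery(String[] fields, String[] words) {
            DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.0f); // no tie-breaker
            for (int i = 0; i < fields.length; i++) {
                PhraseQuery phrase = new PhraseQuery();
                for (int j = 0; j < words.length; j++) {
                    phrase.add(new Term(fields[i], words[j]));
                }
                phrase.setBoost(1.0f); // same boost for every alternative
                dmq.add(phrase);
            }
            return dmq;
        }
    }

    // Usage, e.g.:
    //   buildQuery(new String[] {"text_seg1", "text_seg2"},
    //              new String[] {"A", "C"});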
Hi Steve,

We have actually thought about Solution #1, and in our case sorting by score is not a very important factor either. However, this would double the index size. A full index of our documents already takes > 80G, and it is expected to grow much faster in the near future. As you mentioned, ambiguities in segmentation are rare indeed, so we are not very willing to double the index size just for that.

Solution #2 is a good solution indeed, but our problem actually runs deeper. We are currently trying to replace our old search engine with Lucene, which means that whatever features the old engine has, we need to simulate in Lucene. One such feature is similar to Lucene's SpanNearQuery, which unfortunately doesn't work with Solution #2 if the search terms include the term BC from Alternative #2. For example, given the sentence XYZABCDEF, suppose the segmentation is

    XY  position 0
    Z   position 1      Alternative #2:
    AB  position 2      A  position 102
    C   position 3      BC position 103
    DEF position 4

A SpanNearQuery of (XY, BC, DEF) with a distance of 10 would fail to match this document.

So it seems we may have to go the more difficult route and explore the Payloads feature you mentioned. I saw it announced with the 2.2.0 release, but the API says "experimental" and there isn't any example of how it can be used.

But thanks very much for your good suggestions.

Cheers,
Cedric
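Since there is no example of the Payloads feature, the index-time half might look roughly like the untested sketch below, written against the 2.2-era analysis API (the class name OffsetPayloadFilter is made up): a TokenFilter that stores each token's start and end offsets as an 8-byte payload. As Steve pointed out, the query-time matching and scoring against these payloads would still have to be written.

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;

    public class OffsetPayloadFilter extends TokenFilter {
        public OffsetPayloadFilter(TokenStream input) {
            super(input);
        }

        // Passes each token through unchanged, attaching its start and end
        // offsets as a payload (four big-endian bytes each).
        public Token next() throws IOException {
            Token token = input.next();
            if (token == null) {
                return null;
            }
            byte[] data = new byte[8];
            writeInt(data, 0, token.startOffset());
            writeInt(data, 4, token.endOffset());
            token.setPayload(new Payload(data));
            return token;
        }

        private static void writeInt(byte[] buf, int pos, int value) {
            buf[pos]     = (byte) (value >>> 24);
            buf[pos + 1] = (byte) (value >>> 16);
            buf[pos + 2] = (byte) (value >>> 8);
            buf[pos + 3] = (byte) value;
        }
    }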