about bigram based word segment

Che Dong Thu, 12 Sep 2002 19:45:46 -0700

> I don't know any Asian languages, but from earlier experiments I
> remember that bigram tokenization could sometimes hurt matching, e.g.:
> 
> w1w2w3 == tokenized as ==> w1w2 w2w3 (or even _w1 w1w2 w2w3 w3_) would
> miss a search for w2. w1 w2 w3 would work better.
> 
If Chinese text is segmented into single characters, e.g. w1w2w3 => w1 w2 w3,
then searches for "w1w2" and "w2w1" will return the same results, won't they?


Bigram-based word segmentation, "w1w2w3" => "w1w2" "w2w3",
or even trigram-based word segmentation, "w1w2w3w4" => "w1w2w3" "w2w3w4",
avoids the character-order problem above.
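
To illustrate the point, here is a minimal Python sketch (not Lucene code). It assumes a simple index that stores tokens without position information, and the names ngrams and matches_all are hypothetical helpers made up for this example. "ABC" stands in for w1w2w3.

```python
def ngrams(text, n):
    """Overlapping character n-grams; a text no longer than n is kept as one token."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def matches_all(query, document, n):
    """True if every n-gram of the query is among the document's n-grams.

    Assumption: token positions are ignored (no phrase matching).
    """
    doc_tokens = set(ngrams(document, n))
    return all(tok in doc_tokens for tok in ngrams(query, n))


doc = "ABC"  # stands in for w1w2w3

# Single-character segmentation: character order is lost.
unigram_forward = matches_all("AB", doc, 1)
unigram_reversed = matches_all("BA", doc, 1)

# Bigram segmentation: the reversed query no longer matches.
bigram_forward = matches_all("AB", doc, 2)
bigram_reversed = matches_all("BA", doc, 2)

# The cost raised in the quoted message: a lone "B" is not a bigram token.
single_char = matches_all("B", doc, 2)
```

Under these assumptions both unigram queries match, only the forward bigram query matches, and the single-character query misses, which is the trade-off discussed in this thread.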

According to the statistics, bigram-based word segmentation returned the best results, but it
needs the queryParser to parse queries with an "AND" relation by default.
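
A rough Python sketch of why the default "AND" relation matters for bigram tokens. This is only an illustration, not the Lucene queryParser; the function names and the sample documents are made up for the example.

```python
def bigrams(text):
    """Overlapping character bigrams, e.g. "ABC" -> ["AB", "BC"]."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def search(query, docs, require_all):
    """Return the docs matching the query's bigrams.

    require_all=True models an "AND" default, False models an "OR" default.
    """
    query_tokens = bigrams(query)
    if not query_tokens:  # a query shorter than two characters yields no bigrams
        return []
    combine = all if require_all else any
    hits = []
    for doc in docs:
        doc_tokens = set(bigrams(doc))
        if combine(tok in doc_tokens for tok in query_tokens):
            hits.append(doc)
    return hits


docs = ["ABCD", "XBCY"]  # hypothetical documents

and_hits = search("ABC", docs, require_all=True)
or_hits = search("ABC", docs, require_all=False)
```

With "OR", the second document matches only because it shares the bigram "BC", so partial overlaps are returned too. With "AND", only the document containing both "AB" and "BC" is returned.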

You can try bigram-based word segmentation at http://search.163.com in the category
search and news search (the web page search is powered by Google).
Google's Chinese language analysis is provided by basistech, using dictionary-based word
segmentation:
http://www.basistech.com/products/language-analysis/cma.html



Che, Dong
