Hi Jack,

You can get my Chinese phrase dictionary and my Chinese sentence segmentation program at the link below:

[sandbox] Two experimental modules for Lucene Chinese word segmentation
<http://www.grass.org.cn/blog/archives/2005/07/sandboxluceneae.html>
» gRaSSland development diary <http://www.grass.org.cn/blog/>
And now I have solved the problems I mentioned in my previous mail.

good luck
tian

2005/7/22, Jack Tang <[EMAIL PROTECTED]>:
>
> Hi Transbuerg
>
> First, could you please explain what you mean by "the same condition"? Thanks.
>
> I have no constructive suggestion for you, because your dictionary
> is really huge. If you can share it with me, I would appreciate that.
>
> And here are my tips on the dictionary structure. Optimization? I
> don't think so.
> 1. I split the whole dictionary into small pieces according to
> Chinese pronunciation.
> 2. Use dictionary lazy loading when indexing.
> 3. Load on demand when searching, so you can control how many
> pieces should be in memory.
>
> Thoughts?
>
> /Jack
>
> On 7/20/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> > Hi,
> > WebLucene does not use dictionary-based segmentation. On the
> > contrary, it uses bi-gram segmentation. You can get more information
> > at http://www.chedong.com, or search for "车东" (Che Dong) and lucene.
> >
> > At the moment I am trying to use dictionary-based segmentation; you
> > can visit my blog:
> > http://blog.csdn.net/accesine960/category/35308.aspx
> >
> > I have written a dictionary-based segmentation Java program, but it
> > is still under testing. I have run into two issues:
> > 1. My dictionary contains about 150,000 Chinese phrases, so I put it
> > into a HashMap when segmenting.
> > 2. With this program the index-building process works well, but when
> > searching, my server's CPU is always 99% busy. (My server: 4 GB of
> > memory and 4 CPUs; the index file is about 2.2 GB.)
> >
> > So in recent days I have been striving to solve the above two issues.
> >
> > good luck
> > If you are Chinese, we could use Chinese for further exchanges......
> >
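As a rough Java sketch of the dictionary structure Jack suggests above: the phrases are split into per-pronunciation files, each piece is loaded only when first needed, and pieces can be evicted to cap memory. All class, method, and file names here are hypothetical illustrations, not code from either poster:

    import java.io.*;
    import java.util.*;

    /** Hypothetical sketch: a phrase dictionary split into per-pinyin
     *  pieces, loaded lazily so the whole 150,000-phrase map never has
     *  to sit in memory at once. */
    public class LazyPhraseDictionary {
        private final File dictDir;   // one file per pinyin key, e.g. "zh.dic"
        private final Map<String, Set<String>> pieces =
                new HashMap<String, Set<String>>();

        public LazyPhraseDictionary(File dictDir) {
            this.dictDir = dictDir;
        }

        /** True if the phrase is in the dictionary, loading its piece on
         *  demand. Mapping a phrase to its pinyin key needs a separate
         *  lookup table, so the caller supplies the key here. */
        public boolean contains(String phrase, String pinyinKey)
                throws IOException {
            Set<String> piece = pieces.get(pinyinKey);
            if (piece == null) {           // lazy load on first touch
                piece = loadPiece(pinyinKey);
                pieces.put(pinyinKey, piece);
            }
            return piece.contains(phrase);
        }

        /** Drop a piece to cap memory, as Jack suggests for search time. */
        public void evict(String pinyinKey) {
            pieces.remove(pinyinKey);
        }

        private Set<String> loadPiece(String pinyinKey) throws IOException {
            Set<String> piece = new HashSet<String>();
            File f = new File(dictDir, pinyinKey + ".dic");
            if (!f.exists()) return piece;
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    piece.add(line.trim());   // one phrase per line
                }
            } finally {
                in.close();
            }
            return piece;
        }
    }

A HashSet per piece keeps lookups O(1); wiring evict() to an LRU policy would give the "control how many pieces should be in memory" behavior Jack mentions, and may also help with the 99%-CPU search problem if the segmenter probes the dictionary for every candidate substring.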
> > 2005/7/19, Jack Tang <[EMAIL PROTECTED]>:
> > > Hi Transbuerg
> > >
> > > Could you please describe your solution in detail? Appreciate your time.
> > >
> > > Regards
> > > /Jack
> > >
> > > On 7/15/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> > > > Hi, Jack Tang
> > > >
> > > > I am in the same condition as you. Could you share your complete
> > > > NutchAnalysis.jj file here? I am not using Nutch, but Lucene.
> > > >
> > > > good luck.
> > > >
> > > > http://blog.csdn.net/accesine960/archive/2005/07/13/424306.aspx
> > > >
> > > > 2005/7/15, Jack Tang <[EMAIL PROTECTED]>:
> > > > > Hi All
> > > > >
> > > > > It has taken me a long time to think about embedding an improved
> > > > > CJKAnalysis into NutchAnalysis. I got nothing but some failure
> > > > > experiences, which I share with you here; maybe you can hack it
> > > > > (well, I am not going to give up).
> > > > >
> > > > > I have written several Chinese word segmenters: some are
> > > > > dictionary-based, such as Forward Maximum Matching (FMM) and
> > > > > Backward Maximum Matching (BMM), and some are auto-segmenting,
> > > > > say bi-gram. They work fine in a pure Chinese text environment
> > > > > (not a mixture of Chinese and other languages).
> > > > >
> > > > > Why do I only aim at pure Chinese text? In NutchAnalysis.jj:
> > > > >
> > > > > <orig>
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: <CJK> >
> > > > >
> > > > > </orig>
> > > > >
> > > > > <modified>
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: (<CJK>)+ >
> > > > >
> > > > > </modified>
> > > > >
> > > > > SIGRAM only contains CJK words.
> > > > >
> > > > > Well, I am not very familiar with JavaCC, so a big puzzle has
> > > > > stopped me. As you know:
> > > > >
> > > > > // basic word -- lowercase it
> > > > > <WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
> > > > >   { matchedToken.image = matchedToken.image.toLowerCase(); }
> > > > >
> > > > > This statement means that if the sentence matches the "WORD" rule,
> > > > > then the wrapped object matchedToken will carry the extracted
> > > > > target word. *ONE* word is extracted per match.
> > > > >
> > > > > So the term() function is simple:
> > > > >
> > > > > /** Parse a single term. */
> > > > > String term() :
> > > > > {
> > > > >   Token token;
> > > > > }
> > > > > {
> > > > >   ( token=<WORD> | token=<ACRONYM> ) // I don't think it is
> > > > >                                      // reasonable to put
> > > > >                                      // "token=<SIGRAM>" here.
> > > > >   { return token.image; }
> > > > > }
> > > > >
> > > > > For CJK it is quite different. We have to extract *MANY* words in
> > > > > one match:
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: (<CJK>)+ >
> > > > > {
> > > > >   // parsing <CJK>+ will generate many words (tokens) here!
> > > > > }
> > > > >
> > > > > My approach is to construct one TokenList to hold these tokens.
> > > > > The pseudocode looks like:
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: (<CJK>)+ >
> > > > > {
> > > > >   for (int i = 0; i < image.length(); ...) {
> > > > >     Token token = /* extract in bi-gram */;
> > > > >     tokenList.add(token);
> > > > >   }
> > > > > }
> > > > >
> > > > > Accordingly, the term() function should return an ArrayList:
> > > > >
> > > > > /** .... **/
> > > > > ArrayList term() :
> > > > > {
> > > > >   Token token;
> > > > > }
> > > > > {
> > > > >   ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> )
> > > > >   { return tokenList; }
> > > > > }
> > > > >
> > > > > After these modifications, running NutchAnalysis.class you will
> > > > > get an odd result. Say I input some Chinese characters: C1C2C3
> > > > > The result will be: "C1C2 C2C3" (NOTICE the quotation marks).
> > > > >
> > > > > Am I going in the wrong direction? Or will someone share thoughts
> > > > > on NutchAnalysis.jj? Thanks
> > > > >
> > > > > Regards
> > > > > /Jack
> > > > >
> > > > > --
> > > > > Keep Discovering ... ...
> > > > > http://www.jroller.com/page/jmars
> > >
> > > --
> > > Keep Discovering ... ...
> > > http://www.jroller.com/page/jmars
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
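For reference, the bi-gram extraction Jack's pseudocode hints at can be written out as plain Java. This is only a guess at the intended behavior, with a stand-in token class so the snippet stays self-contained (the real grammar action would use JavaCC's Token type):

    import java.util.ArrayList;
    import java.util.List;

    /** Stand-in for the parser's token type, used only to keep this
     *  sketch self-contained. */
    class CjkToken {
        final String image;
        final int start, end;   // character offsets in the original run
        CjkToken(String image, int start, int end) {
            this.image = image; this.start = start; this.end = end;
        }
    }

    public class BigramSegmenter {
        /** Split a run of CJK characters into overlapping bi-grams:
         *  "C1C2C3" -> ["C1C2", "C2C3"]; a lone character is kept as-is. */
        public static List<CjkToken> bigrams(String image) {
            List<CjkToken> tokenList = new ArrayList<CjkToken>();
            if (image.length() == 1) {
                tokenList.add(new CjkToken(image, 0, 1));
                return tokenList;
            }
            for (int i = 0; i + 2 <= image.length(); i++) {
                tokenList.add(new CjkToken(image.substring(i, i + 2), i, i + 2));
            }
            return tokenList;
        }

        public static void main(String[] args) {
            for (CjkToken t : bigrams("中华人民")) {
                System.out.println(t.image + " [" + t.start + "," + t.end + ")");
            }
            // prints three lines: 中华 [0,2) / 华人 [1,3) / 人民 [2,4)
        }
    }

Note that "C1C2 C2C3" is exactly what bi-gram segmentation should produce for C1C2C3, so the output itself is not wrong; the open question in the thread is how to return multiple tokens from a JavaCC production that is declared to return a single term.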

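Jack also mentions dictionary-based Forward Maximum Matching (FMM). A minimal sketch of that technique, using a toy in-memory dictionary rather than the 150,000-phrase one discussed above:

    import java.util.*;

    /** Sketch of Forward Maximum Matching: at each position, take the
     *  longest dictionary phrase starting there; fall back to a single
     *  character when nothing matches. */
    public class FmmSegmenter {
        private final Set<String> dict;
        private final int maxLen;   // longest phrase in the dictionary

        public FmmSegmenter(Set<String> dict) {
            this.dict = dict;
            int m = 1;
            for (String w : dict) m = Math.max(m, w.length());
            this.maxLen = m;
        }

        public List<String> segment(String text) {
            List<String> words = new ArrayList<String>();
            int i = 0;
            while (i < text.length()) {
                int end = Math.min(i + maxLen, text.length());
                // shrink the window until it matches a dictionary phrase
                while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                    end--;
                }
                words.add(text.substring(i, end));  // single char if no match
                i = end;
            }
            return words;
        }

        public static void main(String[] args) {
            Set<String> dict = new HashSet<String>(
                    Arrays.asList("中华", "中华人民", "人民"));
            System.out.println(new FmmSegmenter(dict).segment("中华人民万岁"));
            // prints: [中华人民, 万, 岁]
        }
    }

BMM is the mirror image, scanning from the end of the string; comparing FMM and BMM output on the same sentence is a common way to flag ambiguous spans.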