For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an easy way to work around the word segmentation problem. Bigram-based segmentation works like this:

C1C2C3C4 => C1C2 C2C3 C3C4   (each C# is a single CJK character)

I think this could let a StandardTokenizer handle mixed multi-language content, e.g. Chinese/English or Japanese/French text.

In CJKTokenizer (modified from StopTokenizer) I use a one-character buffer that remembers the previous CJK character in order to emit the overlapping term (Ci-1 + Ci). But in StandardTokenizer I still don't know how to produce:

T1T2T3T4 => T1T2 T2T3 T3T4   (each T# is a single CJK character term)

A rough sketch of the one-character-buffer idea is at the bottom of this mail, below the quoted message.

For more articles on word segmentation for Asian languages:
http://www.google.com/search?q=chinese+word+segment+bigram

Regards

Che, Dong

----- Original Message -----
From: "Eric Isakson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages

> Hi All,
>
> I want to volunteer to help get language modules organized into the CVS and builds.
>
> I've been lurking on the lists here for a couple months and working with and getting familiar with Lucene. I'm investigating the use of Lucene to support our help system's fulltext search requirements. I have to build indices for multiple languages. I just poked around the CVS archives and found only the German, Russian and standard (English) analyzers in the core and nothing in the sandbox. In the list archives I've found many references to folks using Lucene for several other languages. I did find the CJKTokenizer, Dutch and French analyzers and have put those into my tests. Is there somewhere these analyzers are organized that I might get a hold of the sources for other languages to build into my toolset? There were a couple mentioned that several of you appear to be using that I can't find the sources for (most notably http://www.halyava.ru/do/org.apache.lucene.analysis.zip, which gives a "Cannot find server" error).
>
> In order to meet the requirements for my product these are the languages I have to support:
>
> Must Support
> ------------
> English
> Japanese
> Chinese
> Korean
> French
> German
> Italian
> Polish
>
> Not Sure Yet
> ------------
> Czech
> Danish
> Hebrew
> Hungarian
> Russian
> Spanish
> Swedish
>
> I understand the issues that were raised about putting language modules in the core and then not being able to support them, but it seems they have not been put anywhere. I would be willing to try and get them into a central place that people can access them, or help someone that is already working on that. I can't commit today to being able to maintain or bugfix contributions, but should my company adopt Lucene as our search engine (which seems likely at this point) I'll do what I can to contribute back any fixes we make. I also have a personal interest in the project since I've found Lucene quite interesting to be working with and I've enjoyed learning about internationalizing Java apps.
>
> I'll volunteer to help gather and organize these somewhere if I were given committer rights to the appropriate area and folks would be willing to send me their language modules.
>
> I recall some discussion about moving language modules out of the core, but I don't think any decisions were made about where to put them (perhaps this is why they aren't in the CVS at all).
> I was thinking perhaps give each language a sandbox project or create language packages in the core build that could be enabled via settings in the build.properties file. Using the build.properties file could allow us to create a jar for each language during the core build so folks could install just the language modules they want, and if a language module starts breaking due to changes in the core it could easily be turned off until fixes were made to that module. I can start working on a setup like this in my local source tree next week using the existing language modules in the core if you all think this would be a good approach. If not, does anyone have a proposal for where these belong so we can get some movement on getting them committed to CVS?
>
> Regards,
> Eric
> --
> Eric D. Isakson          SAS Institute Inc.
> Application Developer    SAS Campus Drive
> XML Technologies         Cary, NC 27513
> (919) 531-3639           http://www.sas.com
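
Here is a rough standalone Java sketch of the one-character-buffer bigram idea described at the top of this mail. It is only an illustration: the class name, the isCJK ranges and the sample string are my own, and it is not written against Lucene's Tokenizer API; it just shows the bare loop.

import java.util.ArrayList;
import java.util.List;

public class BigramSketch {

    // Very rough CJK test covering only the most common blocks.
    static boolean isCJK(char c) {
        return (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
            || (c >= 0x3040 && c <= 0x30FF)   // Hiragana and Katakana
            || (c >= 0xAC00 && c <= 0xD7A3);  // Hangul syllables
    }

    // C1C2C3C4 => C1C2 C2C3 C3C4: emit one term per overlapping pair
    // of adjacent CJK characters.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<String>();
        char prev = 0;                        // the one-character buffer
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCJK(c)) {
                if (prev != 0) {
                    terms.add("" + prev + c); // previous char + current char
                }
                prev = c;
            } else {
                prev = 0;                     // non-CJK input breaks the chain
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four CJK characters in a row give three overlapping bigrams.
        System.out.println(bigrams("\u4E2D\u6587\u5206\u8BCD")); // [中文, 文分, 分词]
    }
}

Note that queries have to be segmented the same way at search time so the query bigrams line up with the indexed ones, and in a real tokenizer the non-CJK branch would hand those characters to the normal alphabetic tokenizing path instead of just resetting the buffer.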

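On the build.properties idea in the quoted mail: I imagine the per-language switches would look something like the entries below. The property names are invented here just to illustrate the shape of it; they are not an existing Lucene convention.

# one switch per optional language module (names are hypothetical)
analyzer.de.enabled=true
analyzer.ru.enabled=true
analyzer.fr.enabled=true
analyzer.cjk.enabled=false

The core build would then only compile and jar the modules whose flag is set, so a module that breaks against a newer core can be switched off without touching the rest of the build.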