For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an easy way to
solve the word segmentation problem.
Bigram-based segmentation works like this: C1C2C3C4 => C1C2 C2C3 C3C4 (where each C# is a single CJK
character term).
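For illustration, here is a minimal sketch of that rule. It is my own helper class, not code from
Lucene or CJKTokenizer, and it assumes the input is already a pure run of CJK characters:

    import java.util.ArrayList;
    import java.util.List;

    // Standalone illustration only -- not Lucene code. It applies the bigram
    // rule above to a run of CJK characters: "C1C2C3C4" -> C1C2, C2C3, C3C4.
    public class BigramDemo {
        public static List<String> bigrams(String cjkRun) {
            List<String> terms = new ArrayList<String>();
            for (int i = 0; i + 1 < cjkRun.length(); i++) {
                terms.add(cjkRun.substring(i, i + 2)); // overlapping two-character term
            }
            return terms;
        }

        public static void main(String[] args) {
            System.out.println(bigrams("中华人民")); // prints [中华, 华人, 人民]
        }
    }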
I think StandardTokenizer could be made to handle multi-language mixed content, for example
Chinese/English or Japanese/French text.
In CJKTokenizer (modified from StopTokenizer) I use a one-character buffer that remembers the previous CJK
character, so that I can emit the overlapping term (Ci-1 + Ci).
But in StandardTokenizer I still don't know how to produce:
T1T2T3T4 => T1T2 T2T3 T3T4 (where each T# is a single CJK character term).
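Roughly, the one-character-buffer idea looks like this. The names and structure here are mine, for
illustration only; this is not the actual CJKTokenizer or StandardTokenizer source:

    // Sketch of the single-character-buffer approach: prevChar holds the last
    // CJK character seen, and each new CJK character is paired with it to form
    // an overlapping bigram term (previous + current).
    public class CjkBigramSketch {
        private char prevChar = 0; // 0 = no previous CJK character buffered

        // Feed one input character at a time; returns a bigram term,
        // or null if there is nothing to emit yet.
        public String next(char c) {
            if (!isCjk(c)) {       // non-CJK character: reset the buffer
                prevChar = 0;
                return null;
            }
            if (prevChar == 0) {   // first CJK character of a run: just remember it
                prevChar = c;
                return null;
            }
            String term = "" + prevChar + c; // previous + current character
            prevChar = c;          // slide the window forward by one character
            return term;
        }

        private boolean isCjk(char c) {
            // Rough test for CJK Unified Ideographs, Hiragana/Katakana and Hangul;
            // a real tokenizer would use Character.UnicodeBlock or similar.
            return (c >= '\u4E00' && c <= '\u9FFF')
                || (c >= '\u3040' && c <= '\u30FF')
                || (c >= '\uAC00' && c <= '\uD7A3');
        }
    }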
For more articles on word segmentation for Asian languages, see:
http://www.google.com/search?q=chinese+word+segment+bigram
Regards
Che, Dong
----- Original Message -----
From: "Eric Isakson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages
> Hi All,
>
> I want to volunteer to help get language modules organized into the CVS and builds.
>
> I've been lurking on the lists here for a couple of months, working with and getting
> familiar with Lucene. I'm investigating the use of Lucene to support our help
> system's fulltext search requirements. I have to build indices for multiple
> languages. I just poked around the CVS archives and found only the German, Russian
> and standard (English) analyzers in the core and nothing in the sandbox. In the list
> archives I've found many references to folks using Lucene for several other
> languages. I did find the CJKTokenizer, Dutch and French analyzers and have put those
> into my tests. Is there somewhere these analyzers are organized where I might get
> hold of the sources for other languages to build into my toolset? There were a couple
> mentioned that several of you appear to be using that I can't find the sources for
> (most notably http://www.halyava.ru/do/org.apache.lucene.analysis.zip, which gives a
> "Cannot find server" error).
>
> In order to meet the requirements for my product, these are the languages I have to
> support:
>
> Must Support
> ------------
> English
> Japanese
> Chinese
> Korean
> French
> German
> Italian
> Polish
>
> Not Sure Yet
> ------------
> Czech
> Danish
> Hebrew
> Hungarian
> Russian
> Spanish
> Swedish
>
> I understand the issues that were raised about putting language modules in the core
> and then not being able to support them, but it seems they have not been put
> anywhere. I would be willing to try to get them into a central place where people can
> access them, or to help someone who is already working on that. I can't commit today
> to being able to maintain or bugfix contributions, but should my company adopt Lucene
> as our search engine (which seems likely at this point), I'll do what I can to
> contribute back any fixes we make. I also have a personal interest in the project,
> since I've found Lucene quite interesting to work with and I've enjoyed learning
> about internationalizing Java apps.
>
> I'll volunteer to help gather and organize these somewhere if I'm given committer
> rights to the appropriate area and folks are willing to send me their language
> modules.
>
> I recall some discussion about moving language modules out of the core, but I don't
> think any decisions were made about where to put them (perhaps this is why they
> aren't in the CVS at all). I was thinking we could perhaps give each language a
> sandbox project, or create language packages in the core build that could be enabled
> via settings in the build.properties file. Using the build.properties file could
> allow us to create a jar for each language during the core build, so folks could
> install just the language modules they want, and if a language module starts breaking
> due to changes in the core it could easily be turned off until fixes were made to
> that module. I can start working on a setup like this in my local source tree next
> week, using the existing language modules in the core, if you all think this would be
> a good approach. If not, does anyone have a proposal for where these belong so we can
> get some movement on getting them committed to CVS?
>
> Regards,
> Eric
> --
> Eric D. Isakson
> Application Developer, XML Technologies
> SAS Institute Inc.
> SAS Campus Drive
> Cary, NC 27513
> (919) 531-3639
> http://www.sas.com