For Asian languages (Chinese, Korean, Japanese), bigram-based word segmentation is an easy way to
solve the word segmentation problem.
Bigram-based segmentation works like this: C1C2C3C4 => C1C2 C2C3 C3C4 (where each C# is a single CJK
character term).
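For illustration, here is a minimal sketch of that rule. It is my own helper class, not code from
Lucene or CJKTokenizer, and it assumes the input is already a pure run of CJK characters:

    import java.util.ArrayList;
    import java.util.List;

    // Standalone illustration only -- not Lucene code. It applies the bigram
    // rule above to a run of CJK characters: "C1C2C3C4" -> C1C2, C2C3, C3C4.
    public class BigramDemo {
        public static List<String> bigrams(String cjkRun) {
            List<String> terms = new ArrayList<String>();
            for (int i = 0; i + 1 < cjkRun.length(); i++) {
                terms.add(cjkRun.substring(i, i + 2)); // overlapping two-character term
            }
            return terms;
        }

        public static void main(String[] args) {
            System.out.println(bigrams("中华人民")); // prints [中华, 华人, 人民]
        }
    }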
I think StandardTokenizer could be made to handle multi-language mixed content, for example
Chinese/English or Japanese/French text.
In CJKTokenizer (modified from StopTokenizer) I use a one-character buffer that remembers the previous CJK
character, so that I can emit the overlapping term (Ci-1 + Ci).
But in StandardTokenizer I still don't know how to produce:
T1T2T3T4 => T1T2 T2T3 T3T4 (where each T# is a single CJK character term).
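Roughly, the one-character-buffer idea looks like this. The names and structure here are mine, for
illustration only; this is not the actual CJKTokenizer or StandardTokenizer source:

    // Sketch of the single-character-buffer approach: prevChar holds the last
    // CJK character seen, and each new CJK character is paired with it to form
    // an overlapping bigram term (previous + current).
    public class CjkBigramSketch {
        private char prevChar = 0; // 0 = no previous CJK character buffered

        // Feed one input character at a time; returns a bigram term,
        // or null if there is nothing to emit yet.
        public String next(char c) {
            if (!isCjk(c)) {       // non-CJK character: reset the buffer
                prevChar = 0;
                return null;
            }
            if (prevChar == 0) {   // first CJK character of a run: just remember it
                prevChar = c;
                return null;
            }
            String term = "" + prevChar + c; // previous + current character
            prevChar = c;          // slide the window forward by one character
            return term;
        }

        private boolean isCjk(char c) {
            // Rough test for CJK Unified Ideographs, Hiragana/Katakana and Hangul;
            // a real tokenizer would use Character.UnicodeBlock or similar.
            return (c >= '\u4E00' && c <= '\u9FFF')
                || (c >= '\u3040' && c <= '\u30FF')
                || (c >= '\uAC00' && c <= '\uD7A3');
        }
    }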
For more articles on word segmentation for Asian languages, see:
http://www.google.com/search?q=chinese+word+segment+bigram
Regards
Che, Dong
----- Original Message -----
From: "Eric Isakson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, December 07, 2002 12:40 AM
Subject: Analyzers for various languages
> Hi All,
>
> I want to volunteer to help get language modules organized into the CVS and builds.
>
> I've been lurking on the lists here for a couple of months, working with and getting
> familiar with Lucene. I'm investigating the use of Lucene to support our help
> system's fulltext search requirements. I have to build indices for multiple
> languages. I just poked around the CVS archives and found only the German, Russian
> and standard (English) analyzers in the core and nothing in the sandbox. In the list
> archives I've found many references to folks using Lucene for several other
> languages. I did find the CJKTokenizer, Dutch and French analyzers and have put those
> into my tests. Is there somewhere these analyzers are organized where I might get
> hold of the sources for other languages to build into my toolset? There were a couple
> mentioned that several of you appear to be using that I can't find the sources for
> (most notably http://www.halyava.ru/do/org.apache.lucene.analysis.zip, which gives a
> "Cannot find server" error).
>
> In order to meet the requirements for my product, these are the languages I have to
> support:
>
> Must Support
> ------------
> English
> Japanese
> Chinese
> Korean
> French
> German
> Italian
> Polish
>
> Not Sure Yet
> ------------
> Czech
> Danish
> Hebrew
> Hungarian
> Russian
> Spanish
> Swedish
>
> I understand the issues that were raised about putting language modules in the core
> and then not being able to support them, but it seems they have not been put
> anywhere. I would be willing to try to get them into a central place where people can
> access them, or to help someone who is already working on that. I can't commit today
> to being able to maintain or bugfix contributions, but should my company adopt Lucene
> as our search engine (which seems likely at this point), I'll do what I can to
> contribute back any fixes we make. I also have a personal interest in the project,
> since I've found Lucene quite interesting to work with and I've enjoyed learning
> about internationalizing Java apps.
>
> I'll volunteer to help gather and organize these somewhere if I'm given committer
> rights to the appropriate area and folks are willing to send me their language
> modules.
>
> I recall some discussion about moving language modules out of the core, but I don't
> think any decisions were made about where to put them (perhaps this is why they
> aren't in the CVS at all). I was thinking we could perhaps give each language a
> sandbox project, or create language packages in the core build that could be enabled
> via settings in the build.properties file. Using the build.properties file could
> allow us to create a jar for each language during the core build, so folks could
> install just the language modules they want, and if a language module starts breaking
> due to changes in the core it could easily be turned off until fixes were made to
> that module. I can start working on a setup like this in my local source tree next
> week, using the existing language modules in the core, if you all think this would be
> a good approach. If not, does anyone have a proposal for where these belong so we can
> get some movement on getting them committed to CVS?
>
> Regards,
> Eric
> --
> Eric D. Isakson
> Application Developer, XML Technologies
> SAS Institute Inc.
> SAS Campus Drive
> Cary, NC 27513
> (919) 531-3639
> http://www.sas.com