Hi All,
I want to volunteer to help get the language modules organized in CVS and in the builds.
I've been lurking on the lists here for a couple of months while getting familiar with
Lucene. I'm investigating the use of Lucene to support our help system's full-text
search requirements, and I have to build indices for multiple languages. I just poked
around the CVS archives and found only the German, Russian and standard (English)
analyzers in the core, and nothing in the sandbox. In the list archives I've found many
references to folks using Lucene for several other languages. I did find the
CJKTokenizer and the Dutch and French analyzers, and have put those into my tests. Is
there somewhere these analyzers are collected, so that I can get hold of the sources
for other languages to build into my toolset? A couple were mentioned that several of
you appear to be using but whose sources I can't find (most notably
http://www.halyava.ru/do/org.apache.lucene.analysis.zip, which gives a "Cannot find
server" error).
In order to meet the requirements for my product these are the languages I have to
support:
Must Support
------------
English
Japanese
Chinese
Korean
French
German
Italian
Polish

Not Sure Yet
------------
Czech
Danish
Hebrew
Hungarian
Russian
Spanish
Swedish
I understand the issues that were raised about putting language modules in the core
and then not being able to support them, but it seems they have not been put anywhere.
I would be willing to try to get them into a central place where people can access
them, or to help someone who is already working on that. I can't commit today to
maintaining or bug-fixing contributions, but should my company adopt Lucene as our
search engine (which seems likely at this point), I'll do what I can to contribute back
any fixes we make. I also have a personal interest in the project: I've found Lucene
quite interesting to work with, and I've enjoyed learning about internationalizing
Java apps.
I'll volunteer to help gather and organize these somewhere, provided I am given
committer rights to the appropriate area and folks are willing to send me their
language modules.
I recall some discussion about moving language modules out of the core, but I don't
think any decisions were made about where to put them (perhaps this is why they aren't
in CVS at all). I see two options: give each language its own sandbox project, or
create language packages in the core build that can be enabled via settings in the
build.properties file. With the build.properties approach, the core build could create
a jar for each language, so folks could install just the language modules they want.
If a language module starts breaking due to changes in the core, it could easily be
turned off until that module is fixed. I can start working on a setup like this in my
local source tree next week, using the existing language modules in the core, if you
all think this would be a good approach. If not, does anyone have a proposal for where
these belong, so we can get some movement on getting them committed to CVS?
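
To make the build.properties idea a bit more concrete, here is a rough sketch of how
per-language switches might be read. Everything in it is an assumption for
illustration only: the property names (lang.<code>.enabled), the language codes, and
the jar naming scheme are hypothetical and are not part of the existing Lucene build.
A language with no entry in the file is treated as disabled.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class LanguageModuleSelector {

    // Hypothetical language codes; a real build would list its own modules.
    static final String[] LANGUAGES = {"de", "ru", "fr", "nl", "cjk"};

    // Returns the languages whose (hypothetical) "lang.<code>.enabled"
    // property is set to true. Missing properties count as disabled.
    static List<String> enabledLanguages(Properties props) {
        List<String> enabled = new ArrayList<>();
        for (String lang : LANGUAGES) {
            String value = props.getProperty("lang." + lang + ".enabled", "false");
            if (Boolean.parseBoolean(value.trim())) {
                enabled.add(lang);
            }
        }
        return enabled;
    }

    // Hypothetical naming scheme for the per-language jar.
    static String jarName(String lang) {
        return "lucene-analysis-" + lang + ".jar";
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the contents of a build.properties file.
        String buildProperties =
                "lang.de.enabled=true\n"
              + "lang.ru.enabled=false\n"
              + "lang.fr.enabled=true\n";

        Properties props = new Properties();
        props.load(new StringReader(buildProperties));

        // List the jars the build would produce for the enabled languages.
        for (String lang : enabledLanguages(props)) {
            System.out.println(jarName(lang));
        }
    }
}
```

In this sketch, turning off a broken module is a one-line change in the properties
file (for example setting lang.ru.enabled=false), and the other language jars are
unaffected.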
Regards,
Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com