Re: Indexing multiple languages

2005-06-07 Thread sergiu gordea
Tansley, Robert wrote: Hi all, DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin Core standard) and the extracted full-text content of documents stored in it. Now that the system is being used globally, it needs to support multi-language indexing. I've looked through the

Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote: For the StandardAnalyzer, will it have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong. Chinese data may come in 3 different encodings: Big5, GB and UTF8. What is the default encoding

Re: Indexing multiple languages

2005-06-03 Thread Erik Hatcher
On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote: Btw, I did try running the lucene demo (web template) to index the HTML files after I added one including English and Chinese characters. I was not able to search for any Chinese in that HTML file (returned no hits). I wonder whether I need

Re: Indexing multiple languages

2005-06-03 Thread Grant Ingersoll
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages [EMAIL PROTECTED] 6/3/2005 6:03:31 AM On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote: Btw, I did try running the lucene demo (web template) to index the HTML files after I added one including English and Chinese characters. I

RE: Indexing multiple languages

2005-06-03 Thread Max Pfingsthorn
Sent: Friday, June 03, 2005 14:23 To: java-user@lucene.apache.org Subject: Re: Indexing multiple languages Robert, On 2 June 05, at 21:42, Tansley, Robert wrote: It seems that there are even more options -- 4/ One index, with a separate Lucene document for each (item,language) combination

Re: Indexing multiple languages

2005-06-03 Thread Doug Cutting
Tansley, Robert wrote: What if we're trying to index multiple languages in the same site? Is it best to have: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can be constrained to a particular language 3/ separate indices for each
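Option 2/ above (one index plus a language field) can be pictured with a toy in-memory index. This is only a rough sketch in Python, not Lucene code: the `TinyIndex` class, its methods, and the sample documents are all hypothetical, and whitespace splitting stands in for a real analyzer.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with a per-document language field (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.language = {}                # document id -> language code

    def add(self, doc_id, language, text):
        # Store the language alongside the document, then index its terms.
        self.language[doc_id] = language
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term, language=None):
        # Look up the term, then optionally keep only one language's documents.
        hits = self.postings.get(term.lower(), set())
        if language is not None:
            hits = {d for d in hits if self.language[d] == language}
        return sorted(hits)


index = TinyIndex()
index.add(1, "en", "chat room")
index.add(2, "fr", "le chat noir")

all_hits = index.search("chat")
french_hits = index.search("chat", language="fr")
```

Without the language constraint the English and French documents both match "chat"; with it, only the French one does. That is the trade-off option 2/ describes: a single index to maintain, with the filter applied at query time.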

RE: Indexing multiple languages

2005-06-02 Thread Tansley, Robert
Libbrecht [mailto:[EMAIL PROTECTED] Sent: 01 June 2005 04:10 To: java-user@lucene.apache.org Subject: Re: Indexing multiple languages On 1 June 05, at 01:12, Erik Hatcher wrote: 1/ one index for all languages 2/ one index for all languages, with an extra language field so searches can

RE: Indexing multiple languages

2005-06-02 Thread Bob Cheung
Hi Erik, I am a newcomer to this list, so please allow me to ask a dumb question. For the StandardAnalyzer, will it have to be modified to accept different character encodings? We have customers in China, Taiwan and Hong Kong. Chinese data may come in 3 different encodings: Big5, GB and UTF8.
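One common way to approach the encoding question is to decode each file to Unicode when it is read, so that the analysis step only ever sees decoded text. The following is a minimal sketch in Python rather than Java, and it assumes the encoding of each input is already known. The `decode_document` helper and the sample text are hypothetical.

```python
def decode_document(raw_bytes, encoding):
    """Decode raw bytes to a Unicode string using the declared source encoding."""
    return raw_bytes.decode(encoding)


text = "中文"

# The same two characters serialized in the three encodings mentioned above.
samples = {
    "big5": text.encode("big5"),
    "gb2312": text.encode("gb2312"),
    "utf-8": text.encode("utf-8"),
}

# After decoding, all three inputs are the same Unicode string.
decoded = {name: decode_document(data, name) for name, data in samples.items()}
```

The byte sequences differ per encoding, but the decoded strings are identical, so whatever tokenizes the text does not need to know which encoding it arrived in.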

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Interesting topic. I thought about this as well. I wanted to index Chinese text with English, i.e., I want to treat the English text inside Chinese text as English tokens rather than Chinese text tokens. Right now I think maybe I have to write a special analyzer that takes the text input,

Re: Indexing multiple languages

2005-05-31 Thread Erik Hatcher
Jian - have you tried Lucene's StandardAnalyzer with Chinese? It will keep English as-is (removing stop words, lowercasing, and such) and also split CJK characters into individual tokens. Erik On May 31, 2005, at 5:49 PM, jian chen wrote: Hi, Interesting topic. I thought about
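The behaviour Erik describes (English words lowercased with stop words removed, each CJK character emitted as its own token) can be imitated in a few lines. This is only an illustrative Python sketch, not Lucene's StandardAnalyzer: the `tokenize` function, the tiny stop list, and the restriction to the CJK Unified Ideographs block are all assumptions made for the example.

```python
import re

# Tiny illustrative stop list (an assumption, not Lucene's actual list).
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in"}

# Either a run of ASCII letters/digits, or one character from the
# CJK Unified Ideographs block (U+4E00 to U+9FFF).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def tokenize(text):
    """Lowercase Latin words, drop stop words, emit each CJK character alone."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group().lower()
        if token not in STOP_WORDS:
            tokens.append(token)
    return tokens


tokens = tokenize("The Lucene 索引 is fast")
print(tokens)  # ['lucene', '索', '引', 'fast']
```

Single-character CJK tokens mean a query for a two-character word only works as a phrase (both characters adjacent), which is one reason a dedicated analyzer for mixed Chinese/English text may still be worth considering.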

Re: Indexing multiple languages

2005-05-31 Thread jian chen
Hi, Erik, Thanks for your info. No, I haven't tried it yet. I will give it a try and maybe produce some Chinese/English text search demo online. Currently I use Lucene as the indexing engine for Velocity mailing list search. I have a demo at www.jhsystems.net. It is yet another mailing list