Tansley, Robert wrote:
Hi all,
The DSpace system (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and the extracted full-text content of documents
stored in it. Now that the system is being used globally, it needs to
support multi-language indexing.
I've looked through the
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
For the StandardAnalyzer, will it have to be modified to accept
different character encodings?
We have customers in China, Taiwan and Hong Kong. Chinese data may come
in three different encodings: Big5, GB, and UTF8.
What is the default encoding
On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:
Btw, I did try running the Lucene demo (web template) to index the HTML
files after I added one including English and Chinese characters.
I was not able to search for any Chinese in that HTML file (returned no
hits).
I wonder whether I need
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages
[EMAIL PROTECTED] 6/3/2005 6:03:31 AM
Sent: Friday, June 03, 2005 14:23
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple languages
Robert,
On 2 June 2005, at 21:42, Tansley, Robert wrote:
It seems that there are even more options --
4/ One index, with a separate Lucene document for each (item,language)
combination
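Option 4 would mean indexing each language variant of an item as its own document, with a shared item identifier so hits can be grouped back per item. A minimal stdlib-only sketch of that layout, using plain Maps to stand in for Lucene documents (the field names `item_id`, `language`, and `contents` are illustrative, not part of any real schema):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of option 4: one document per (item, language) pair.
// Plain Maps stand in for Lucene Documents; field names are illustrative.
public class PerLanguageDocs {

    // Build one pseudo-document per language variant of a single item.
    static List<Map<String, String>> docsFor(String itemId,
                                             Map<String, String> textByLang) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map.Entry<String, String> e : textByLang.entrySet()) {
            docs.add(Map.of(
                "item_id", itemId,       // shared key to group hits per item
                "language", e.getKey(),  // e.g. "en", "zh"
                "contents", e.getValue()));
        }
        return docs;
    }
}
```

The shared `item_id` field is what distinguishes this from option 3 (separate indices): all variants stay searchable in one index, but a front end can collapse multiple per-language hits back to one item.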
Tansley, Robert wrote:
What if we're trying to index multiple languages in the same site? Is
it best to have:
1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language
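Option 2 usually amounts to adding a keyword field (say, `language`) to every document, and then constraining searches by prepending a required term in Lucene's query syntax. A minimal sketch, assuming a field named `language` holding ISO-style codes (the field name and helper are illustrative, not an existing API):

```java
// Hypothetical sketch for option 2: constrain a query to one language by
// requiring a term on an assumed "language" keyword field. The returned
// string uses Lucene's query syntax, where "+" marks a required clause.
public class LanguageFilter {
    static String constrain(String userQuery, String langCode) {
        // Both the language term and the user's query become required.
        return "+language:" + langCode + " +(" + userQuery + ")";
    }
}
```

The resulting string can be handed to a query parser; documents lacking the matching `language` term are then excluded from the hits.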
From: Libbrecht [mailto:[EMAIL PROTECTED]]
Sent: 01 June 2005 04:10
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple languages
On 1 June 2005, at 01:12, Erik Hatcher wrote:
1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
Hi Erik,
I am a newcomer to this list, so please allow me to ask a dumb
question.
For the StandardAnalyzer, will it have to be modified to accept
different character encodings?
We have customers in China, Taiwan and Hong Kong. Chinese data may come
in three different encodings: Big5, GB, and UTF8.
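One point worth noting on the encoding question: Lucene analyzes Java Strings, which are Unicode internally, so the StandardAnalyzer itself is encoding-agnostic. The encoding only matters at the moment the raw bytes are decoded into a String, before anything reaches the analyzer. A small sketch of that decoding step ("Big5" and "GBK" are the usual JDK charset names for the Traditional/Simplified encodings; their availability can vary by JRE):

```java
import java.nio.charset.Charset;

// Decode raw bytes into a Unicode String before handing text to Lucene.
// For file input the same idea applies via an InputStreamReader opened
// with the right charset. Charset names here are standard JDK names.
public class DecodeSource {
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }
}
```

So Big5, GB, and UTF-8 sources can all feed the same analyzer, as long as each is decoded with its own charset on the way in.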
Hi,
Interesting topic. I have thought about this as well. I want to index
Chinese text mixed with English, i.e., to treat the English text
inside Chinese text as English tokens rather than as Chinese tokens.
Right now I think maybe I have to write a special analyzer that takes
the text input,
Jian - have you tried Lucene's StandardAnalyzer with Chinese? It
will keep English as-is (removing stop words, lowercasing, and such)
and will also separate CJK characters into individual tokens.
Erik
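The behaviour described above can be illustrated without Lucene itself: Latin-script runs come out as whole, lowercased tokens, while each CJK ideograph becomes its own single-character token. A rough stdlib-only imitation of that tokenization (stop-word removal omitted; this is not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Rough imitation of the tokenization behaviour described above: runs of
// non-CJK letters/digits become one lowercased token each, while every
// CJK ideograph becomes its own token. Not Lucene code; no stop words.
public class CjkSplit {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean cjk = Character.UnicodeBlock.of(cp)
                    == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
            if (Character.isLetterOrDigit(cp) && !cjk) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                if (word.length() > 0) {          // flush pending Latin token
                    out.add(word.toString());
                    word.setLength(0);
                }
                if (cjk) {                        // one token per ideograph
                    out.add(new String(Character.toChars(cp)));
                }
            }
            i += Character.charCount(cp);
        }
        if (word.length() > 0) out.add(word.toString());
        return out;
    }
}
```

Single-character CJK tokens explain why a phrase-style search is often needed for multi-character Chinese words under this scheme.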
On May 31, 2005, at 5:49 PM, jian chen wrote:
Hi,
Interesting topic. I thought about
Hi, Erik,
Thanks for your info.
No, I haven't tried it yet. I will give it a try and maybe produce
a Chinese/English text search demo online.
Currently I use Lucene as the indexing engine for the Velocity mailing
list search. I have a demo at www.jhsystems.net.
It is yet another mailing list