I have got nearly 4 million chinese documents, each size ranges from 1k - 300k. So I use org.apache.lucene.analysis.cn.ChineseAnalyzer as the analyzer for the text. The index have four fields:
content - tokenized not stored
title - tokenized and stored
path - stored only
date - stored only

For some reason, I divide these documents into 12 sets and use IndexSearcher over MultiReader for search. For all the english query, the speed is very fast, only cost about 10-100ms. But when I use the Chinese words for query, the situation is a bit confused: If the word is only one char, so the Query is actually a TermQuery, the speed is very fast. however, If the word is more than one char, the Query is actually a PhraseQuery with slop 0,
IndexSearcher usually cost 3000-5000ms to return the Hits.

I have also tested with the QueryParser and get the same results, and my environment is a Dell PE2600 2G*2 Xeon, 2GRAM, 10000R/s SCSI, Debian/sarge, Sun JDK 1.5 + lucene 2.0.0

thanks.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to