I have nearly 4 million Chinese documents, each between 1 KB and 300 KB in
size, so I use org.apache.lucene.analysis.cn.ChineseAnalyzer as the analyzer
for the text. The index has four fields:
content - tokenized, not stored
title - tokenized and stored
path - stored only
date - stored only
For various reasons I split these documents into 12 sets and search them
with an IndexSearcher over a MultiReader. English queries are all very fast,
taking only about 10-100 ms. With Chinese queries, however, the behaviour
is confusing:
If the word is a single character, the query is actually a TermQuery and
the search is very fast.
However, if the word has more than one character, the query is actually a
PhraseQuery with slop 0, and IndexSearcher usually takes 3000-5000 ms to
return the Hits.
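
To make the difference between the two cases concrete, here is a minimal
sketch in plain Java. It does not use any Lucene classes: the class
PhraseMatchSketch, its methods and the toy in-memory positional index are
all hypothetical, and Lucene's real postings format and scorers work
differently. It only illustrates that a term lookup needs the document list
alone, while an exact phrase (slop 0) also has to read and compare the
positions of every term in every candidate document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy index, not Lucene code.
class PhraseMatchSketch {
    // term -> (docId -> positions of the term in that document, ascending)
    static final Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    // Index one document given as an already tokenized sequence.
    static void addDoc(int docId, String... tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(pos);
        }
    }

    // Term lookup: only the set of documents containing the term is needed.
    static Set<Integer> termQuery(String term) {
        Map<Integer, List<Integer>> postings = index.get(term);
        return postings == null ? new TreeSet<>() : new TreeSet<>(postings.keySet());
    }

    // Exact phrase (slop 0): terms[i] must occur at position start + i.
    static Set<Integer> phraseQuery(String... terms) {
        Set<Integer> hits = new TreeSet<>();
        for (int docId : termQuery(terms[0])) {
            for (int start : index.get(terms[0]).get(docId)) {
                boolean match = true;
                for (int i = 1; i < terms.length && match; i++) {
                    Map<Integer, List<Integer>> postings = index.get(terms[i]);
                    List<Integer> positions = postings == null ? null : postings.get(docId);
                    // Linear scan of the position list; this is the extra work
                    // that a plain term lookup never has to do.
                    match = positions != null && positions.contains(start + i);
                }
                if (match) {
                    hits.add(docId);
                    break; // one occurrence of the phrase is enough for a hit
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        addDoc(1, "中", "文", "检", "索");
        addDoc(2, "文", "中", "检");
        System.out.println(termQuery("中"));        // [1, 2]
        System.out.println(phraseQuery("中", "文")); // [1]
    }
}
```

In this sketch the cost of phraseQuery grows with the number of occurrences
of each character, which is large when the terms are very common single
characters spread over millions of documents.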
I have also tested with QueryParser and get the same results. My
environment is a Dell PE2600 with 2 x 2 GHz Xeon, 2 GB RAM and 10,000 rpm
SCSI storage, running Debian sarge, Sun JDK 1.5 and Lucene 2.0.0.
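
As a rough illustration of why a hand-built query and QueryParser would
end up with the same query type, the following plain-Java sketch
approximates one-token-per-character analysis. It is an assumption-laden
stand-in, not the ChineseAnalyzer source: the class QueryShapeSketch and
its methods are hypothetical, Han characters are detected with
Character.UnicodeScript, and runs of other letters or digits are kept
together as a single lower-cased token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical approximation of per-character analysis, not Lucene code.
class QueryShapeSketch {
    // Each Han character becomes its own token; runs of other letters or
    // digits stay together as one lower-cased token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                flush(run, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(Character.toLowerCase(cp));
            } else {
                flush(run, tokens); // whitespace or punctuation ends a run
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Emit the pending letter/digit run, if any, as one token.
    static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    // One token maps to a term lookup; several tokens produced from a
    // single word map to an exact phrase over those tokens.
    static String queryShape(String word) {
        List<String> tokens = tokenize(word);
        if (tokens.isEmpty()) {
            return "empty";
        }
        return (tokens.size() == 1 ? "TermQuery" : "PhraseQuery") + tokens;
    }

    public static void main(String[] args) {
        System.out.println(queryShape("Lucene")); // TermQuery[lucene]
        System.out.println(queryShape("中"));      // TermQuery[中]
        System.out.println(queryShape("中文"));    // PhraseQuery[中, 文]
    }
}
```

Under these assumptions an English word stays one token and becomes a term
lookup, while any Chinese word of two or more characters always becomes a
phrase, whichever way the query is constructed.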
Thanks.