If the data you are searching lies in DFS, searching is slow. You need to
first copy it out to the local file system, then split it into smaller
slices and distribute them evenly across your search nodes.
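
As a rough illustration of the "distribute evenly" step, here is a minimal
Python sketch that assigns slices to search nodes by size. It is only a
planning helper under stated assumptions: the slice names, sizes, and node
names are hypothetical, and it does not copy any data or call Nutch or
Hadoop itself.

```python
import heapq

def assign_slices(slice_sizes, nodes):
    """Greedy balance: hand the next-largest slice to the least-loaded node."""
    # Heap entries are (bytes assigned so far, node name).
    heap = [(0, node) for node in nodes]
    heapq.heapify(heap)
    plan = {node: [] for node in nodes}
    # Largest slices first, so the small ones even out the remaining gaps.
    by_size = sorted(slice_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        load, node = heapq.heappop(heap)
        plan[node].append(name)
        heapq.heappush(heap, (load + size, node))
    return plan

# Hypothetical slice sizes (arbitrary units) and node names, for illustration.
slices = {"slice-0": 40, "slice-1": 35, "slice-2": 30,
          "slice-3": 25, "slice-4": 20, "slice-5": 10}
plan = assign_slices(slices, ["search1", "search2", "search3"])
for node, names in plan.items():
    print(node, names)
```

Balancing by size rather than by slice count keeps one node from ending up
with all the large slices, which would make it the slowest responder for
every query.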
This part of the process is not that well covered, and I am hoping for much
improvement in this area from this proposal:
http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/[EMAIL
PROTECTED]
--
Sami Siren
Håvard W. Kongsgård wrote:
DistributedSearch
2x datanodes, 2x Task Trackers
Sami Siren wrote:
Are you using DistributedSearch, with the local filesystem storing the
index and related data?
--
Sami Siren
Håvard W. Kongsgård wrote:
I have Nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000
memory). Searching with queries like 'China Nuclear Forces' takes
20 to 25 s.
My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m
My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727
Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564
The filesystem under path '/' is HEALTHY
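
For anyone reading the figures above, here is a small Python sketch of how
the reported numbers relate to each other. The values are copied from the
output; the variable names are mine, and the replication formula assumes
each over-replicated block carries exactly one extra replica, which is an
assumption rather than something the output states.

```python
# Crawl db figures, copied from the stats above.
total_urls = 3748140
retries = [3614731, 85999, 20772, 26638]   # retry 0..3
statuses = [1316016, 2168397, 263727]      # unfetched, fetched, gone

# Both breakdowns should account for every URL.
retries_ok = sum(retries) == total_urls
statuses_ok = sum(statuses) == total_urls

# Filesystem check figures, copied from the output above.
total_size = 254534723272   # bytes
total_blocks = 5140
over_replicated = 8

avg_block_size = total_size // total_blocks
over_pct = 100.0 * over_replicated / total_blocks
# Assumes one extra replica per over-replicated block.
real_replication = (total_blocks + over_replicated) / total_blocks

print(retries_ok, statuses_ok)
print(avg_block_size, over_pct, real_replication)
```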