Searching is slow if the data being searched lives in DFS. First copy it out to the local file system, then split it into smaller slices and distribute them evenly across your search nodes.
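
A rough sketch of the "distribute evenly" step, assuming the slices already exist and you know their sizes. The function name, slice names and node names are made up for illustration and are not part of Nutch. It only works out which slice should go to which node (largest slice first, always onto the least-loaded node); the actual copy out of DFS is left out.

```python
import heapq


def assign_slices(slice_sizes, nodes):
    """Greedily assign each slice to the currently least-loaded node.

    slice_sizes: dict mapping a (hypothetical) slice name to its size in bytes
    nodes:       list of (hypothetical) search node names
    Returns a dict mapping each node name to the list of slices it should hold.
    """
    # Heap of (bytes assigned so far, node name); ties break on node name.
    heap = [(0, node) for node in nodes]
    heapq.heapify(heap)
    plan = {node: [] for node in nodes}

    # Place the largest slices first so the smaller ones can even things out.
    ordered = sorted(slice_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, node = heapq.heappop(heap)
        plan[node].append(name)
        heapq.heappush(heap, (load + size, node))
    return plan


if __name__ == "__main__":
    # Illustrative sizes only, not taken from a real index.
    sizes = {"slice-a": 40, "slice-b": 30, "slice-c": 20, "slice-d": 10}
    for node, slices in assign_slices(sizes, ["node1", "node2"]).items():
        print(node, slices)
```

With the illustrative sizes above, both nodes end up holding 50 units each. Balancing by size rather than by slice count matters when the slices are uneven, since the slowest node determines the query time.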

This part of the process is not well covered yet, and I am hoping for much improvement in this area from this proposal:

http://mail-archives.apache.org/mod_mbox/lucene-general/200610.mbox/[EMAIL PROTECTED]

--
 Sami Siren



Håvard W. Kongsgård wrote:
DistributedSearch
2x datanodes, 2x Task Trackers

Sami Siren wrote:
Are you using DistributedSearch, with the local filesystem storing the index and related data?

--
 Sami Siren


Håvard W. Kongsgård wrote:
I have Nutch 0.8.1 running on 3 servers (AMD X2 3800 with 4 000 memory). Searching with queries like 'China Nuclear Forces' takes 20-25 s.

My config:
http.content.limit = 6165536
dfs.replication = 1
mapred.submit.replication = 2
mapred.child.java.opts = -Xmx800m

My data:
TOTAL urls: 3748140
retry 0: 3614731
retry 1: 85999
retry 2: 20772
retry 3: 26638
min score: 0.0
avg score: 0.64956105
max score: 3922.723
status 1 (DB_unfetched): 1316016
status 2 (DB_fetched): 2168397
status 3 (DB_gone): 263727

Status: HEALTHY
Total size: 254534723272 B
Total blocks: 5140 (avg. block size 49520374 B)
Total dirs: 260
Total files: 1466
Over-replicated blocks: 8 (0.15564202 %)
Under-replicated blocks: 0 (0.0 %)
Target replication factor: 1
Real replication factor: 1.0015564

The filesystem under path '/' is HEALTHY





