051028 083415 DONE indexing segment 20051019000305:
total 100000 records in 520.156 s (192.3077 rec/s).
051028 083415 done indexing

Been doing some testing and I've pretty much peaked
out at 192-200 rec/s on a 2.8 GHz machine with lang
ident enabled on 512 bytes of data @ 3-grams, which
after tweaking even exceeds the rate I was getting
before I tried lang ident.

I wonder what's going on with our fetch performance - we're at about 50 pages/second, on a 3GHz quad CPU Xeon box with SCSI RAID 5 disks and a 100Mbps pipe.
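
As a rough sanity check on the numbers above, the sketch below works out how large the average page would have to be for 50 pages/second to fill a 100 Mbps link. The function name is made up for illustration, and the calculation ignores protocol overhead, DNS lookups, and politeness delays, so it only gives a ceiling, not a diagnosis.

```python
def breakeven_page_bytes(link_mbps, pages_per_sec):
    """Average page size (bytes) at which the link would be saturated.

    Hypothetical helper: assumes 1 Mbps = 1,000,000 bits/s and no
    protocol overhead.
    """
    bytes_per_sec = link_mbps * 1_000_000 / 8
    return bytes_per_sec / pages_per_sec


size = breakeven_page_bytes(100, 50)
print(f"{size / 1000:.0f} kB per page")  # prints "250 kB per page"
```

If the pages being fetched average well under that size, the pipe itself is probably not what is holding the fetch rate down.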

Still not seeing any heavy IO, so I'm going to try and
see where my limits are. It seems that after a while of
increasing max this and that I don't see any
performance difference, and even some degradation...
will try and plot this out :)

BTW, is this something that could be done on the fetch
process so the db contains the language and that could
be used to control your fetch list creation to begin
with?

If I understand your question correctly, you want to focus on fetching pages for particular languages, or rather defer fetching of pages that aren't in a target language, right?

Once you've parsed a page & identified (to some level of confidence) the language, you could use the language to adjust the nextScore value for outlinks to pages that don't currently exist. Then in FetchListTool use this nextScore value, and provide some topN value such that the top links are going to be in your target language.
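
A minimal sketch of that idea, in Python rather than Nutch's Java and not using any real Nutch classes: `Outlink`, `adjust_outlinks`, `top_n`, and the boost, penalty, and confidence threshold values are all hypothetical names and numbers chosen for illustration. It also skips the check for whether the target page already exists in the db.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Outlink:
    url: str
    next_score: float  # stand-in for the nextScore value mentioned above


def adjust_outlinks(outlinks, page_lang, confidence, target_lang="en",
                    boost=2.0, penalty=0.5, min_confidence=0.8):
    """Scale outlink scores by the language of the page they came from.

    If the language guess is below min_confidence, scores are left alone.
    All thresholds and factors here are assumptions, not Nutch defaults.
    """
    if confidence < min_confidence:
        return [Outlink(o.url, o.next_score) for o in outlinks]
    factor = boost if page_lang == target_lang else penalty
    return [Outlink(o.url, o.next_score * factor) for o in outlinks]


def top_n(candidates, n):
    """Pick the n highest-scoring links, as a fetch list cutoff would."""
    return heapq.nlargest(n, candidates, key=lambda o: o.next_score)


links = (
    adjust_outlinks([Outlink("http://a.example/", 1.0)], "en", 0.9)
    + adjust_outlinks([Outlink("http://b.example/", 1.0)], "de", 0.9)
    + adjust_outlinks([Outlink("http://c.example/", 1.0)], "de", 0.3)
)
fetch_list = top_n(links, 2)
```

With these made-up factors, the link found on the confidently English page ranks first, the link from the low-confidence page keeps its original score, and the link from the confidently non-English page falls below the cutoff.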

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
