Hi folks,

  I'm struggling with the Nutch crawler at the step where it closes the
database. I'm running on Red Hat Enterprise with 4 GB RAM and JDK 1.5.06. The
database holds a few million pages. The system usually crashes when Nutch
gets to closing the database (during both the database update and the
fetchlist step). The log file contains the following:

060115 200846 Processing pagesByURL: Sorted 18630 instructions in 0.067seconds.
060115 200846 Processing pagesByURL: Sorted 278059.7014925373instructions/second
060115 200915 Processing pagesByURL: Merged to new DB containing 3975608 records in 28.646 seconds
060115 200915 Processing pagesByURL: Merged 138784.05362005165records/second
060115 200915 Processing pagesByMD5: Sorted 18630 instructions in 0.121seconds.
060115 200915 Processing pagesByMD5: Sorted 153966.94214876034instructions/second
060115 201007 Processing pagesByMD5: Merged to new DB containing 3975608 records in 51.773 seconds
060115 201007 Processing pagesByMD5: Merged 76789.21445541111 records/second
060115 201409 Processing linksByMD5: Copied file (96 bytes) in 241.822 secs.
[long run of ^@ (NUL) characters]


First of all, I can't understand why it takes so long to copy a 96-byte file
compared with the other operations. Should it be a folder? After the system
crashes, webdb, webdb.new and webdb.old are all present.

It seems that Nutch really has some issues with the database. A few days ago
I ran into a problem where webdb.new was not deleted. Some folks suggested
switching to Linux, and that fixed it, but this problem remains. Thanks for
any help.

Regards,
Giang
