Hi folks,

I'm struggling with the Nutch crawler at the step where it closes the database. I'm running on Red Hat Enterprise Linux with 4 GB of RAM and JDK 1.5.06. The database holds a few million pages. The system usually crashes when Nutch gets to closing the database (both when updating the database and when generating a fetchlist). The log file contains the following:

060115 200846 Processing pagesByURL: Sorted 18630 instructions in 0.067seconds.
060115 200846 Processing pagesByURL: Sorted 278059.7014925373instructions/second
060115 200915 Processing pagesByURL: Merged to new DB containing 3975608 records in 28.646 seconds
060115 200915 Processing pagesByURL: Merged 138784.05362005165records/second
060115 200915 Processing pagesByMD5: Sorted 18630 instructions in 0.121seconds.
060115 200915 Processing pagesByMD5: Sorted 153966.94214876034instructions/second
060115 201007 Processing pagesByMD5: Merged to new DB containing 3975608 records in 51.773 seconds
060115 201007 Processing pagesByMD5: Merged 76789.21445541111 records/second
060115 201409 Processing linksByMD5: Copied file (96 bytes) in 241.822 secs.
^@^@^@^@^@^@^@^@[...]^@

First of all, I can't understand why it takes so long to copy a 96-byte file compared with the other operations. Should it be a folder instead? When the system crashed, webdb, webdb.new and webdb.old were all present.

Nutch really does seem to have some issues with the database. A few days ago I ran into a problem where webdb.new was not being deleted. Some folks suggested switching to Linux, and that worked, but this problem still remains.

Thanks for any help.

Regards,
Giang
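
P.S. To put a number on the slow copy, here is a small Python sketch that recomputes the rate for each timing line in the log. It is only a helper for reading the log, not part of Nutch: the function name and the regular expressions are my own and simply assume the line format shown above. For the linksByMD5 line it comes out at well under one byte per second, so the 241 seconds cannot be explained by the amount of data being copied.

```python
import re

# Assumed format, taken from the log above, e.g.
# "... Processing linksByMD5: Copied file (96 bytes) in 241.822 secs."
COPY_RE = re.compile(r"Copied file \((\d+) bytes\) in ([\d.]+) secs")

# Assumed format, taken from the log above, e.g.
# "... Sorted 18630 instructions in 0.067seconds."
# "... Merged to new DB containing 3975608 records in 28.646 seconds"
COUNT_RE = re.compile(
    r"(?:Sorted|containing) (\d+) (instructions|records) in ([\d.]+) ?seconds"
)


def rate(line):
    """Return (count, unit, seconds, count_per_second), or None if the
    line is not one of the timing lines described above."""
    m = COPY_RE.search(line)
    if m:
        count, secs = int(m.group(1)), float(m.group(2))
        return count, "bytes", secs, count / secs
    m = COUNT_RE.search(line)
    if m:
        count, unit, secs = int(m.group(1)), m.group(2), float(m.group(3))
        return count, unit, secs, count / secs
    return None


if __name__ == "__main__":
    sample = [
        "060115 200846 Processing pagesByURL: Sorted 18630 instructions in 0.067seconds.",
        "060115 200915 Processing pagesByURL: Merged to new DB containing 3975608 records in 28.646 seconds",
        "060115 201409 Processing linksByMD5: Copied file (96 bytes) in 241.822 secs.",
    ]
    for line in sample:
        result = rate(line)
        if result is not None:
            count, unit, secs, per_sec = result
            print(f"{count} {unit} in {secs} s = {per_sec:.3f} {unit}/s")
```

The summary lines that already report a rate (the ones ending in "instructions/second" or "records/second") are skipped on purpose, since they carry no elapsed time.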
