Just split your run into multiple smaller segments; as each segment finishes, you can index it and start searching it while the rest of the crawl is still running.
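Something like the loop below. This is only a rough sketch based on the generate/fetch/updatedb/index cycle from the Nutch whole-web tutorial; the exact flags, the -topN value, and the segments/2* path pattern are from memory and vary between versions, so run bin/nutch with no arguments to check the usage on your build:

  # one-time setup: create the web db and inject your seed URLs
  bin/nutch admin db -create
  bin/nutch inject db -urlfile conf/root_urls.txt

  # repeat until done; each pass fetches and indexes one small segment
  bin/nutch generate db segments -topN 10000   # carve off the next 10k pages
  s=`ls -d segments/2* | tail -1`              # newest segment directory
  bin/nutch fetch $s                           # fetch just that segment
  bin/nutch updatedb db $s                     # feed discovered links back in
  bin/nutch index $s                           # index it; searchable right away

Each pass gives you another indexed segment the search side can see, so the collection comes online incrementally instead of all at once after the full 5-10 days.
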
BTW, I have a P4 with 2 gigs of RAM and 4x120GB SATA drives, and I can fetch 250k PDFs in an hour or so. Why would it take 5 days for you to run? Limited bandwidth? I guess in your case, with everything on a single host, you're probably only processing a few docs at a time to keep from hammering your server :)

-byron

-----Original Message-----
From: "Hauck, William B." <[EMAIL PROTECTED]>
To: [email protected]
Date: Tue, 5 Apr 2005 15:59:40 -0400
Subject: Quickstart to indexing/searching large site

> Hi.
>
> I'm very new to Nutch, so please forgive me if this seems simple. Please
> also note that I'm a Java newbie (go Perl!).
>
> I'm trying to index 250,000 PDFs (roughly 40GB). I estimate the initial
> crawl will take 5-10 days (another commercial product took 5 days to
> index this collection). The issue I have is that I'd like to have part
> of the index available while the remainder of the document collection is
> being fetched, analyzed, and indexed. The way I'm trying to create the
> index is:
>
> bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
>
> If I put a subdirectory of the main site in root_urls.txt, Nutch
> finishes quickly, but I cannot run it again with another subdirectory.
> It says the data directory is a directory ...
>
> Exception in thread "main" java.io.FileNotFoundException:
> /mnt/storage/app_data/nutch-data/site1 (Is a directory)
>
> Any help is really appreciated.
>
> Thanks,
>
> bill
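
P.S. On the "Is a directory" error: as far as I can tell, the one-shot crawl tool is meant for a single start-to-finish crawl and refuses to write into an existing -dir, so re-running it over the same directory won't work. Either drive the crawl with the step-by-step commands above, or give each subdirectory its own crawl dir (the path below is just an illustration, adjust to taste):

  bin/nutch crawl conf/subdir2_urls.txt -dir /mnt/storage/app_data/nutch-data/site2

There's also an index-merging tool in the distribution (bin/nutch merge, the IndexMerger class) for combining the resulting indexes afterwards, but check its usage on your version rather than trusting my memory of the arguments.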
