Just split your run into multiple smaller segments; as each one finishes,
you can index it and search it while the rest are still being fetched.
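
As a rough illustration of the splitting idea, here is a minimal sketch in
Python that cuts one big URL list into smaller batch files, so each batch
can be fetched and indexed on its own.  The helper names (batches,
write_batches), the urls_NNN.txt file naming, and the batch size are all
made up for the example; none of this is part of Nutch itself.

```python
from pathlib import Path
from typing import Iterator, List


def batches(urls: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def write_batches(urls: List[str], size: int, out_dir: str) -> List[Path]:
    """Write each batch to its own file (hypothetical naming: urls_000.txt, ...)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, batch in enumerate(batches(urls, size)):
        path = out / f"urls_{i:03d}.txt"
        path.write_text("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Each resulting file would then be the input for one smaller run, and the
finished runs can be searched while the later ones are still going.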

BTW, I have a P4 with 2 GB of RAM and 4x120 GB SATA drives, and I can fetch
250k PDFs in an hour or so.  What would take 5 days for you to run?
Limited bandwidth?  I guess in your case, if everything is on a single host,
you're probably only processing a few docs at a time to keep from hammering
your server :)
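
To put the two numbers side by side, a quick back-of-the-envelope sketch in
Python.  It uses only the figures from this thread (250,000 documents,
one hour vs. 5 days); the function name is made up for the example.

```python
def docs_per_second(doc_count: int, seconds: float) -> float:
    """Average fetch rate needed to get `doc_count` documents in `seconds`."""
    return doc_count / seconds


ONE_HOUR = 60 * 60            # seconds
FIVE_DAYS = 5 * 24 * 60 * 60  # seconds

fast = docs_per_second(250_000, ONE_HOUR)   # the "hour or so" case
slow = docs_per_second(250_000, FIVE_DAYS)  # the 5-day estimate

print(f"1 hour : {fast:.1f} docs/sec")
print(f"5 days : {slow:.2f} docs/sec")
```

So the 5-day figure works out to well under one document per second on
average, which is the kind of rate you would see with a deliberately polite
crawl of a single host.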

-byron

-----Original Message-----
From: "Hauck, William B." <[EMAIL PROTECTED]>
To: [email protected]
Date: Tue, 5 Apr 2005 15:59:40 -0400
Subject: Quickstart to indexing/searching large site

> Hi.
> 
> I'm very new to Nutch so please forgive me if this seems simple.  Please
> also note that I'm a Java newbie (go Perl!).
> 
> I'm trying to index 250,000 pdfs (roughly 40GB).  I estimate the initial
> crawl to take 5-10 days (another, commercial product took 5 days to
> index this collection.)  The issue I have is that I'd like to have part
> of the index available while the remainder of the document collection is
> being fetched, analyzed, and indexed.  The way I'm trying to create the
> index is:
> 
> bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
> 
> If I put a subdirectory of the main site in the root_urls.txt Nutch
> finishes quickly, but I cannot run it again with another subdirectory.
> It says the data directory is a directory ...
> Exception in thread "main" java.io.FileNotFoundException:
> /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> 
> Any help is really appreciated.
> 
> Thanks,
> 
> bill
