Hi.

I'm very new to Nutch, so please forgive me if this seems simple. Please
also note that I'm a Java newbie (go Perl!).

I'm trying to index 250,000 PDFs (roughly 40GB). I estimate the initial
crawl will take 5-10 days (another, commercial product took 5 days to
index this collection). The issue is that I'd like to have part of the
index available while the remainder of the document collection is still
being fetched, analyzed, and indexed. The way I'm trying to create the
index is:

bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1

If I put a subdirectory of the main site in root_urls.txt, Nutch
finishes quickly, but I cannot run it again with another subdirectory
using the same data directory. It says the data directory is a directory ...
Exception in thread "main" java.io.FileNotFoundException:
/mnt/storage/app_data/nutch-data/site1 (Is a directory)

Any help is really appreciated.

Thanks,

bill



