Hi. I'm very new to Nutch so please forgive me if this seems simple. Please also note that I'm a Java newbie (go Perl!).
I'm trying to index 250,000 PDFs (roughly 40GB). I estimate the initial crawl will take 5-10 days (another, commercial product took 5 days to index this collection). The issue is that I'd like to have part of the index available while the remainder of the document collection is being fetched, analyzed, and indexed.

The way I'm trying to create the index is:

    bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1

If I put a subdirectory of the main site in root_urls.txt, Nutch finishes quickly, but I cannot run it again with another subdirectory. It says the data directory is a directory ...

    Exception in thread "main" java.io.FileNotFoundException: /mnt/storage/app_data/nutch-data/site1 (Is a directory)

Any help is really appreciated.

Thanks,
bill
