Sorry, not a direct answer, but since you said "go Perl!", you may find
it easier to do everything in Perl.  LWP & friends can be used to put
together a little web crawler, and Plucene or KinoSearch can be used
for searching.
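
To give an idea of the crawler half, here is a rough sketch, not a
finished tool.  It assumes LWP::UserAgent, HTML::LinkExtor and URI are
installed from CPAN.  The sub names, the agent string and the page limit
are made up for illustration.  It only follows links that stay under the
start URL, and it does not extract text from PDFs or feed anything to
Plucene/KinoSearch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

# Return the absolute URLs of all <a href> links in $html that start
# with $base (hypothetical helper, pure function, no network access).
sub extract_links {
    my ($html, $base) = @_;
    my @found;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' && defined $attr{href};
        my $uri = URI->new_abs($attr{href}, $base)->canonical;
        $uri->fragment(undef);              # drop "#anchor" parts
        push @found, $uri->as_string;
    });
    $parser->parse($html);
    $parser->eof;
    return grep { index($_, $base) == 0 } @found;
}

# Breadth-first crawl from $start, fetching at most $max_pages URLs.
# Returns the list of URLs that were fetched successfully.
sub crawl {
    my ($start, $max_pages) = @_;
    my $ua = LWP::UserAgent->new(agent => 'little-crawler/0.1', timeout => 30);
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @fetched;
    while (@queue && @fetched < $max_pages) {
        my $url = shift @queue;
        my $res = $ua->get($url);
        next unless $res->is_success;
        push @fetched, $url;                # a real tool would store/index here
        next unless $res->content_type eq 'text/html';
        my $html = $res->decoded_content;
        next unless defined $html;
        for my $link (extract_links($html, $start)) {
            push @queue, $link unless $seen{$link}++;
        }
    }
    return @fetched;
}

# Crawl only when a start URL is given on the command line.
if (!caller && @ARGV) {
    print "$_\n" for crawl($ARGV[0], 100);
}
```

Saved under a name of your choosing and run with the URL of one
subdirectory as its only argument, it prints the URLs it managed to
fetch.  Since each run is independent, you could run it per subdirectory
and index each batch as it finishes.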

Here are some Plucene links:
http://www.simpy.com/simpy/User.do?username=otis&q=%2Bplucene

Otis

--- "Hauck, William B." <[EMAIL PROTECTED]> wrote:
> Hi.
> 
> I'm very new to Nutch so please forgive me if this seems simple.
> Please also note that I'm a Java newbie (go Perl!).
> 
> I'm trying to index 250,000 PDFs (roughly 40GB).  I estimate the
> initial crawl will take 5-10 days (another, commercial product took
> 5 days to index this collection).  The issue I have is that I'd like
> to have part of the index available while the remainder of the
> document collection is being fetched, analyzed, and indexed.  The way
> I'm trying to create the index is:
> 
> bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
> 
> If I put a subdirectory of the main site in root_urls.txt, Nutch
> finishes quickly, but I cannot run it again with another
> subdirectory.  It says the data directory is a directory ...
> Exception in thread "main" java.io.FileNotFoundException: /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> 
> Any help is really appreciated.
> 
> Thanks,
> 
> bill