Sorry, not a direct answer, but since you said "go Perl!", you may find it easier to do everything in Perl. LWP & friends can be used to put together a little web crawler, and Plucene or KinoSearch can be used for searching.
Here are some Plucene links:
http://www.simpy.com/simpy/User.do?username=otis&q=%2Bplucene

Otis

--- "Hauck, William B." <[EMAIL PROTECTED]> wrote:

> Hi.
>
> I'm very new to Nutch so please forgive me if this seems simple. Please
> also note that I'm a Java newbie (go Perl!).
>
> I'm trying to index 250,000 pdfs (roughly 40GB). I estimate the initial
> crawl to take 5-10 days (another, commercial product took 5 days to
> index this collection.) The issue I have is that I'd like to have part
> of the index available while the remainder of the document collection is
> being fetched, analyzed, and indexed. The way I'm trying to create the
> index is:
>
> bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
>
> If I put a subdirectory of the main site in the root_urls.txt Nutch
> finishes quickly, but I cannot run it again with another subdirectory.
> It says the data directory is a directory ...
> Exception in thread "main" java.io.FileNotFoundException:
> /mnt/storage/app_data/nutch-data/site1 (Is a directory)
>
> Any help is really appreciated.
>
> Thanks,
>
> bill
