Otis,

I'd love to do it in Perl, but I don't really have a choice in the matter. Java is in favor, so Java it is.
Thanks, though.

bill

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 05, 2005 10:30 PM
To: [email protected]
Subject: Re: [Nutch-general] Quickstart to indexing/searching large site

Sorry, not a direct answer, but since you said "go Perl!", you may find it easier to do everything in Perl. LWP & friends can be used to put together a little web crawler, and Plucene or Kinosearch can be used for searching. Here are some Plucene links:
http://www.simpy.com/simpy/User.do?username=otis&q=%2Bplucene

Otis

--- "Hauck, William B." <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I'm very new to Nutch, so please forgive me if this seems simple. Please
> also note that I'm a Java newbie (go Perl!).
>
> I'm trying to index 250,000 PDFs (roughly 40GB). I estimate the
> initial crawl to take 5-10 days (another, commercial product took 5
> days to index this collection). The issue I have is that I'd like to
> have part of the index available while the remainder of the document
> collection is being fetched, analyzed, and indexed. The way I'm
> trying to create the index is:
>
> bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
>
> If I put a subdirectory of the main site in root_urls.txt, Nutch
> finishes quickly, but I cannot run it again with another subdirectory.
> It says the data directory is a directory ...
> Exception in thread "main" java.io.FileNotFoundException:
> /mnt/storage/app_data/nutch-data/site1 (Is a directory)
>
> Any help is really appreciated.
>
> Thanks,
>
> bill
