Otis,

I'd love to do it in Perl, but I don't really have a choice in the
matter.  Java is in favor so Java it is.

Thanks, though.

bill 

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, April 05, 2005 10:30 PM
To: [email protected]
Subject: Re: [Nutch-general] Quickstart to indexing/searching large site

Sorry, not a direct answer, but since you said "go Perl!", you may find
it easier to do everything in Perl.  LWP & friends can be used to put
together a little web crawler, and Plucene or KinoSearch can be used for
searching.

Here are some Plucene links:
http://www.simpy.com/simpy/User.do?username=otis&q=%2Bplucene

Otis

--- "Hauck, William B." <[EMAIL PROTECTED]> wrote:
> Hi.
> 
> I'm very new to Nutch so please forgive me if this seems simple.  Please
> also note that I'm a Java newbie (go Perl!).
> 
> I'm trying to index 250,000 pdfs (roughly 40GB).  I estimate the 
> initial crawl to take 5-10 days (another, commercial product took 5 
> days to index this collection.)  The issue I have is that I'd like to 
> have part of the index available while the remainder of the document 
> collection is being fetched, analyzed, and indexed.  The way I'm 
> trying to create the index is:
> 
> bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
> 
> If I put a subdirectory of the main site in root_urls.txt, Nutch 
> finishes quickly, but I cannot run it again with another subdirectory.
> It complains that the data directory is a directory:
> Exception in thread "main" java.io.FileNotFoundException:
> /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> 
> Any help is really appreciated.
> 
> Thanks,
> 
> bill