Your wish is my command, well, sort of.

Check out http://www.budget-ha.com/nutch/crawl/

It's a work in progress, but it may help you.
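
In outline it's just the whole-web tutorial steps run on a schedule.
Very rough sketch (untested; the paths are made up and the flag names
follow the 0.6/0.7 tutorial, so double-check them against your version):

  #!/bin/sh
  # Untested sketch of a monthly re-crawl; adjust paths and flags to your setup.
  DB=/path/to/nutch-data/db              # web database (made-up path)
  SEGS=/path/to/nutch-data/segments      # segments go here (made-up path)

  # One-time setup:
  #   bin/nutch admin $DB -create
  #   bin/nutch inject $DB -urlfile conf/root_urls.txt

  # Run the rest from cron once a month:
  bin/nutch generate $DB $SEGS           # select due URLs into a new segment
  seg=`ls -d $SEGS/2* | tail -1`         # newest segment directory
  bin/nutch fetch $seg                   # fetch the pages
  bin/nutch updatedb $DB $seg            # fold results and new links into the db
  bin/nutch index $seg                   # index the segment
  bin/nutch dedup $SEGS dedup.tmp        # drop duplicate docs across segments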

-Pete

On Apr 7, 2005 1:54 AM, Chris Edwards <[EMAIL PROTECTED]> wrote:
> Hauck,
> 
> I am having the same problem.  I have searched all over the net.  If someone
> would just post their shell script, I would have a MUCH better idea of how
> things run and in what order.  The tutorial is OK; I got the intranet crawl
> to work, which was awesome.  I just want to understand how to crawl specific
> sites on the internet on a monthly basis and update the db.  Anyone's help
> would be appreciated.
> 
> "Hauck, William B." <[EMAIL PROTECTED]> wrote:
> Byron,
> 
> I'm limited by bandwidth (10bT), CPU (2x 1 GHz PIII), and RAM (1 GB).
> It's also running on two drives: one EIDE for the OS, the other for Nutch
> data. Both drives are on the same channel. This is an old test box
> that I borrowed to set up Nutch. Unfortunately, I won't be able to move
> it to a faster network connection for a while. Anyway, it's better
> than my _old_ PII 450 MHz play machine. :)
> 
> The PDFs are on another machine which has no problem serving them as
> fast as the network will allow. I'm not concerned about hammering the
> indexing machine as it's only me on it at this point.
> 
> Can you give an example of how to split the run into multiple segments
> and then index them? Say you have http://site1/dir1 and
> http://site1/dir2 that you'd like to index as multiple segments. How
> would you fetch and index them so they are all searchable by one app at
> the end?
> 
> Any info I can get / figure out I'll gladly write up as a quickstart for
> Nutch Newbies like me.
> 
> Thanks,
> 
> bill
> 
> -----Original Message-----
> From: Byron Miller [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 06, 2005 8:36 AM
> To: [email protected]
> Subject: Re: Quickstart to indexing/searching large site
> 
> Just split your run into multiple smaller segments so that, as each one
> is done, you can index it and search it.
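> 
> A rough sketch of one pass (untested; the -topN number is made up and
> the commands follow the whole-web tutorial, so check them against your
> version):
> 
>   bin/nutch generate db segments -topN 20000  # cap this segment's size
>   s=`ls -d segments/2* | tail -1`             # the segment just created
>   bin/nutch fetch $s
>   bin/nutch updatedb db $s
>   bin/nutch index $s                          # this chunk is now searchable
>   bin/nutch dedup segments dedup.tmp
> 
> Repeat until everything has been fetched. If I remember right, the
> search webapp will use the per-segment indexes under the segments
> directory (or a single merged index if you build one), so one app can
> search all of the segments together.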
> 
> BTW, I have a P4 with 2 GB of RAM and 4x 120 GB SATA drives, and I can
> fetch 250k PDFs in an hour or so. What would take 5 days for you to
> run? Limited bandwidth? I guess in your case, if everything is on a
> single host, you're probably only processing a few docs at a time to
> keep from hammering your server :)
> 
> -byron
> 
> -----Original Message-----
> From: "Hauck, William B."
> To: [email protected]
> Date: Tue, 5 Apr 2005 15:59:40 -0400
> Subject: Quickstart to indexing/searching large site
> 
> > Hi.
> >
> > I'm very new to Nutch, so please forgive me if this seems simple.
> > Please
> > also note that I'm a Java newbie (go Perl!).
> >
> > I'm trying to index 250,000 PDFs (roughly 40 GB). I estimate the
> > initial crawl to take 5-10 days (another, commercial product took 5
> > days to index this collection). The issue I have is that I'd like to
> > have part of the index available while the remainder of the document
> > collection is being fetched, analyzed, and indexed. The way I'm
> > trying to create the index is:
> >
> > bin/nutch crawl conf/root_urls.txt -dir
> > /mnt/storage/app_data/nutch-data/site1
> >
> > If I put a subdirectory of the main site in root_urls.txt, Nutch
> > finishes quickly, but I cannot run it again with another subdirectory.
> > It says the data directory is a directory ...
> > Exception in thread "main" java.io.FileNotFoundException:
> > /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> >
> > Any help is really appreciated.
> >
> > Thanks,
> >
> > bill
> >
> 
>
