Hauck,
 
I am having the same problem, and I have searched all over the net. If someone
would just post their shell script, I would have a MUCH better idea of how
things run and in what order. The tutorial is OK, and I got the intranet crawl
to work, which was awesome. I just want to understand how to crawl specific
sites on the internet on a monthly basis and update the db. Anyone's help would
be appreciated.
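
For instance, if a monthly recrawl just means re-running the
generate/fetch/updatedb/index loop (like the one Byron describes further down
this thread) against the existing db, is it simply a matter of cron-ing a
wrapper script? Something like this is my guess -- the script name and path
are made up, and I haven't tested any of it:

0 2 1 * *  /opt/nutch/bin/recrawl.sh   # hypothetical: 02:00 on the 1st of each month

where recrawl.sh runs the generate/fetch/updatedb/index steps. If someone who
actually runs this could confirm or correct it, that would be great.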

"Hauck, William B." <[EMAIL PROTECTED]> wrote:
Byron,

I'm limited by bandwidth (10BASE-T), CPU (2x 1 GHz PIII), and RAM (1 GB).
It's also running on two drives: one EIDE for the OS, the other for Nutch
data. Both drives are on the same channel. This is an old test box
that I borrowed to set up Nutch. Unfortunately, I won't be able to move
it to a faster network connection for a while. Anyway, it's better
than my _old_ PII 450 MHz play machine. :)

The PDFs are on another machine, which has no problem serving them as
fast as the network will allow. I'm not worried about hammering the
indexing machine since I'm the only one using it at this point.

Can you give an example of how to split the run into multiple segments
and then index them? Say you have http://site1/dir1 and
http://site1/dir2 that you'd like to index as separate segments. How
would you fetch and index them so they are all searchable by one app at
the end?

Whatever info I can get or figure out, I'll gladly write up as a
quickstart for Nutch newbies like me.

Thanks,

bill

-----Original Message-----
From: Byron Miller [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 06, 2005 8:36 AM
To: [email protected]
Subject: Re: Quickstart to indexing/searching large site

Just split your run into multiple smaller segments; as each one finishes,
you can index it and start searching it.
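
For example, skip the one-shot "bin/nutch crawl" command and drive the
individual steps yourself, so each generate/fetch round produces its own
segment. This is a sketch from memory of the tutorial commands, so verify
the flags against your version:

bin/nutch admin db -create                  # once: create the web db
bin/nutch inject db -urlfile urls.txt       # seed with site1/dir1 and site1/dir2
# repeat the next five commands; each pass yields one new segment
bin/nutch generate db segments -topN 10000  # cap the segment size
s=`ls -d segments/2* | tail -1`             # the segment generate just created
bin/nutch fetch $s
bin/nutch updatedb db $s                    # fold newly found links into the db
bin/nutch index $s                          # index just this segment
bin/nutch dedup segments dedup.tmp          # de-dupe across all segments

Then point the webapp's searcher.dir property at the directory containing
segments/; as far as I know it searches every indexed segment under it, so
finished segments stay searchable while later ones are still fetching.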

BTW, I have a P4 with 2 GB of RAM and 4x 120 GB SATA drives, and I can
fetch 250k PDFs in an hour or so. What is it that would take you 5 days
to run? Limited bandwidth? I guess in your case, with everything on a
single host, you're probably only processing a few docs at a time to
keep from hammering your server :)
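
That politeness throttle is tunable, by the way. If I'm remembering the
property names right, overriding them in conf/nutch-site.xml looks roughly
like this (check nutch-default.xml for the real names and defaults before
trusting me):

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>  <!-- seconds between requests to the same host -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>   <!-- total concurrent fetcher threads -->
</property>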

-byron

-----Original Message-----
From: "Hauck, William B." 
To: [email protected]
Date: Tue, 5 Apr 2005 15:59:40 -0400
Subject: Quickstart to indexing/searching large site

> Hi.
> 
> I'm very new to Nutch, so please forgive me if this seems simple.
> Please also note that I'm a Java newbie (go Perl!).
> 
> I'm trying to index 250,000 PDFs (roughly 40 GB). I estimate the
> initial crawl will take 5-10 days (another, commercial product took 5
> days to index this collection). The issue is that I'd like to have
> part of the index available while the remainder of the document
> collection is being fetched, analyzed, and indexed. The way I'm
> trying to create the index is:
> 
> bin/nutch crawl conf/root_urls.txt -dir
> /mnt/storage/app_data/nutch-data/site1
> 
> If I put a subdirectory of the main site in root_urls.txt, Nutch
> finishes quickly, but I cannot run it again with another subdirectory.
> It complains that the data directory already exists:
> Exception in thread "main" java.io.FileNotFoundException:
> /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> 
> Any help is really appreciated.
> 
> Thanks,
> 
> bill