Your wish is my command, well, sort of. Check out http://www.budget-ha.com/nutch/crawl/
It's a work in progress, but may help you; a rough sketch of the same generate/fetch/update loop is also pasted below the quoted thread. -Pete

On Apr 7, 2005 1:54 AM, Chris Edwards <[EMAIL PROTECTED]> wrote:
> Hauck,
>
> I am having the same problem. I have searched all over the net. If someone
> would just post their shell script I would have a MUCH better idea of how
> things run and in what order. The tutorial is OK, and I got the intranet crawl
> to work, which was awesome. I just want to understand how to crawl specific
> sites on the internet on a monthly basis and update the db. Anyone's help
> would be appreciated.
>
> "Hauck, William B." <[EMAIL PROTECTED]> wrote:
> Byron,
>
> I'm limited by bandwidth (10bT), CPU (2x 1 GHz PIII), and RAM (1 GB).
> It's also running on two drives: one EIDE drive for the OS, the other for
> Nutch data. Both drives are on the same channel. This is an old test box
> that I borrowed to set up Nutch. Unfortunately, I won't be able to move
> it to a faster network connection for a while. Anyway, it's better
> than my _old_ PII 450 MHz play machine. :)
>
> The PDFs are on another machine which has no problem serving them as
> fast as the network will allow. I'm not concerned about hammering the
> indexing machine as it's only me on it at this point.
>
> Can you give an example of how to split the run into multiple segments
> and then index them? Say you have http://site1/dir1 and
> http://site1/dir2 that you'd like to index as multiple segments. How
> would you fetch and index them so they are all searchable by one app at
> the end?
>
> Any info I can get / figure out I'll gladly write up as a quickstart for
> Nutch newbies like me.
>
> Thanks,
>
> bill
>
> -----Original Message-----
> From: Byron Miller [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 06, 2005 8:36 AM
> To: [email protected]
> Subject: Re: Quickstart to indexing/searching large site
>
> Just split your run into multiple segments of smaller sizes, so that as they
> are done you can index them and search them.
>
> BTW, I have a P4 with 2 GB of RAM and 4x 120 GB SATA drives, and I can
> fetch 250k PDFs in an hour or so. What would take 5 days for you to run?
> Limited bandwidth? I guess in your case, if everything is on a single
> host, you're probably only processing a few docs at a time to keep from
> hammering your server. :)
>
> -byron
>
> -----Original Message-----
> From: "Hauck, William B."
> To: [email protected]
> Date: Tue, 5 Apr 2005 15:59:40 -0400
> Subject: Quickstart to indexing/searching large site
>
> > Hi.
> >
> > I'm very new to Nutch, so please forgive me if this seems simple.
> > Please also note that I'm a Java newbie (go Perl!).
> >
> > I'm trying to index 250,000 PDFs (roughly 40 GB). I estimate the
> > initial crawl to take 5-10 days (another, commercial product took 5
> > days to index this collection). The issue I have is that I'd like to
> > have part of the index available while the remainder of the document
> > collection is being fetched, analyzed, and indexed. The way I'm
> > trying to create the index is:
> >
> > bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
> >
> > If I put a subdirectory of the main site in root_urls.txt, Nutch
> > finishes quickly, but I cannot run it again with another subdirectory.
> > It says the data directory is a directory ...
> >
> > Exception in thread "main" java.io.FileNotFoundException:
> > /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> >
> > Any help is really appreciated.
> >
> > Thanks,
> >
> > bill
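
For anyone who wants the script itself: here is a minimal sketch of the segment-by-segment loop Byron describes, using the whole-web commands from the Nutch tutorial of the 0.6/0.7 era (admin, inject, generate, fetch, updatedb, index, dedup). The data directory, seed file, and number of rounds are placeholders, not Bill's actual setup; run bin/nutch with no arguments to confirm the exact tool names in your version.

#!/bin/sh
# Sketch only: segment-by-segment whole-web crawl, one searchable segment per round.
# Paths, seed file, and round count are placeholders; adjust them for your setup.

DB=/mnt/storage/app_data/nutch-data/site1/db
SEGS=/mnt/storage/app_data/nutch-data/site1/segments
URLS=conf/root_urls.txt
ROUNDS=5

# One-time setup: create the WebDB and inject the seed URLs.
if [ ! -d "$DB" ]; then
    bin/nutch admin "$DB" -create
    bin/nutch inject "$DB" -urlfile "$URLS"
fi

i=1
while [ "$i" -le "$ROUNDS" ]; do
    # Generate a fetchlist; this creates a new timestamped segment under $SEGS.
    bin/nutch generate "$DB" "$SEGS"
    segment=`ls -d "$SEGS"/2* | tail -1`

    # Fetch the segment, fold the discovered links back into the WebDB,
    # and index the segment so it is searchable right away.
    bin/nutch fetch "$segment"
    bin/nutch updatedb "$DB" "$segment"
    bin/nutch index "$segment"

    # Drop duplicate pages across all segments indexed so far.
    bin/nutch dedup "$SEGS" dedup.tmp

    i=`expr $i + 1`
done

Point the search webapp at the directory holding db and segments (the searcher.dir property), and each segment should become searchable as soon as its index and dedup steps finish, while later rounds are still fetching; that is the "partial index while the rest is still crawling" behavior Bill is after. For the monthly re-crawl Chris asked about, keep the same WebDB and re-run the generate/fetch/updatedb/index part of the loop from cron. Generate should also accept a -topN option to cap the size of each fetchlist, but check bin/nutch generate in your build before relying on it.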
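On the "Is a directory" exception from the one-shot crawl command: that tool builds its db, segments, and index from scratch under the -dir you give it, so pointing a second run at a directory that already exists fails. If you would rather stick with the crawl tool than the loop above, one workaround is to give every run its own fresh directory (the path and -depth value below are only examples):

# Each run writes a brand-new crawl directory, e.g. site1-20050407.
bin/nutch crawl conf/root_urls.txt \
    -dir /mnt/storage/app_data/nutch-data/site1-`date +%Y%m%d` \
    -depth 3

The catch is that each such directory is a separate, self-contained index, so you would either search them independently or merge them afterwards; if the goal is one collection that keeps growing month by month, the whole-web loop above is the better fit.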
