Byron, I'm limited by bandwidth (10BASE-T), CPU (2x 1GHz PIII), and RAM (1GB). It's also running on two drives -- one EIDE for the OS, the other for Nutch data. Both drives are on the same channel. This is an old test box that I borrowed to set up Nutch. Unfortunately, I won't be able to move it to a faster network connection for a while. Anyway, it's better than my _old_ PII 450MHz play machine. :)
The PDFs are on another machine, which has no problem serving them as fast as the network will allow. I'm not concerned about hammering the indexing machine as it's only me on it at this point.

Can you give an example of how to split the run into multiple segments and then index them? Say you have http://site1/dir1 and http://site1/dir2 that you'd like to index as multiple segments. How would you fetch and index them so they are all searchable by one app at the end?

Any info I can get / figure out I'll gladly write up as a quickstart for Nutch newbies like me.

Thanks,
bill

-----Original Message-----
From: Byron Miller [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 06, 2005 8:36 AM
To: [email protected]
Subject: Re: Quickstart to indexing/searching large site

Just split your run into multiple segments of smaller sizes, so that as they are done you can index them and search them.

BTW, I have a P4 with 2 gigs RAM and 4x120gig SATA drives, and I can fetch 250k PDFs in an hour or so. What would take 5 days for you to run? Limited bandwidth? I guess in your case, if everything is on a single host, you're probably only processing a few docs at a time to keep from hammering your server :)

-byron

-----Original Message-----
From: "Hauck, William B." <[EMAIL PROTECTED]>
To: [email protected]
Date: Tue, 5 Apr 2005 15:59:40 -0400
Subject: Quickstart to indexing/searching large site

> Hi.
>
> I'm very new to Nutch so please forgive me if this seems simple. Please
> also note that I'm a Java newbie (go Perl!).
>
> I'm trying to index 250,000 PDFs (roughly 40GB). I estimate the
> initial crawl to take 5-10 days (another, commercial product took 5
> days to index this collection). The issue I have is that I'd like to
> have part of the index available while the remainder of the document
> collection is being fetched, analyzed, and indexed.
> The way I'm trying to create the index is:
>
>   bin/nutch crawl conf/root_urls.txt -dir /mnt/storage/app_data/nutch-data/site1
>
> If I put a subdirectory of the main site in root_urls.txt, Nutch
> finishes quickly, but I cannot run it again with another subdirectory.
> It says the data directory is a directory ...
>
>   Exception in thread "main" java.io.FileNotFoundException:
>   /mnt/storage/app_data/nutch-data/site1 (Is a directory)
>
> Any help is really appreciated.
>
> Thanks,
>
> bill
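[Archive note] The segment-at-a-time approach Byron suggests roughly follows the "whole-web" crawl steps from the Nutch tutorial of that era, rather than the one-shot `bin/nutch crawl` command (which refuses to reuse an existing data directory, hence the "Is a directory" error). The sketch below is an assumption-laden illustration, not a tested recipe: the exact sub-command names and flags (`admin`, `inject -urlfile`, `generate -topN`, `dedup`) varied between 0.x releases, so check the usage output of `bin/nutch` for your version.

```shell
#!/bin/sh
# Sketch: incremental, segment-by-segment crawl (assumed Nutch 0.6-era CLI).
# Paths below are hypothetical and match the thread's example layout.

NUTCH_DATA=/mnt/storage/app_data/nutch-data/site1

# One-time setup: create the WebDB and seed it with the start URLs.
bin/nutch admin $NUTCH_DATA/db -create
echo "http://site1/dir1" >  urls.txt
echo "http://site1/dir2" >> urls.txt
bin/nutch inject $NUTCH_DATA/db -urlfile urls.txt

# Repeat this loop as often as you like; each pass creates one new
# segment that becomes searchable as soon as it is indexed.
for pass in 1 2 3; do
    # Select the next batch of unfetched URLs into a fresh segment.
    bin/nutch generate $NUTCH_DATA/db $NUTCH_DATA/segments -topN 1000
    segment=`ls -d $NUTCH_DATA/segments/* | tail -1`

    bin/nutch fetch $segment                       # fetch pages/PDFs
    bin/nutch updatedb $NUTCH_DATA/db $segment     # feed new links back into the WebDB
    bin/nutch index $segment                       # build this segment's index
done

# Optional: remove duplicate documents across segment indexes.
bin/nutch dedup $NUTCH_DATA/segments dedup.tmp
```

If the search web app's `searcher.dir` property points at the directory containing `segments/`, it searches across all indexed segments, so one app serves results incrementally while later segments are still being fetched.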
