With your setup, I can see how you would run into issues with the
scale of the project you are working on.

What you may want to do is run a script/program (there are many) that
will list every PDF document as a full URL, and then inject that list
into your db.
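
For example, if you have shell access on the box serving the PDFs,
something like this would build the list (the docroot path here is a
placeholder, and site1 stands in for your real hostname):

# Turn every PDF under the docroot into a full URL, one per line.
cd /var/www/docs
find . -name '*.pdf' | sed 's|^\./|http://site1/|' > urls.txt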

This way, if you have, say, 5,000 documents (as URLs), you can run

bin/nutch generate -topN 5000 -numFetchers 5

(to grab all of them) and get 5 segments of roughly 1,000 docs apiece.
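
Spelled out with the web db and segments directories (0.6-era syntax;
the db and segments names are just the conventional examples):

# Create the web db, inject the full-URL list, then generate
# 5 fetchlist segments of roughly 1,000 URLs each.
bin/nutch admin db -create
bin/nutch inject db -urlfile urls.txt
bin/nutch generate db segments -topN 5000 -numFetchers 5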

This way you then run

bin/nutch fetch segments/firstsegmentname
bin/nutch updatedb db segments/firstsegmentname
bin/nutch index segments/firstsegmentname

and so on and so forth until you have indexed all of your segments.
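
If you'd rather not type that for every segment, a small shell loop
over the same layout does it:

# Fetch, update the web db, and index each generated segment in turn.
for seg in segments/*; do
  bin/nutch fetch $seg
  bin/nutch updatedb db $seg
  bin/nutch index $seg
done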

Once you have fetched all the documents, you can merge your segments
and your indexes and rerun the link analysis to redo the ranking and
such.  You should only have this scaling problem on the first batch,
since from this point on each doc is a full URL and not something to
be spidered from a top level on down.
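
A rough sketch of that final pass (the dedup temp dir and output
index name are just examples; run bin/nutch with no arguments to see
the exact usage for your version):

# Delete duplicate docs across the per-segment indexes, then merge
# them all into the single index the webapp searches.
bin/nutch dedup segments dedup.tmp
bin/nutch merge index segments/*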

Make sense?

:)


-----Original Message-----
From: "Hauck, William B." <[EMAIL PROTECTED]>
To: [email protected]
Date: Wed, 6 Apr 2005 14:26:56 -0400
Subject: RE: Quickstart to indexing/searching large site

> Byron,
> 
> I'm limited by bandwidth (10bT), cpu (2x 1GHz PIII), and RAM (1GB).
> It's also running on two drives--1 EIDE for OS, the other for Nutch
> data.  Both drives are on the same channel.  This is an old test box
> that I borrowed to set up Nutch.  Unfortunately, I won't be able to move
> it to a faster network connection for a while.  Anyway, it's better
> than my _old_ PII 450MHz play machine.  :)
> 
> The PDFs are on another machine which has no problem serving them as
> fast as the network will allow.  I'm not concerned about hammering the
> indexing machine as it's only me on it at this point.
> 
> Can you give an example of how to split the run into multiple segments
> and then indexing them?  Say you have http://site1/dir1 and
> http://site1/dir2 that you'd like to index as multiple segments.  How
> would you fetch and index them so they are all searchable by one app at
> the end?
> 
> Any info I can get / figure out I'll gladly write up as a quickstart
> for Nutch Newbies like me.
> 
> Thanks,
> 
> bill
> 
> -----Original Message-----
> From: Byron Miller [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, April 06, 2005 8:36 AM
> To: [email protected]
> Subject: Re: Quickstart to indexing/searching large site
> 
> Just split your run into multiple smaller segments so that as each
> one is done you can index it and search it.
> 
> BTW, I have a P4 with 2 gigs of RAM and 4x120GB SATA drives, and I
> can fetch 250k PDFs in an hour or so.  What would take 5 days for you
> to run?  Limited bandwidth?  I guess in your case, if everything is
> on a single host, you're probably only processing a few docs at a
> time to keep from hammering your server :)
> 
> -byron
> 
> -----Original Message-----
> From: "Hauck, William B." <[EMAIL PROTECTED]>
> To: [email protected]
> Date: Tue, 5 Apr 2005 15:59:40 -0400
> Subject: Quickstart to indexing/searching large site
> 
> > Hi.
> > 
> > I'm very new to Nutch so please forgive me if this seems simple.
> > Please also note that I'm a Java newbie (go Perl!).
> > 
> > I'm trying to index 250,000 PDFs (roughly 40GB).  I estimate the
> > initial crawl to take 5-10 days (another, commercial product took 5
> > days to index this collection).  The issue I have is that I'd like to
> > have part of the index available while the remainder of the document 
> > collection is being fetched, analyzed, and indexed.  The way I'm 
> > trying to create the index is:
> > 
> > bin/nutch crawl conf/root_urls.txt -dir
> > /mnt/storage/app_data/nutch-data/site1
> > 
> > If I put a subdirectory of the main site in root_urls.txt, Nutch
> > finishes quickly, but I cannot run it again with another
> > subdirectory.  It says the data directory is a directory ...
> > Exception in thread "main" java.io.FileNotFoundException:
> > /mnt/storage/app_data/nutch-data/site1 (Is a directory)
> > 
> > Any help is really appreciated.
> > 
> > Thanks,
> > 
> > bill
> > 
> 
