The index folder is created when you merge the per-segment indexes - it's not
strictly needed, but it usually enhances performance.  "nutch crawl" is
probably merging the indexes automagically, while your manual process won't.
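
If you want that index folder with your manual process, the missing step is
merging the per-segment indexes into one top-level index.  Something like the
line below should do it - this is a rough sketch from memory, so double-check
the exact syntax for your nutch version (running bin/nutch with no arguments
prints the usage):

  # Merge the per-segment Lucene indexes into a single top-level "index" dir.
  # Sketch only - command name and argument order may differ between versions.
  bin/nutch merge index segments/*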

During crawl/segment creation and indexing, tons of files get created; the
optimize step goes through afterwards and cleans this up, which is where the
big drop in folder size comes from.  You aren't losing documents - the many
small files are just being merged into fewer, larger ones.
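
If you want to convince yourself nothing is disappearing, just watch the
directory sizes while a crawl runs; the drop lines up with the optimize, not
with documents going away.  Something like:

  # Plain du, nothing nutch-specific - run it periodically from the crawl
  # root directory to watch the size fluctuation.
  du -sh db segments segments/*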

-byron

-----Original Message-----
From: Ian Reardon <[EMAIL PROTECTED]>
To: [email protected]
Date: Fri, 20 May 2005 08:44:45 -0400
Subject: Crawler/Fetcher Questions

> I've noticed a few things that I'm puzzled about with nutch.
> 
> When I just do a "nutch crawl" and give it a directory, it creates 3
> folders off the root: "db", "index" and "segments".
> 
> On the other hand, if I just create a root directory by hand:
> 
> -Make 2 folders inside it, "segments" and "db"
> -Create an empty web db
> -Copy my segments from an existing crawl into the new segments folder
> -Run updatedb
> -Run index on those newly copied segments
> (I've been using this method to combine multiple crawls of single
> sites into 1 repository)
> 
> It seems to work fine, but I do not have an "index" folder like the
> one "nutch crawl" makes.  What is the index folder?  Is it OK that I
> don't have it?  Everything appears to be working.
> 
> 
> Second question, which is not as important:
> 
> I've been tracking the size of the folders containing the crawls I'm
> doing.  It seems like they go up to, say, 20 megs, then drop down to
> 2 megs and slowly grow again.  Where is this drastic reduction coming
> from?  I just hope I am not losing documents.
> 
> Thanks in advance.
> 


