Crawler/Fetcher Questions

Ian Reardon Fri, 20 May 2005 05:45:04 -0700

I've noticed a few things about Nutch that puzzle me.

When I just run "nutch crawl" and give it a directory, it creates 3
folders under the root: "db", "index" and "segments".


On the other hand, I can build a root directory by hand:

-Make 2 folders inside it, "segments" and "db"
-Create an empty web db
-Copy my segments from an existing crawl into the new segments folder
-Run updatedb
-Run index on those newly copied segments
(I've been using this method to combine multiple crawls of single
sites into 1 repository.)
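
For reference, the filesystem part of those steps (making the two folders and copying the segments) could be sketched roughly as below. This is only an illustration: the function name assemble_repository and the folder names in the usage note are made up, and it does not run any Nutch command, so creating the empty web db, updatedb and index still have to be run separately afterwards.

```python
import shutil
from pathlib import Path


def assemble_repository(root, crawl_dirs):
    """Create root/db and root/segments, then copy every segment found
    under each crawl's "segments" folder into root/segments.

    Returns the list of copied segment paths. No Nutch command is run here.
    """
    root = Path(root)
    db_dir = root / "db"
    segments_dir = root / "segments"
    db_dir.mkdir(parents=True, exist_ok=True)
    segments_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for crawl in crawl_dirs:
        source = Path(crawl) / "segments"
        # Each sub-folder of "segments" is treated as one segment.
        for segment in sorted(p for p in source.iterdir() if p.is_dir()):
            target = segments_dir / segment.name
            # copytree raises FileExistsError if two crawls share a segment name,
            # so an existing segment is never silently overwritten.
            shutil.copytree(segment, target)
            copied.append(target)
    return copied
```

Calling assemble_repository("combined", ["crawl_a", "crawl_b"]) would leave "combined/db" empty and "combined/segments" holding a copy of every segment from both crawls.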

When I do that, it seems to work fine, but I do not have an "index"
folder like the one "nutch crawl" makes.  What is the index folder?  Is
it OK that I don't have it?  Everything appears to be working.


My 2nd question, which is not as important:

I've been tracking the size of the folders containing the crawls I'm
doing.  They seem to grow to, say, 20 megs, then drop to 2 megs and
slowly grow again.  Where is this drastic reduction coming from?  I
just hope I am not losing documents.
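
One way to make this kind of tracking more precise is to record the number of files alongside the total size, so a drop in bytes can be compared against a drop in file count. The sketch below is a generic directory walk, not anything Nutch-specific; dir_stats is a made-up helper name, and it only measures what is on disk, so it cannot tell how many documents a segment holds.

```python
import os


def dir_stats(path):
    """Return (total_bytes, file_count) for all regular files under path."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if os.path.isfile(full):
                    total_bytes += os.path.getsize(full)
                    file_count += 1
            except OSError:
                # A file may disappear while a crawl is still writing; skip it.
                continue
    return total_bytes, file_count
```

Logging the result of dir_stats on the crawl folder at regular intervals would show whether a size drop coincides with files being removed or with existing files shrinking.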

Thanks in advance.
