Hi Byron,

Thanks a lot for your clarifications. I have spent some time
understanding the Fetcher code, and now I need to understand how I
can crawl an initial set of URLs and then handle two kinds of
re-fetching (my current understanding of one such round is sketched
after this list):
        · URLs that are due to be fetched again (based on
db.default.fetch.interval), i.e. routine maintenance.
        · URLs newly discovered during the last fetch/re-fetch.
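
Based on the command in your mail below and the whole-web tutorial,
this is how I currently picture a single maintenance round (the exact
commands and flags are my assumption, so please correct me if I have
it wrong):

  # one re-fetch round, as I understand the current tools
  bin/nutch generate -refetchonly db segments   # fetchlist of URLs that are due
  s=`ls -d segments/2* | tail -1`               # pick the newest segment
  bin/nutch fetch $s                            # fetch the pages in that fetchlist
  bin/nutch updatedb db $s                      # fold results and new links into the WebDB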

Unfortunately I couldn't find documentation that explains all the
options I can use. Searching the forums also didn't help me much, as
I have seen people ask similar questions without getting clear
answers. In some cases the messages presented contradictory
information.

I will start running tests and looking at the code, but I assume it
will be difficult to track which URLs are being fetched, re-fetched
and added after a couple of rounds of re-fetching.

I will post my questions here in the hope that the good "Nutch"
people will help me understand some elements of the software before I
spend night hours digging through the code.

1. The tutorial for whole-web crawling suggests running generate,
fetch and updatedb a couple of times; I believe this is referred to
as the depth. Is there any document explaining the benefit of
crawling to different depths? I hope I'm using the right terminology
here.
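
To clarify what I mean, my reading of the tutorial is that each extra
round goes one link-level deeper, roughly like this (my own
paraphrase, not an official recipe, and the -topN value is just an
example):

  # one "depth" round of a whole-web crawl, repeated as many times as needed
  bin/nutch generate db segments -topN 1000   # select the top-scoring unfetched URLs
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s                    # newly discovered links feed the next round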

2. Where can I find descriptions of the bin/nutch generate options
(net.nutch.tools.FetchListTool)?
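
So far the only thing I have to go on is my assumption that, like
most of the command-line tools, FetchListTool prints a usage summary
when it is run without arguments:

  # assumption on my part: this should print the list of supported options
  bin/nutch generate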

3. What does the term "top pages" mean? Where can I find a
description of the scoring algorithm?

4. If I do not specify the -topN parameter for bin/nutch generate
-refetchonly, will all my URLs that are due be re-fetched?

5. I know there is another discussion (subject: Intranet crawl and
re-fetch), but it still seems we don't have a clear answer. Would
bin/nutch generate -refetchonly include new URLs (not yet fetched) in
the fetchlist?

6. If I initially have only one segment and keep running re-fetches
(re-fetching existing URLs and adding some new ones), I will
eventually create bigger and bigger fetchlists. Is there a known (or
configurable) maximum size for a segment's fetchlist, after which
bin/nutch generate will create an additional segment? If not, I
assume I can play with the -numFetchers parameter, provided I know
the number of URLs in the WebDB (see the sketch below).
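
Based on your earlier command, my assumption is that -numFetchers
splits the generated fetchlist into that many segments while -topN
caps the total number of URLs selected, so something like the
following would keep each segment around one million URLs (the
numbers are only an example):

  # assumption: -topN caps total URLs, -numFetchers splits them across segments
  bin/nutch generate -refetchonly db segments -numFetchers 5 -topN 5000000
  # should give roughly 1,000,000 URLs per segment, if I understand correctly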

7. Is there a suggested segment fetchlist size for best performance?
Is there a known maximum size beyond which performance degrades?

8. Where can I find disk usage figures for the WebDB and memory/CPU
usage figures for bin/nutch updatedb? I'm looking for something like:
for 1,000,000 documents the WebDB will take approximately XX GB, and
running bin/nutch updatedb on 1,000,000 documents will use up to XX
MB of RAM.

That's a lot of questions for busy people to answer, but I hope
somebody will drop a word.

Thanks,
Daniel. 


On 6/6/05, Byron Miller <[EMAIL PROTECTED]> wrote:
> Daniel,
> 
> Nutch doesn't do anything by itself, you have to initiate the refetch
> process by running something like:
> 
> bin/nutch generate -refetchonly db segments -numFetchers 30 -topN 30000000
> 
> 
> Something like that would do your refetch of the top 30 million documents
> and give you roughly 30 segments of 1 million +/- URLs in each segment.
> 
> You could then move these segments (or NFS-mount them) onto your spider
> boxes and fetch them concurrently (one segment per box or something)
> 
> machine 1: bin/nutch fetch segments/200505012345-0
> machine 2: bin/nutch fetch segments/200505012345-1
> 
> .... so on and so forth....
> 
> Hopefully, with the new stuff Doug is working on, "fetch/spider"
> boxes can have a rule they apply against the DB for constant
> fetching/updates without this much manual intervention.
> 
> -byron
