Hi Ken! Thanks for the short conversation regarding the project. You've more or less confirmed some of our initial thoughts and research.
Regards,
-Bruce Douglas
[EMAIL PROTECTED]

-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Monday, August 28, 2006 12:31 PM
To: [email protected]
Subject: Re: processing parallel sites

>can nutch/lucene handle getting content from 100s-1000s of pages
>simultaneously....

If you mean "Can Nutch handle fetching 1000s of pages at a time", the answer is yes. If you mean "Can Lucene, when used as the IR engine for Nutch, handle searching 1000s of pages at a time", the answer is also yes.

>if it can, how does it write the content to the resulting/output db.
>
>does it actually perform 100s-1000s of simultaneous connections to a backend
>db.
>
>does it utilize writing the output files to a filesystem, which is then
>somehow inserted into a db....

The fetched pages get written to a sequential file. After a fetch cycle, additional data about the page state gets processed, and the results are used to update the crawldb, which is a kind of specialized database for web crawling.

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
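[Editor's note: the pattern Ken describes — fetch many pages concurrently, append the results to a sequential file on the filesystem, and only afterward batch-update the crawl database — can be sketched as below. This is an illustrative Python sketch of that write pattern, not Nutch's actual implementation; the `fetch` stub, the JSON segment format, and the dict-based crawldb are all simplifying assumptions.]

```python
import json
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real HTTP fetch (hypothetical; Nutch uses its own
    # fetcher threads and protocols). Returns the page "content".
    return f"<html>content of {url}</html>"

def run_fetch_cycle(urls, segment_path, crawldb):
    # 1. Fetch many pages concurrently -- the concurrency is in the
    #    fetching, not in connections to any backend database.
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(lambda u: (u, fetch(u)), urls))

    # 2. Append every fetched page to a sequential segment file.
    #    No per-page database writes happen during the fetch itself.
    with open(segment_path, "a", encoding="utf-8") as seg:
        for url, content in results:
            seg.write(json.dumps({"url": url, "content": content}) + "\n")

    # 3. After the cycle completes, batch-update the crawldb with the
    #    new state of each page.
    for url, _ in results:
        crawldb[url] = {"status": "fetched"}
    return crawldb
```

The point of the sketch is the separation of phases: hundreds of simultaneous fetches funnel into one sequential file write, and the crawldb update happens once per cycle rather than once per connection.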
