Re: [Nutch-general] processing parallel sites

bruce Tue, 29 Aug 2006 09:03:45 -0700

Hi Ken!!

Thanks for the short conversation regarding the project. You've more or less
confirmed some of our initial thoughts and research.


Regards,

-Bruce Douglas
[EMAIL PROTECTED]



-----Original Message-----
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Monday, August 28, 2006 12:31 PM
To: [email protected]
Subject: Re: processing parallel sites


>can nutch/lucene handle getting content from 100s-1000s of pages
>simultaneously....

If you mean "Can Nutch handle fetching 1000s of pages at a time", the
answer is yes.

If you mean "Can Lucene, when used as the IR engine for Nutch, handle
searching 1000s of pages at a time", then the answer is also yes.

>if it can, how does it write the content to the resulting/output db.
>
>does it actually perform 100s-1000s of simultaneous connections to a
>backend db.
>
>does it utilize writing the output files to a filesystem, which is then
>somehow inserted into a db....

The fetched pages get written to a sequential file. After a fetch
cycle, additional data about the page state gets processed, and the
results are used to update the crawldb, which is a kind of
specialized database for web crawling.
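
As a rough illustration of that flow (this is not Nutch's actual code or
API), the sketch below has many fetch threads hand their results to a single
writer that appends records to one file in sequence, and only after the cycle
finishes is the crawl state updated in one batch. The names fetch,
run_fetch_cycle, and update_crawldb, the tab-separated record layout, and the
dict standing in for the crawldb are all assumptions made for the example.

```python
import queue
import threading


def fetch(url):
    # Hypothetical stand-in for a real HTTP fetch; no network access here.
    return f"<html>content of {url}</html>"


def run_fetch_cycle(urls, segment_path, num_threads=8):
    """Fetch urls in parallel; a single writer appends records sequentially."""
    todo = queue.Queue()
    done = queue.Queue()
    for url in urls:
        todo.put(url)

    def worker():
        # Each worker pulls URLs until the work queue is empty.
        while True:
            try:
                url = todo.get_nowait()
            except queue.Empty:
                return
            done.put((url, fetch(url)))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()

    statuses = {}
    # Only this thread touches the output file, so there are no concurrent
    # writes and no per-page database connections.
    with open(segment_path, "w", encoding="utf-8") as out:
        for _ in range(len(urls)):
            url, content = done.get()
            out.write(f"{url}\t{content}\n")
            statuses[url] = "fetched"

    for t in threads:
        t.join()
    return statuses


def update_crawldb(crawldb, statuses):
    # Batch update after the fetch cycle; the dict is a toy crawldb.
    for url, status in statuses.items():
        crawldb[url] = status
    return crawldb
```

The point of the design is that parallelism lives in the fetching, while the
output is a plain sequential write followed by one batch update.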

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
