bruce wrote:
> hi...

> if it's ok, i've got some basic research questions.

> can someone tell me if there's a limit to the number of simultaneous
> websites that nutch/lucene can return...?
I assume you are asking about its indexing capacity. If so, the answer is billions of pages; it is limited pretty much only by hardware and bandwidth.
> i'm assuming nutch/lucene writes the text information from the crawl
> back to a db. can someone tell me if there's a limit to the number of pages
> that can be written to the db in a simultaneous manner...
The crawl process runs over a cluster of machines in parallel. Each fetcher grabs web pages in parallel with the others, and those pages are then reduced into a number of binary files called the crawl database. This is not a SQL database. While fetching can be massively parallel and, again, is limited only by hardware, writing the results into the crawl database usually happens on a single machine as a single job and is serial in nature.
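Very roughly, and only as an illustration (this is not Nutch's actual code; the class, method, and file names below are made up), the shape of that fetch-then-reduce step looks like this plain-Java sketch: many fetcher tasks run in parallel, then one serial writer folds everything into a single database file.

    // Illustration only -- not Nutch's actual code. Many fetcher tasks run in
    // parallel; a single writer then serially folds the results into one
    // "crawl db" file, mirroring parallel fetch + serial db update.
    import java.io.PrintWriter;
    import java.util.*;
    import java.util.concurrent.*;

    public class CrawlSketch {

        record FetchResult(String url, int status, String text) {}

        // Stand-in for a real fetcher: a real one would do an HTTP GET here.
        static FetchResult fetch(String url) {
            return new FetchResult(url, 200, "page text for " + url);
        }

        public static void main(String[] args) throws Exception {
            List<String> urls = List.of("http://example.com/a",
                                        "http://example.com/b",
                                        "http://example.com/c");

            // Parallel part: each submitted task plays the role of one fetcher.
            ExecutorService fetchers = Executors.newFixedThreadPool(4);
            List<Future<FetchResult>> pending = new ArrayList<>();
            for (String u : urls) {
                pending.add(fetchers.submit(() -> fetch(u)));
            }

            // Serial part: one writer, one file, one pass -- analogous to the
            // single job that updates the crawl database after fetching.
            try (PrintWriter db = new PrintWriter("crawldb.txt")) {
                for (Future<FetchResult> f : pending) {
                    FetchResult r = f.get();
                    db.println(r.url() + "\t" + r.status() + "\t" + r.text().length());
                }
            }
            fetchers.shutdown();
        }
    }

The point is just that the fetch side scales with however many fetcher slots you give it, while the final write into the crawl database is one sequential pass.
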
> from what i've seen, you can setup nutch/lucene to use multiple servers to
> do the search. how do these child servers go about adding their information
> from the crawl to the overall db....

Once the pages are fetched, they are processed for links and content and then go through an indexing step that creates binary index files in the Lucene format. This usually happens on a distributed file system; the index files are then copied to the local file system for searching. You would run multiple search servers to add search capacity, but those search servers never alter the indexes. Creation and manipulation of the indexes happens in batch MapReduce jobs well before the indexes are ever searched. Once created, an index is usually not changed again, just continually searched (i.e. read from disk). Multiple search servers aggregate their results before anything is returned to the user.
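That aggregation is conceptually just a merge of ranked lists. As an illustration only (this is not Nutch's distributed-search code; the Hit type and the per-server lists are invented for the example), a front end could do something like:

    // Illustration only -- not Nutch's distributed-search code. Each "server"
    // contributes its own ranked hits; the front end merges them and keeps the
    // global top-k before answering the user. The Hit type is made up.
    import java.util.*;

    public class SearchMergeSketch {

        record Hit(String url, float score) {}

        // Merge per-server result lists and keep the best k hits overall.
        static List<Hit> aggregate(List<List<Hit>> perServer, int k) {
            List<Hit> all = new ArrayList<>();
            for (List<Hit> hits : perServer) {
                all.addAll(hits);
            }
            all.sort(Comparator.comparingDouble(Hit::score).reversed());
            return all.subList(0, Math.min(k, all.size()));
        }

        public static void main(String[] args) {
            List<Hit> server1 = List.of(new Hit("http://a.example/1", 0.92f),
                                        new Hit("http://a.example/2", 0.40f));
            List<Hit> server2 = List.of(new Hit("http://b.example/1", 0.75f));

            // Combined top 3 across both servers, as a front end would return it.
            aggregate(List.of(server1, server2), 3).forEach(System.out::println);
        }
    }
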
> thanks

> -bruce
