bruce wrote:
> hi...

> if it's ok, i've got some basic research questions.

> can someone tell me if there's a limit to the number of simultaneous
> websites that nutch/lucene can return...?
I assume you are asking about its indexing capacity. If so, the answer is billions of pages; it is limited pretty much only by hardware and bandwidth.
> i'm assuming nutch/lucene writes the text information from the crawl
> back to a db. can someone tell me if there's a limit to the number of pages
> that can be written to the db in a simultaneous manner...
The crawl process runs over a cluster of machines in parallel. Each fetcher grabs web pages in parallel with the others, and those pages are then reduced into a number of binary files called the crawl database. This is not a SQL database. While fetching can be massively parallel and, again, is limited only by hardware, writing the results into the crawl database usually happens on a single machine as a single job and is serial in nature.
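Very roughly, and only as an illustration (this is not Nutch's actual code; the class, method, and file names below are made up), the shape of that fetch-then-reduce step looks like this plain-Java sketch: many fetcher tasks run in parallel, then one serial writer folds everything into a single database file.

    // Illustration only -- not Nutch's actual code. Many fetcher tasks run in
    // parallel; a single writer then serially folds the results into one
    // "crawl db" file, mirroring parallel fetch + serial db update.
    import java.io.PrintWriter;
    import java.util.*;
    import java.util.concurrent.*;

    public class CrawlSketch {

        record FetchResult(String url, int status, String text) {}

        // Stand-in for a real fetcher: a real one would do an HTTP GET here.
        static FetchResult fetch(String url) {
            return new FetchResult(url, 200, "page text for " + url);
        }

        public static void main(String[] args) throws Exception {
            List<String> urls = List.of("http://example.com/a",
                                        "http://example.com/b",
                                        "http://example.com/c");

            // Parallel part: each submitted task plays the role of one fetcher.
            ExecutorService fetchers = Executors.newFixedThreadPool(4);
            List<Future<FetchResult>> pending = new ArrayList<>();
            for (String u : urls) {
                pending.add(fetchers.submit(() -> fetch(u)));
            }

            // Serial part: one writer, one file, one pass -- analogous to the
            // single job that updates the crawl database after fetching.
            try (PrintWriter db = new PrintWriter("crawldb.txt")) {
                for (Future<FetchResult> f : pending) {
                    FetchResult r = f.get();
                    db.println(r.url() + "\t" + r.status() + "\t" + r.text().length());
                }
            }
            fetchers.shutdown();
        }
    }

The point is just that the fetch side scales with however many fetcher slots you give it, while the final write into the crawl database is one sequential pass.
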
> from what i've seen, you can setup nutch/lucene to use multiple servers to
> do the search. how do these child servers go about adding their information
> from the crawl to the overall db....

Once the pages are fetched, they are processed for links and content and then go through an indexing step that creates binary index files in the Lucene format. This usually happens on a distributed file system; the index files are then copied to the local file system for searching. You would run multiple search servers to add search capacity, but those search servers never alter the indexes. Creation and manipulation of the indexes happens in batch MapReduce jobs well before the indexes are ever searched. Once created, an index is usually not changed again, just continually searched (i.e. read from disk). Multiple search servers aggregate their results before anything is returned to the user.
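That aggregation is conceptually just a merge of ranked lists. As an illustration only (this is not Nutch's distributed-search code; the Hit type and the per-server lists are invented for the example), a front end could do something like:

    // Illustration only -- not Nutch's distributed-search code. Each "server"
    // contributes its own ranked hits; the front end merges them and keeps the
    // global top-k before answering the user. The Hit type is made up.
    import java.util.*;

    public class SearchMergeSketch {

        record Hit(String url, float score) {}

        // Merge per-server result lists and keep the best k hits overall.
        static List<Hit> aggregate(List<List<Hit>> perServer, int k) {
            List<Hit> all = new ArrayList<>();
            for (List<Hit> hits : perServer) {
                all.addAll(hits);
            }
            all.sort(Comparator.comparingDouble(Hit::score).reversed());
            return all.subList(0, Math.min(k, all.size()));
        }

        public static void main(String[] args) {
            List<Hit> server1 = List.of(new Hit("http://a.example/1", 0.92f),
                                        new Hit("http://a.example/2", 0.40f));
            List<Hit> server2 = List.of(new Hit("http://b.example/1", 0.75f));

            // Combined top 3 across both servers, as a front end would return it.
            aggregate(List.of(server1, server2), 3).forEach(System.out::println);
        }
    }
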
> thanks

> -bruce
