Dalton, Jeffery wrote:
To illustrate, say we want to update a database of 100 million pages,
roughly 100 GB of data in the PageDb with a new set of 20 million pages
(20 GB). This should take approximately 9-10 hours (again using Doug's
numbers below), so you have 20 million updates divided by 9 hours
(9 * 60 * 60 = 32,400 seconds) = roughly 617 updates / second. Very good performance. Now let's
try the same update on a larger database -- our aforementioned database
of 10 billion pages (10 terabytes). Re-writing this database file
takes approximately 900 hours and so you have 20 million updates in 900
hours = 6.17 updates / second. This hopefully shows that as the database
scales while the batch size stays the same, performance degrades
linearly with the size of the database. To achieve the same performance
as with the 100 million page database, the update batch would have to be
2 billion pages (20% of the index).
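To spell that arithmetic out, here is a tiny illustrative snippet (my own sketch in Java, not anything from the Nutch codebase) that simply plugs the 9-hour and 900-hour re-write estimates quoted above into the same formula:

/** Back-of-the-envelope sketch of the figures above; illustrative only, not Nutch code. */
public class UpdateThroughput {

  /** Effective update rate when a fixed-size batch forces a full re-write of the database. */
  static double updatesPerSecond(double batchUpdates, double rewriteHours) {
    return batchUpdates / (rewriteHours * 3600.0);
  }

  public static void main(String[] args) {
    // 100M-page db (~100GB): full re-write takes ~9 hours   -> ~617 updates/second
    System.out.println(updatesPerSecond(20e6, 9.0));
    // 10B-page db (~10TB):   full re-write takes ~900 hours -> ~6.2 updates/second
    System.out.println(updatesPerSecond(20e6, 900.0));
  }
}

The batch size appears only in the numerator, so for a fixed batch the effective update rate falls in direct proportion to the time it takes to re-write the whole database.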
I assume that update batch sizes are proportional to the database size.
If you mostly have small constant-sized updates, then a different
data structure might be more appropriate, such as a B-tree (as used in
relational databases). In Nutch, the prototypical use case is crawling
large web collections, where update batches are generally proportional
to the total size of the database. In such cases this approach is
vastly superior to B-trees.
I believe that in order for Nutch to scale and mature, it will have to
adopt a more incremental approach that better balances smaller updates
with space and time performance constraints.
If I am right, I see this as an amazing opportunity to improve
scalability and performance. I have some ideas about how to implement a
more incremental approach that I would love to share if people are
interested, but I thought it best to try to identify the problem before
spouting off possible solutions when my premise may not even be correct.
I'd love to hear your ideas. The approach that occurs to me would be to
have multiple databases that are merged softly on access. This is akin
to Lucene's index segments. A small update could write a new small db.
When one, e.g., asks for the set of incoming links to a page, one
could first ask the full db, then the small db, and combine the
results. In the general case there could be a stack of databases of
logarithmically increasing size. When more than N db's of the same size
are on the stack they can be popped, merged and the result pushed back on.
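To make the segment-stack idea concrete, here is a rough in-memory sketch of the policy (all names hypothetical, nothing from the Nutch codebase): every small update becomes a new segment, lookups combine answers from all segments, and whenever MERGE_FACTOR segments land in the same logarithmic size class they are popped, merged, and pushed back as one larger segment, much as Lucene handles its index segments.

import java.util.*;

/** Rough sketch of a stack of db segments merged logarithmically; hypothetical, not from Nutch. */
public class SegmentedLinkDb {

  static final int MERGE_FACTOR = 10;  // merge when this many same-size-class segments pile up

  /** A single read-only segment mapping page URL -> incoming link URLs. */
  static class Segment {
    final Map<String, List<String>> incomingLinks = new HashMap<>();
    int size() { return incomingLinks.size(); }
  }

  private final List<Segment> segments = new ArrayList<>();  // full db first, newest update last

  /** A small update simply becomes a new small segment; no re-write of the big db. */
  public void addUpdate(Map<String, List<String>> batch) {
    Segment s = new Segment();
    s.incomingLinks.putAll(batch);
    segments.add(s);
    maybeMerge();
  }

  /** Lookups ask every segment, full db first, and combine the results. */
  public List<String> getIncomingLinks(String url) {
    List<String> result = new ArrayList<>();
    for (Segment s : segments) {
      result.addAll(s.incomingLinks.getOrDefault(url, Collections.<String>emptyList()));
    }
    return result;
  }

  /** When MERGE_FACTOR segments share a size class, pop them, merge them, push the result back. */
  private void maybeMerge() {
    boolean mergedSomething = true;
    while (mergedSomething) {
      mergedSomething = false;
      // Bucket segments by logarithmic size class, as Lucene does with its index segments.
      Map<Integer, List<Segment>> byClass = new HashMap<>();
      for (Segment s : segments) {
        int sizeClass = (int) (Math.log(Math.max(s.size(), 1)) / Math.log(MERGE_FACTOR));
        byClass.computeIfAbsent(sizeClass, k -> new ArrayList<>()).add(s);
      }
      for (List<Segment> group : byClass.values()) {
        if (group.size() >= MERGE_FACTOR) {
          Segment big = new Segment();
          for (Segment s : group) {
            for (Map.Entry<String, List<String>> e : s.incomingLinks.entrySet()) {
              big.incomingLinks.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                               .addAll(e.getValue());
            }
          }
          segments.removeAll(group);
          segments.add(big);
          mergedSomething = true;
          break;  // re-bucket and check again
        }
      }
    }
  }
}

In the real thing each segment would of course be an on-disk MapFile and the merge a sequential, sorted re-write rather than an in-memory copy, but the merge policy would be the same, so small updates would no longer force a re-write of the full database.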
Also, before you develop a lot of new code, please note that in the
mapred branch the webdb has been broken in two, into a crawldb and a
linkdb. These are batch-only and, as yet, without an abstract API:
they're simply accessed with MapFile.
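Since there is no abstract API yet, reading them looks roughly like the sketch below. This is only a guess at the shapes involved, written against the Hadoop-style MapFile.Reader with <url, CrawlDatum> entries; the exact packages, key class, and directory layout in the mapred branch may differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Hypothetical sketch of reading a crawldb directly through MapFile, with no abstract API. */
public class CrawlDbScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A crawldb is a directory of sorted <url, CrawlDatum> MapFiles;
    // point args[0] at one part-NNNNN directory.
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    try {
      while (reader.next(url, datum)) {   // entries come back in key (URL) order
        System.out.println(url + "\t" + datum);
      }
    } finally {
      reader.close();
    }
  }
}

The linkdb would be read the same way, just with a different value class holding the incoming links.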
Lastly, Doug, I was also hoping that you could clarify one point in
your previous post that I don't quite understand:
With 100 bytes per link and 10 links per page, a 100M page database
requires 100GB. At 10MB/second transfer rate this takes on the order of
three hours to read and six hours to re-write, even with tens of
millions of updates.
I believe the three hours is an approximation of 100GB / 10MB/second,
converted to minutes and then hours. I guess I am not understanding
how you got the six hours to re-write the data. You doubled the read
time? Why?
I think I just meant the total time to read and write it would be six
hours: roughly three hours to read the 100GB at 10MB/second, plus
roughly three more to write it back out.
Doug