Dalton, Jeffery wrote:
To illustrate, say we want to update a database of 100 million pages,
roughly 100 GB of data in the PageDb with a new set of 20 million pages
(20 GB). This should take approximately 9-10 hours (again using Doug's
numbers below), so you have 20 million updates divided by 9 hours
(9 * 60 * 60 = 32,400 seconds) = roughly 617 updates / second. Very good performance. Now let's
try the same update on a larger database -- our aforementioned database
of 10 billion pages (10 terabytes). Re-writing this database file
takes approximately 900 hours and so you have 20 million updates in 900
hours = 6.17 updates / second. This hopefully shows that as the database
scales while the batch size stays the same, performance degrades
linearly with the size of the database. To achieve the same performance
as with the 100 million page database, the update batch would have to be
2 billion pages (20% of the index).
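To spell that arithmetic out, here is a tiny illustrative snippet (my own sketch in Java, not anything from the Nutch codebase) that simply plugs the 9-hour and 900-hour re-write estimates quoted above into the same formula:

/** Back-of-the-envelope sketch of the figures above; illustrative only, not Nutch code. */
public class UpdateThroughput {

  /** Effective update rate when a fixed-size batch forces a full re-write of the database. */
  static double updatesPerSecond(double batchUpdates, double rewriteHours) {
    return batchUpdates / (rewriteHours * 3600.0);
  }

  public static void main(String[] args) {
    // 100M-page db (~100GB): full re-write takes ~9 hours   -> ~617 updates/second
    System.out.println(updatesPerSecond(20e6, 9.0));
    // 10B-page db (~10TB):   full re-write takes ~900 hours -> ~6.2 updates/second
    System.out.println(updatesPerSecond(20e6, 900.0));
  }
}

The batch size appears only in the numerator, so for a fixed batch the effective update rate falls in direct proportion to the time it takes to re-write the whole database.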
I assume that update batch sizes are proportional to the database size.
If you mostly have small constant-sized updates, then a different
data structure might be more appropriate, such as a B-tree (as used in
relational databases). In Nutch, the prototypical use case is crawling
large web collections, where update batches are generally proportional
to the total size of the database. In such cases this approach is
vastly superior to B-trees.
I believe that in order for Nutch to scale and mature, it will have to
adopt a more incremental approach that better balances smaller updates
with space and time performance constraints.
If I am right, I see this as an amazing opportunity to improve
scalability and performance. I have some ideas about how to implement a
more incremental approach that I would love to share if people are
interested, but I thought it best to try to identify the problem before
spouting off possible solutions when my premise may not even be correct.
I'd love to hear your ideas. The approach that occurs to me would be to
have multiple databases that are merged softly on access. This is akin
to Lucene's index segments. A small update could write a new small db.
When one, e.g., asks for the set of incoming links to a page, one
could first ask the full db, then the small db, and combine the
results. In the general case there could be a stack of databases of
logarithmically increasing size. When more than N db's of the same size
are on the stack they can be popped, merged and the result pushed back on.
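To make the segment-stack idea concrete, here is a rough in-memory sketch of the policy (all names hypothetical, nothing from the Nutch codebase): every small update becomes a new segment, lookups combine answers from all segments, and whenever MERGE_FACTOR segments land in the same logarithmic size class they are popped, merged, and pushed back as one larger segment, much as Lucene handles its index segments.

import java.util.*;

/** Rough sketch of a stack of db segments merged logarithmically; hypothetical, not from Nutch. */
public class SegmentedLinkDb {

  static final int MERGE_FACTOR = 10;  // merge when this many same-size-class segments pile up

  /** A single read-only segment mapping page URL -> incoming link URLs. */
  static class Segment {
    final Map<String, List<String>> incomingLinks = new HashMap<>();
    int size() { return incomingLinks.size(); }
  }

  private final List<Segment> segments = new ArrayList<>();  // full db first, newest update last

  /** A small update simply becomes a new small segment; no re-write of the big db. */
  public void addUpdate(Map<String, List<String>> batch) {
    Segment s = new Segment();
    s.incomingLinks.putAll(batch);
    segments.add(s);
    maybeMerge();
  }

  /** Lookups ask every segment, full db first, and combine the results. */
  public List<String> getIncomingLinks(String url) {
    List<String> result = new ArrayList<>();
    for (Segment s : segments) {
      result.addAll(s.incomingLinks.getOrDefault(url, Collections.<String>emptyList()));
    }
    return result;
  }

  /** When MERGE_FACTOR segments share a size class, pop them, merge them, push the result back. */
  private void maybeMerge() {
    boolean mergedSomething = true;
    while (mergedSomething) {
      mergedSomething = false;
      // Bucket segments by logarithmic size class, as Lucene does with its index segments.
      Map<Integer, List<Segment>> byClass = new HashMap<>();
      for (Segment s : segments) {
        int sizeClass = (int) (Math.log(Math.max(s.size(), 1)) / Math.log(MERGE_FACTOR));
        byClass.computeIfAbsent(sizeClass, k -> new ArrayList<>()).add(s);
      }
      for (List<Segment> group : byClass.values()) {
        if (group.size() >= MERGE_FACTOR) {
          Segment big = new Segment();
          for (Segment s : group) {
            for (Map.Entry<String, List<String>> e : s.incomingLinks.entrySet()) {
              big.incomingLinks.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                               .addAll(e.getValue());
            }
          }
          segments.removeAll(group);
          segments.add(big);
          mergedSomething = true;
          break;  // re-bucket and check again
        }
      }
    }
  }
}

In the real thing each segment would of course be an on-disk MapFile and the merge a sequential, sorted re-write rather than an in-memory copy, but the merge policy would be the same, so small updates would no longer force a re-write of the full database.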
Also, before you develop a lot of new code, please note that in the
mapred branch the webdb has been broken in two, into a crawldb and a
linkdb. These are batch-only and, as yet, without an abstract API:
they're simply accessed with MapFile.
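Since there is no abstract API yet, reading them looks roughly like the sketch below. This is only a guess at the shapes involved, written against the Hadoop-style MapFile.Reader with <url, CrawlDatum> entries; the exact packages, key class, and directory layout in the mapred branch may differ.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Hypothetical sketch of reading a crawldb directly through MapFile, with no abstract API. */
public class CrawlDbScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // A crawldb is a directory of sorted <url, CrawlDatum> MapFiles;
    // point args[0] at one part-NNNNN directory.
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    try {
      while (reader.next(url, datum)) {   // entries come back in key (URL) order
        System.out.println(url + "\t" + datum);
      }
    } finally {
      reader.close();
    }
  }
}

The linkdb would be read the same way, just with a different value class holding the incoming links.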
Lastly, Doug, I was also hoping that you could clarify one point in
your previous post that I don't quite understand:
With 100 bytes per link and 10 links per page, a 100M page database
requires 100GB. At 10MB/second transfer rate this takes on the order of
three hours to read and six hours to re-write, even with tens of
millions of updates.
I believe the three hours is an approximation of 100GB / 10MB/second,
converted to minutes and then hours. I guess I am not understanding
how you got the six hours to re-write the data. You doubled the read
time? Why?
I think I just meant the total time to read and write it would be six
hours: roughly three hours to read the 100GB at 10MB/second, plus
roughly three more to write it back out.
Doug