I wanted to revive some discussion about the webdb storage
possibilities. I ran across this thread from a while back:
http://marc.theaimsgroup.com/?l=nutch-developers&m=109958192318128&w=2
where there was some discussion about design alternatives for the WebDb.
Let me lay out a scenario to illustrate the problem as I see it. Let's
say that we have a 10 billion page database (quite reasonable by today's
standards). Assuming roughly 1 KB of db data per page (100 bytes per
link, 10 links per page, per Doug's numbers below), the PageDb should
require roughly 10 terabytes of data. Using Doug's update numbers
below, executing an update on a single PageDb using the MapFile approach
should require roughly 900 hours of disk time. Yes? The problem I
observe is that as you scale up the size of the database, the size of
the batch updates needs to scale proportionally in order to achieve
the same update performance. If you don't scale the size of your batch
updates, then your performance will degrade because the entire database
must still be re-written even though you are updating less of it.
To illustrate, say we want to update a database of 100 million pages,
roughly 100 GB of data in the PageDb with a new set of 20 million pages
(20 GB). This should take approximately 9-10 hours (again using Doug's
numbers below), and so you have 20 million updates divided by 9 hours
(9 * 60 * 60 = 32,400 seconds) = 617 updates/second. Very good
performance. Now let's
try the same update on a larger database -- our aforementioned database
of 10 billion pages (10 terabytes). Re-writing this database file
takes approximately 900 hours and so you have 20 million updates in 900
hours = 6.17 updates/second. This hopefully shows that as your
database scales while your batch size remains the same, your
performance degrades linearly in relation to the size of the database.
In order to achieve the same performance as the 100 million page
database, the update batch would have to be 2 billion pages (20% of
the database).
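To make the arithmetic concrete, here is a rough back-of-the-envelope
sketch (the class name and constants are mine, not Nutch code; it
simply scales Doug's figure of roughly 9 hours of disk time per 100 GB
of db re-written):

public class UpdateThroughput {
    // Doug's figure: ~9 hours of disk time per 100 GB of db re-written
    // (3 hours to read + 6 hours to re-write at 10 MB/second).
    static final double HOURS_PER_100GB = 9.0;

    // Approximate update rate for a batch merged into a db of the given size.
    static double updatesPerSecond(double dbSizeGB, double batchSize) {
        double hours = HOURS_PER_100GB * (dbSizeGB / 100.0);
        return batchSize / (hours * 3600.0);
    }

    public static void main(String[] args) {
        // 100M pages (~100 GB of db data) merged with a 20M-page batch
        System.out.printf("100M-page db: %.0f updates/sec%n",
                          updatesPerSecond(100.0, 20e6));   // ~617
        // 10B pages (~10 TB of db data) merged with the same 20M-page batch
        System.out.printf("10B-page db:  %.2f updates/sec%n",
                          updatesPerSecond(10000.0, 20e6)); // ~6.17
    }
}

The batch (the numerator) stays fixed while the rewrite cost (the
denominator) grows with the db, which is the whole problem.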
One solution is to use ever-increasing batch sizes by merging large
segments into even larger batches before merging into the database;
however, at this scale I believe that process becomes unwieldy when we
are talking about hundreds of gigabytes of data. Now of course, you
can argue that the answer is to buy more machines (disks), since the
hours above are hours of disk I/O, and that you can parallelize the
process into an operation that takes days instead of months. However,
I would argue that this hasn't improved efficiency: you are still
getting only 6.17 updates per second of disk time.
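To put rough numbers on that (a purely hypothetical split of the 10 TB
db evenly across 100 disks, with the merge parallelizing perfectly):

  900 disk-hours / 100 disks = about 9 hours of wall-clock time, but
  the job still consumes 900 hours of disk time, so
  20,000,000 updates / (900 * 3600 disk-seconds) = about 6.17 updates
  per second of disk time -- exactly where we started.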
Do people see the problem? Am I wrong about this? I believe that this
problem points us in the direction of an architecture that, while it
might not be as fast at updating smaller databases, will hopefully
scale more gracefully. In other words, I would like to be able to
scale the size of the index without having to increase the size of my
batch updates, and without taking a major performance hit.
I am arguing that it is not feasible to wait until we have crawled a
batch that is a sizable fraction of the WebDb before updating the
link / page databases.
I believe that in order for Nutch to scale and mature, it will have to
adopt a more incremental approach that better balances smaller updates
with space and time performance constraints.
If I am right, I see this as an amazing opportunity to improve
scalability and performance. I have some ideas about how to implement
a more incremental approach that I would love to share if people are
interested, but I thought it best to try to identify the problem
before spouting off possible solutions when the premise may not be
correct.
Lastly, Doug, I was also hoping that you could clarify one point in
your previous post that I don't quite understand:
> With 100 bytes per link and 10 links per page, a 100M page database
> requires 100GB. At 10MB/second transfer rate this takes on the order
> of three hours to read and six hours to re-write, even with tens of
> millions of updates.
I believe the three hours is an approximation of 100GB / (10 MB/sec),
converted to minutes and then hours. I guess I am not understanding
how you got the six hours to re-write the data. You doubled the read
time? Why?
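For reference, the arithmetic I am using for the three-hour figure
(taking 1 GB as 1000 MB):

  100 GB / (10 MB/second) = 10,000 seconds, or roughly 2.8 hours for
  one sequential pass over the data.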
Thanks,
- Jeff Dalton
Doug's original post on the WebDb thread, for convenience:
> 2) use something like Berkeley DB which will increase space usage
> by I'd guess about 100-150%, but will allow for fast
> inserts/updates/deletes. Sounds better to me than the current
> approach, but for large installations we may run into hardware
> limits without compressing the data. I've heard of Berkeley DB being
> used to store 100Gig databases. I guess a large Nutch installation
> may push or break that size.
We started out using Berkeley DB and it became very slow when the
database was large. The problem is that B-Trees get fragmented as they
grow. Each update eventually requires a random access, a disk seek,
which takes around 10 milliseconds.
Consider this: if each B-tree page holds, say, 100 pages or links, and
we're updating at least 1% of all entries in the B-tree, then in the
course of a db update we'll visit every page in the B-tree, each as a
random access. It is much faster to pre-sort the updates and then merge
them with the database. All disk operations are sequential and hence
operate at the transfer rate, typically around 10MB/second, nearly 100
times faster than random seeks.
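(For a sense of scale, using the same figures and roughly 1 KB of db
data per page: transferring 1 KB sequentially at 10 MB/second takes
about 0.1 ms, while a random seek costs about 10 ms, which is where the
factor of nearly 100 comes from.)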
The last time I benchmarked the db sorting and merging code on large
collections it was disk i/o bound. Is this no longer the case? When
performing an update on a large (>10M page) db, what are the CPU and
disk utilizations?
In short, maintaining a link graph is a very data-intensive operation.
An RDBMS will always use a B-tree, and will always degenerate to random
accesses per link update when the database is large. Fetching at 100
pages per second with an average of 10 links per page requires 1000 link
updates per second in order for the database to keep up with fetching.
A typical hard drive can only perform 100 seeks per second. So any
approach which requires a random access per link will fail to keep up,
unless 10 hard drives are allocated per fetcher!
With 100 bytes per link and 10 links per page, a 100M page database
requires 100GB. At 10MB/second transfer rate this takes on the order of
three hours to read and six hours to re-write, even with tens of
millions of updates. With two 10ms seeks required per update, only
around 1M links could be updated in six hours.
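(Spelling that out: six hours is about 21,600 seconds, and at roughly
20 ms of seek time per update that is only around 1 million updates,
versus the tens of millions that a single sequential re-write can fold
in over the same nine hours or so.)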
So, yes, the implementation Nutch uses does use a lot of space, but it
is very scalable.
Doug