Sorry for the delay in my response (holiday busyness).  My comments are
in-line with your response below.  Your feedback would be greatly
appreciated.

- Jeff
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 16, 2005 1:30 PM
To: [email protected]
Subject: Re: Nutch WebDb storage alternatives: Revisited

Dalton, Jeffery wrote:
> To illustrate, say we want to update a database of 100 million pages,
> roughly 100 GB of data in the PageDb, with a new set of 20 million
> pages (20 GB).  This should take approximately 9-10 hours (again using
> Doug's numbers below), and so you have: 20 million updates divided by
> 9 hours (9 * 60 * 60 seconds) = 617 updates / second.  Very good
> performance.  Now let's try the same update on a larger database --
> our aforementioned database of 10 billion pages (10 terabytes).
> Re-writing this database file takes approximately 900 hours, and so
> you have 20 million updates in 900 hours = 6.17 updates / second.
> This hopefully shows that as your database scales and your batch size
> remains the same, your performance degrades linearly in relation to
> the size of the database.  In order to achieve the same performance as
> the 100 million page database, the update must be 2 billion pages (20%
> of the index).

I assume that update batch sizes are proportional to the database size.
If you mostly have small, constant-sized updates, then a different data
structure might be more appropriate, like a B-tree (as used in
relational databases).  In Nutch, the prototypical use case is crawling
large web collections, which generally has update batches proportional
to the total size of the database.  In such cases this approach is
vastly superior to B-trees.

I would propose that even when crawling large web collections, the
updates may not always be proportional to the total size of the database
if you want to keep your index fresh.  One of the goals of a web search
engine is to be an accurate representation of what is found on the web,
that is, to maximize freshness.  Several studies have shown that the web
is very dynamic, with a subset of pages changing constantly -- weekly
(15%), daily or hourly (~23%), or even more often ("The Evolution of the
Web and the Implications for an Incremental Crawler" --
http://citeseer.ist.psu.edu/419993.html and "What's new on the web? The
evolution of the Web from a Search Engine Perspective" --
http://www.www2004.org/proceedings/docs/1p1.pdf).  In order to keep
dynamic pages fresh, they must be crawled and indexed at a high
frequency.  In short, you have to be able to update your database often,
say on a daily basis, to keep it fresh with important dynamic pages.
 
Ideally, I would really like to see a database architecture that could
support small incremental updates (say at a rate of a few hundred
pages / sec) as well as very large batch updates at a faster rate
(thousands of updates per second).  The small updates let you keep your
database fresh and hopefully also provide adequate performance to grow
and explore new content.  Faster batch update speeds might also be very
beneficial at times, for example when you make major changes to your
ranking algorithms (like Google's "Jagger3" index release).
 
> I believe that in order for Nutch to scale and mature, it will have to
> adopt a more incremental approach that better balances smaller updates
> with space and time performance constraints.
>
> If I am right, I see this as an amazing opportunity to improve
> scalability and performance.  I have some ideas about how to implement
> a more incremental approach that I would love to share, but I thought
> it best to try to identify the problem before spouting off possible
> solutions when the question may not be correct.

I'd love to hear your ideas.  The approach that occurs to me would be to
have multiple databases that are merged softly on access.  This is akin
to Lucene's index segments.  A small update could write a new small db.
When one, e.g., asks for the set of incoming links to a page, one could
first ask the full db, then the small db, and combine the results.  In
the general case there could be a stack of databases of logarithmically
increasing size.  When more than N dbs of the same size are on the stack
they can be popped, merged, and the result pushed back on.
 
I had two ideas; the first is very similar to what you propose above.
Since it has a desirable update pattern, could Lucene be adapted to do
this?  Assuming it could be, what about its robustness?  If a disk
crashed, for example, we wouldn't want the index / db contents to be
corrupted and lost without a recovery procedure.  Would it make sense to
utilize checkpointing
(http://www.mail-archive.com/[email protected]/msg12709.html)?
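
To make sure I understand the stack-of-segments idea, here is a rough
sketch in Java of the merge policy I have in mind.  None of these
classes exist in Nutch or Lucene (the names are made up); it just
illustrates pushing one small db per update and popping/merging runs of
N similar-sized dbs, which gives the logarithmic structure you describe:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a stack of db segments of logarithmically
 *  increasing size.  Reads would consult every segment on the stack,
 *  newest first, and combine the results. */
public class SegmentStack {

  /** Minimal stand-in for one on-disk db segment. */
  public static class Segment {
    final long numRecords;
    public Segment(long numRecords) { this.numRecords = numRecords; }
  }

  private final Deque<Segment> stack = new ArrayDeque<Segment>();
  private final int mergeFactor;   // the "N" in your description, >= 2

  public SegmentStack(int mergeFactor) {
    if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor >= 2");
    this.mergeFactor = mergeFactor;
  }

  /** A small update just pushes a new small segment, then merges if needed. */
  public void addSegment(Segment s) {
    stack.push(s);
    maybeMerge();
  }

  /** While the top of the stack holds at least mergeFactor segments of the
   *  same size class, pop them, merge them, and push the result back on. */
  private void maybeMerge() {
    while (true) {
      List<Segment> run = new ArrayList<Segment>();
      for (Segment s : stack) {             // iterates top of stack first
        if (run.isEmpty() || sizeClass(s) == sizeClass(run.get(0))) {
          run.add(s);
        } else {
          break;
        }
      }
      if (run.size() < mergeFactor) {
        return;                             // nothing to merge yet
      }
      long merged = 0;
      for (int i = 0; i < run.size(); i++) {
        merged += stack.pop().numRecords;   // a real impl would merge files here
      }
      stack.push(new Segment(merged));      // may trigger another merge below it
    }
  }

  /** Segments fall in the same size class if they are within the same
   *  power-of-mergeFactor bucket of record counts. */
  private int sizeClass(Segment s) {
    long n = s.numRecords;
    int c = 0;
    while (n >= mergeFactor) { n /= mergeFactor; c++; }
    return c;
  }
}

Reads would then ask each segment on the stack in turn (the small dbs
plus the full db) and combine the answers, exactly as in your
incoming-links example.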
 
The second idea I have is heavily inspired by "WebBase: A repository of
web pages" (http://www9.org/w9cdrom/296/296.html).  The paper describes
several approaches and advocates what they call a Hashed-Log
architecture, where urls are partitioned into multiple log-structured
files by hash (let's say some combination of host/url).  A hashed-log
approach keeps the log files small (tens of megabytes) so that an
entire log can be searched/sorted in memory, if necessary.  Could we
use BDB as the log implementation in the Hashed-Log architecture
described above?  BDB Java Edition utilizes a log-structured storage
system, which makes it a good candidate.  Splitting the log files up
into more manageable pieces could provide faster performance, since
smaller B-trees would need to be searched and maintained.  BDB JE also
has the benefit of being extremely robust, with full transactional
support.  I know you used BDB in the past, but do you think it might be
worth reconsidering, perhaps in the framework of this or a similar
architecture?  BDB Java Edition has a new version, which might provide
some useful new features.
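
To make the BDB idea concrete, here is a very rough, untested sketch of
what a hashed store on top of BDB JE might look like.  The class name,
the directory layout, and the choice to hash on the host are just my own
illustration (not anything from the WebBase paper or from Nutch):

import java.io.File;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

/** Hypothetical sketch: partition pages by a hash of the host into many
 *  small BDB JE stores, so each underlying log (and B-tree) stays small. */
public class HashedPageStore {

  private final Database[] partitions;

  public HashedPageStore(File rootDir, int numPartitions) throws Exception {
    partitions = new Database[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      File dir = new File(rootDir, "part-" + i);
      dir.mkdirs();
      EnvironmentConfig envConfig = new EnvironmentConfig();
      envConfig.setAllowCreate(true);
      envConfig.setTransactional(true);      // JE gives us transactions for free
      Environment env = new Environment(dir, envConfig);
      DatabaseConfig dbConfig = new DatabaseConfig();
      dbConfig.setAllowCreate(true);
      dbConfig.setTransactional(true);
      // A real implementation would also keep the Environment around so it
      // can be checkpointed and closed cleanly.
      partitions[i] = env.openDatabase(null, "pages", dbConfig);
    }
  }

  /** Route a url to a partition by hashing its host. */
  private Database partitionFor(String url) throws Exception {
    String host = new java.net.URL(url).getHost();
    return partitions[(host.hashCode() & Integer.MAX_VALUE) % partitions.length];
  }

  /** An incremental update only touches one small partition. */
  public void put(String url, byte[] pageData) throws Exception {
    DatabaseEntry key = new DatabaseEntry(url.getBytes("UTF-8"));
    DatabaseEntry data = new DatabaseEntry(pageData);
    partitionFor(url).put(null, key, data);
  }

  public byte[] get(String url) throws Exception {
    DatabaseEntry key = new DatabaseEntry(url.getBytes("UTF-8"));
    DatabaseEntry data = new DatabaseEntry();
    OperationStatus status =
        partitionFor(url).get(null, key, data, LockMode.DEFAULT);
    return status == OperationStatus.SUCCESS ? data.getData() : null;
  }
}

Because every partition stays small, an incremental update only ever
dirties one modest log, and a whole partition could still be rewritten
in a batch pass when that is cheaper.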
 
I know that the Internet Archive uses BDB JE as the database system for
Heritrix.  They use BDB JE for all of Heritrix's large data structures,
such as the url frontier
(http://crawler.archive.org/cgi-bin/wiki.pl?AlreadySeen).  In the
previous link, they describe some interesting performance tests.  It
seems to work for them even at a relatively large scale (100+ GB and 64
million urls, with a transaction rate of 125 operations per second).
The caching in BDB JE seems to help performance significantly.  If those
numbers are correct, you could achieve satisfactory performance by using
it in a distributed, webdb-like fashion.

Alternatively, is there a hybrid approach between "Lucene" and
"Hashed-Log"?  I will call it the Hashed-Lucene architecture.  Could we
split the database up into smaller segments but utilize a Lucene-like
merging / storage mechanism?
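
Very roughly, I picture the hybrid as the same hash routing as in the
BDB sketch above, but with each partition holding its own small stack of
Lucene-style segments instead of a single BDB store.  Another invented
sketch, just to show the shape of it:

/** Hypothetical sketch of the "Hashed-Lucene" hybrid: hash urls into
 *  partitions, and give each partition its own small stack of
 *  Lucene-style segments that are merged as they accumulate. */
public class HashedSegmentStore {

  /** Stand-in for one partition's stack of merge-able segments. */
  public interface PartitionSegments {
    void add(String url, byte[] record);   // lands in a new small segment
    byte[] lookup(String url);             // newest segment first, then older
  }

  private final PartitionSegments[] partitions;

  public HashedSegmentStore(PartitionSegments[] partitions) {
    this.partitions = partitions;
  }

  /** Route a url to a partition by hashing its host. */
  private PartitionSegments partitionFor(String url)
      throws java.net.MalformedURLException {
    String host = new java.net.URL(url).getHost();
    return partitions[(host.hashCode() & Integer.MAX_VALUE) % partitions.length];
  }

  public void put(String url, byte[] record)
      throws java.net.MalformedURLException {
    partitionFor(url).add(url, record);
  }

  public byte[] get(String url) throws java.net.MalformedURLException {
    return partitionFor(url).lookup(url);
  }
}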
 
I haven't discussed the distribution algorithm (the paper describes
"uniform" vs. "hash") or the update mode (batch vs. incremental vs. a
hybrid of both) of the proposed database.  However, I would add that
adopting an incremental update approach could save space, since the
entire database would not have to be re-written before being searchable
by users.  This feature becomes more compelling as the database grows to
terabytes of data.
 
Also, before you develop a lot of new code, please note that in the
mapred branch the webdb has been broken in two, into a crawldb and a
linkdb.  These are batch-only and, as yet, without an abstract API:
they're simply accessed with MapFile.

I'm glad they've been broken up!  Sounds very logical to me.  If I (or
we) do decide to implement a different db model, designing an API for
the two databases would be very useful.  The API would make it much
easier to swap out and test different database implementations.  It
might make sense to use different database storage schemes depending on
the deployment (a la MySQL).  It's kind of cool that an API hasn't been
defined yet, because it provides an opportunity to create something new
without legacy baggage.
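
As a strawman for what such an API could look like (all of the names
below are made up; nothing like this exists in the mapred branch),
something along these lines would let a MapFile-backed, BDB-backed, or
hashed-log implementation sit behind the same interface:

import java.io.IOException;
import java.util.Map;

/** Hypothetical pluggable interface for the crawldb; the linkdb would get
 *  a similar one.  "Datum" is just a placeholder for whatever per-url
 *  record the crawldb keeps (fetch status, fetch time, score, ...). */
public interface CrawlDbStore<Datum> {

  /** Look up what we know about one url, or null if it is unknown. */
  Datum get(String url) throws IOException;

  /** Small incremental update: merge one new record into the store. */
  void update(String url, Datum datum) throws IOException;

  /** Large batch update: merge a whole batch of new records at once. */
  void updateAll(Map<String, Datum> batch) throws IOException;

  /** Make recent updates durable and visible to readers. */
  void sync() throws IOException;

  void close() throws IOException;
}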

> Lastly, Doug, I was also hoping that you could clarify one point in
> your previous post that I don't quite understand:
>
>       With 100 bytes per link and 10 links per page, a 100M page
>       database requires 100GB.  At 10MB/second transfer rate this
>       takes on the order of three hours to read and six hours to
>       re-write, even with tens of millions of updates.
>
> I believe the three hours is an approximation of 100GB / 10 MB/sec,
> converted to minutes and then hours.  I guess I am not understanding
> how you got the six hours to re-write the data.  You doubled the read
> time?  Why?

I think I just meant the total time to read and write it would be six
hours.
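
Ah, that makes sense.  Working the numbers through: reading the existing
100 GB at 10 MB/second is about 10,000 seconds, roughly three hours, and
writing the updated 100 GB back out takes about the same again, so the
whole read-plus-rewrite pass is on the order of six hours no matter how
few of the records actually changed.  Thanks for clarifying.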

Doug

 
