Sorry for the delay in my response -- holiday busyness. My comments are in-line with the response below. Your feedback would be greatly appreciated.
- Jeff

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 16, 2005 1:30 PM
To: [email protected]
Subject: Re: Nutch WebDb storage alternatives: Revisited

Dalton, Jeffery wrote:
> To illustrate, say we want to update a database of 100 million pages, roughly 100 GB of data in the PageDb, with a new set of 20 million pages (20 GB). This should take approximately 9-10 hours (again using Doug's numbers below), so you have 20 million updates divided by 9 hours (9 * 60 * 60 = 32,400 seconds) = 617 updates / second. Very good performance. Now let's try the same update on a larger database -- our aforementioned database of 10 billion pages (10 terabytes). Re-writing this database file takes approximately 900 hours, so you have 20 million updates in 900 hours = 6.17 updates / second. This hopefully shows that as your database scales and your batch size remains the same, your performance degrades linearly in relation to the size of the database. In order to achieve the same performance as the 100 million page database, the update must be 2 billion pages (20% of the index).

I assume that update batch sizes are proportional to the database size. If you mostly have small, constant-sized updates, then a different data structure might be more appropriate, like a B-tree (as used in relational databases). In Nutch, the prototypical use case is crawling large web collections, which generally has update batches proportional to the total size of the database. In such cases this approach is vastly superior to B-trees.

I would propose that even in crawling large web collections, the updates may not always be proportional to the total size of the database if you want to keep your index fresh. One of the goals of a web search engine is to be an accurate representation of what is found on the web, i.e. to maximize freshness. Several studies have shown that the web is very dynamic, with a subset of pages changing constantly -- weekly (15%), daily or hourly (~23%), or even more often ("The Evolution of the Web and the Implications for an Incremental Crawler" -- http://citeseer.ist.psu.edu/419993.html and "What's New on the Web? The Evolution of the Web from a Search Engine Perspective" -- http://www.www2004.org/proceedings/docs/1p1.pdf). In order to keep dynamic pages fresh, they must be crawled and indexed at a high frequency. In short, you have to be able to update your database often, say on a daily basis, to keep it fresh with important dynamic pages.

Ideally, I would really like to see a database architecture that could support small incremental updates (say at a rate of a few hundred pages / sec) as well as very large batch updates at a faster rate (thousands of updates per second). The small updates allow you to keep your database fresh and hopefully also provide adequate performance to grow and explore new content. Faster batch update speeds might also be very beneficial at times -- for example, when you make major changes to your ranking algorithms (like Google's "Jagger3" index release).

> I believe that in order for Nutch to scale and mature, it will have to adopt a more incremental approach that better balances smaller updates with space and time performance constraints.
>
> If I am right, I see this as an amazing opportunity to improve scalability and performance.
> I have some ideas about how to implement a more incremental approach that I would love to share, but I thought it best to try and identify the problem before spouting off possible solutions when the question may not be correct.

I'd love to hear your ideas. The approach that occurs to me would be to have multiple databases that are merged softly on access. This is akin to Lucene's index segments. A small update could write a new small db. When one, e.g., asks for the set of incoming links to a page, one could first ask the full db, then the small db, and combine the results. In the general case there could be a stack of databases of logarithmically increasing size. When more than N dbs of the same size are on the stack, they can be popped, merged, and the result pushed back on.

I had two ideas; the first is very similar to what you propose above. Since it has a desirable update pattern, could Lucene be adapted to do this? Assuming it could be, what about its robustness? If a disk crashed, for example, we wouldn't want the index / db contents to be corrupted and lost without a recovery procedure. Would it make sense to utilize checkpointing (http://www.mail-archive.com/[email protected]/msg12709.html)?

The second idea is heavily inspired by "WebBase: A repository of Web pages" (http://www9.org/w9cdrom/296/296.html). The paper describes several approaches and advocates what they call a hashed-log architecture, where urls are partitioned into multiple log-structured files by hash (let's say some combination of host/url). A hashed-log approach keeps the log files small (tens of megabytes) so that an entire log can be searched/sorted in memory, if necessary.

Could we use BDB as the log implementation in the hashed-log architecture described above? BDB Java Edition utilizes a log-structured storage system, which makes it a good candidate. Splitting the log files up into more manageable bits could provide faster performance, since smaller B-trees would need to be searched and maintained. BDB JE also has the benefit of being extremely robust, with full transactional support. I know you used BDB in the past, but do you think it might be worth re-considering, perhaps in the framework of this or a similar architecture? BDB Java Edition has a new version, which might provide some new features. I know that the Internet Archive uses BDB JE as the database system for Heritrix. They use it for all of Heritrix's large data structures, such as the url frontier (http://crawler.archive.org/cgi-bin/wiki.pl?AlreadySeen). That page describes some interesting performance tests, and it seems to work for them even at a relatively large scale (100+ GB and 64 million urls with a transaction rate of 125 operations per second). The caching in BDB JE seems to help performance significantly. If those numbers are correct, you could achieve satisfactory performance by using it in a distributed webdb-like fashion.

Alternatively, is there a hybrid approach between "Lucene" and "Hashed-Log"? I will call it the Hashed-Lucene architecture. Could we split the database up into smaller segments but utilize a Lucene-like merging / storage mechanism?

I haven't discussed the distribution algorithm (the paper describes "uniform" vs. "hash") or the update mode (batch vs. incremental vs. a hybrid of both) of the proposed database.
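To make the segment-stack / Hashed-Lucene idea above a bit more concrete, here is a minimal, untested Java sketch of a single partition. Everything in it is invented for illustration (the class names, the in-memory TreeMap standing in for an on-disk segment such as a MapFile, and the particular merge trigger); it only tries to show the "soft merge on access" and logarithmic-stack behaviour, not how Nutch or Lucene actually implement merging.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * One partition of a hypothetical "Hashed-Lucene" db: a stack of
 * immutable, sorted segments, newest first. In the full scheme, a url
 * would first be hashed (e.g. on host/url) to pick one of these.
 */
public class SegmentStack {

    /** One immutable, sorted batch of url -> record entries. */
    private static class Segment {
        final SortedMap<String, String> entries;
        Segment(SortedMap<String, String> entries) { this.entries = entries; }
        int size() { return entries.size(); }
    }

    private final Deque<Segment> segments = new ArrayDeque<Segment>();

    /** A small update simply becomes a brand-new small segment. */
    public void addBatch(SortedMap<String, String> batch) {
        segments.addFirst(new Segment(new TreeMap<String, String>(batch)));
        maybeMerge();
    }

    /** "Soft merge" on access: the newest segment that knows the url wins. */
    public String lookup(String url) {
        for (Segment s : segments) {              // iterates newest first
            String value = s.entries.get(url);
            if (value != null) return value;
        }
        return null;
    }

    /**
     * Keep the stack roughly logarithmic: whenever the newest segment has
     * grown to at least half the size of the one below it, pop both,
     * merge them, and push the result back on.
     */
    private void maybeMerge() {
        while (segments.size() >= 2) {
            Segment newer = segments.pollFirst();
            Segment older = segments.peekFirst();
            if (newer.size() * 2 >= older.size()) {
                segments.pollFirst();             // remove "older" as well
                TreeMap<String, String> merged =
                    new TreeMap<String, String>(older.entries);
                merged.putAll(newer.entries);     // newer records override older
                segments.addFirst(new Segment(merged));
            } else {
                segments.addFirst(newer);         // sizes differ enough; stop
                break;
            }
        }
    }
}

Lookups walk the segments newest-first, so a small fresh segment can shadow stale entries in the big one without rewriting it, and the merge step keeps the number of segments (and hence the lookup cost) roughly logarithmic in the total database size.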
However, I would add that adopting an incremental update approach could save space, since the entire database would not have to be re-written before being searchable by users. This feature becomes more compelling when the database grows to terabytes of data.

Also, before you develop a lot of new code, please note that in the mapred branch the webdb has been broken in two, into a crawldb and a linkdb. These are batch-only and, as yet, without an abstract API: they're simply accessed with MapFile.

I'm glad they've been broken up! Sounds very logical to me. If I (or we) do decide to implement a different db model, designing an API for the two databases would be very useful. The API would make it much easier to swap out and test different database implementations. It might make sense to use different database storage schemes depending on the deployment (a la MySQL). It's kind of cool that an API hasn't been defined yet, because it provides an opportunity to create something new without legacy baggage.

> Lastly, Doug, I was also hoping that you could clarify one point in your previous post that I don't quite understand:
>
> > With 100 bytes per link and 10 links per page, a 100M page database requires 100GB. At 10MB/second transfer rate this takes on the order of three hours to read and six hours to re-write, even with tens of millions of updates.
>
> I believe the three hours is an approximation of 100GB / 10 MB/sec, converted to minutes and then hours. I guess I am not understanding how you got the six hours to re-write the data. You doubled the read time? Why?

I think I just meant the total time to read and write it would be six hours.

Doug
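Purely to illustrate the "abstract API" point above, here is one hypothetical shape such interfaces could take. None of these names exist in the mapred branch; they are made up to show how a MapFile-based batch implementation, a BDB JE implementation, or a segment-stack implementation could be swapped per deployment, much like MySQL's pluggable storage engines.

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

/** Hypothetical record type; the real one would carry status, score, fetch time, etc. */
class CrawlRecord {
    String url;
    long nextFetchTime;
    byte status;
}

/** Hypothetical contract a crawldb implementation might offer. */
interface CrawlDbStore extends Closeable {

    /** Fetch the stored record for a url, or null if unknown. */
    CrawlRecord get(String url) throws IOException;

    /**
     * Apply a batch of new or updated records. A MapFile-based
     * implementation could rewrite itself here; a segment-stack or
     * BDB JE implementation could apply the batch incrementally.
     */
    void update(Iterator<CrawlRecord> batch) throws IOException;

    /** Sequential scan of every record, for batch jobs such as generate. */
    Iterator<CrawlRecord> scan() throws IOException;
}

/** Hypothetical companion contract for the linkdb side. */
interface LinkDbStore extends Closeable {
    /** Incoming links for a page. */
    List<String> getInlinks(String url) throws IOException;
    /** Add links discovered during a parse / update pass. */
    void addInlinks(String url, List<String> inlinks) throws IOException;
}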

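For completeness, here is roughly what using BDB JE as one of the per-partition log stores discussed earlier might look like. This is a minimal, untested sketch against the com.sleepycat.je API; the environment directory, database name, and record encoding are invented, and a real crawldb record would be a serialized object rather than a string.

import java.io.File;

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import com.sleepycat.je.Transaction;

public class BdbCrawlDbExample {
    public static void main(String[] args) throws Exception {
        // One transactional, log-structured environment per partition.
        File home = new File("crawldb-part-00");
        home.mkdirs();                           // JE expects the directory to exist
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        Environment env = new Environment(home, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setTransactional(true);
        Database db = env.openDatabase(null, "pages", dbConfig);

        // Write one url -> record pair inside a transaction.
        DatabaseEntry key = new DatabaseEntry("http://lucene.apache.org/".getBytes("UTF-8"));
        DatabaseEntry value = new DatabaseEntry("fetched 2005-11-16".getBytes("UTF-8"));
        Transaction txn = env.beginTransaction(null, null);
        db.put(txn, key, value);
        txn.commit();

        // Read it back.
        DatabaseEntry found = new DatabaseEntry();
        OperationStatus status = db.get(null, key, found, LockMode.DEFAULT);
        if (status == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), "UTF-8"));
        }

        db.close();
        env.close();
    }
}

The appeal is that JE's own log-structured, transactional storage provides the durability and recovery discussed above essentially for free; the trade-off is per-record rather than streaming I/O, which is exactly the tension with the batch rewrite approach.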