Hi Scott,

Thanks for your reply. I'm curious why using a database would be awful. Here is my requirement: we have two developers who want to do some processing and analysis work on the crawled data. If the data is stored in a database, we can easily share it, thanks to the well-defined data models. What's more, the analysis results can also be stored back into the database by just adding a few fields. For example, I need to know the average number of URLs per site. With a database, a single SQL query will do. If I want to extract and store the main part of web pages, I can't easily modify Nutch's data structures. Even in Solr, it's difficult and inefficient to iterate through the data set. The crawled data is structured, so why not use a database?
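To make that concrete, here is a minimal sketch of the "average URLs per site" query, using SQLite and a hypothetical `pages(site, url)` table (both the table layout and the sample rows are just illustrations, not Nutch's actual schema):

```python
import sqlite3

# Hypothetical schema: one row per crawled URL, tagged with its site.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (site TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("a.com", "http://a.com/1"),
        ("a.com", "http://a.com/2"),
        ("b.com", "http://b.com/1"),
    ],
)

# Count URLs per site in a subquery, then average those counts.
row = conn.execute(
    "SELECT AVG(n) FROM (SELECT COUNT(url) AS n FROM pages GROUP BY site)"
).fetchone()
print(row[0])  # 1.5 for the sample data above
```

The same one-statement aggregation would work on any relational database, which is the kind of ad-hoc analysis that is awkward against Nutch's segment files.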
Thanks!
Xiao

On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
> Use Solr? At its core, Solr is a document database. Using a relational
> database to warehouse your crawl data is generally an awful idea. I'd go
> so far as to suggest that you're probably looking at things the wrong way. :)
>
> I liken crawl data to sludge. Don't try to normalize it. Know what you want
> to get from it, and expose that data the best way possible. If you want to
> store it, index it, query it, transform it, collect statistics, etc., Solr
> is a terrific tool. Amazingly so.
>
> That said, you also have another very good choice. Take a look at Riak
> Search. It hijacks many core elements of Solr, which I applaud, and is
> compatible with Solr's HTTP interface. In effect, you can point Nutch's
> solr-index job at a Riak Search node instead and put your data there.
>
> The other nice thing: Riak is a (self-described) "mini-hadoop." So you can
> search across the Solr indexes that it's built on top of, or you can throw
> MapReduce jobs at Riak and perform some very detailed analytics.
>
> I don't know of a database that lacks a Java client, so the potential for
> indexing plugins is limitless, regardless of where the data is placed.
>
> Scott Gonyea
>
> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>
>> Hi guys,
>>
>> Nutch has its own data formats for CrawlDB and LinkDB, which are
>> difficult to manage and share among applications.
>> Are there any web crawlers based on a relational database?
>> I can see that Nutch is trying to use HBase for storage, but why not
>> use a relational database instead? We could use partitioning to solve
>> the scalability problem.
>>
>> Thanks!
>> Xiao

