Use Solr? At its core, Solr is a document database. Using a relational database to warehouse your crawl data is generally an awful idea. I'd go so far as to suggest that you're probably looking at things the wrong way. :)
I liken crawl data to sludge: don't try to normalize it. Know what you want to get out of it, and expose that data the best way possible. If you want to store it, index it, query it, transform it, collect statistics, etc., Solr is a terrific tool. Amazingly so.

That said, you also have another very good choice: take a look at Riak Search. They hijacked many core elements of Solr, which I applaud, and it's compatible with Solr's HTTP interface. In effect, you can point Nutch's solrindex job at a Riak Search node instead and put your data there.

The other nice thing: Riak is a (self-described) "mini-hadoop." So you can search across the Solr indexes it's built on top of, or you can throw MapReduce jobs at Riak and perform some very detailed analytics. I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed. (Rough sketches of both the querying and the MapReduce follow, below the quoted message.)

Scott Gonyea

On Oct 25, 2010, at 7:56 PM, xiao yang wrote:

> Hi, guys,
>
> Nutch has its own data format for CrawlDB and LinkDB, which are
> difficult to manage and share among applications.
> Are there any web crawlers based on relational database?
> I can see that Nutch is trying to use HBase for storage, but why not
> use a relational database instead? We can use partitioning to solve
> scalability problem.
>
> Thanks!
> Xiao
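To make the querying side concrete, here's a minimal SolrJ sketch of searching a Nutch-built index from Java. The Solr URL and the query are placeholders; url/title/content are fields from Nutch's stock schema.xml. If you go the Riak Search route, you'd swap in its Solr-compatible endpoint (something like http://riak-host:8098/solr/crawl, though the exact URL depends on your setup).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CrawlSearch {
  public static void main(String[] args) throws Exception {
    // Point at your Solr core (or a Riak Search node's
    // Solr-compatible endpoint); URL is a placeholder.
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // "content", "title" and "url" come from Nutch's stock schema.xml.
    SolrQuery query = new SolrQuery("content:hadoop");
    query.setRows(10);
    query.addField("url");
    query.addField("title");

    QueryResponse rsp = server.query(query);
    System.out.println("hits: " + rsp.getResults().getNumFound());
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("url") + " :: " + doc.getFieldValue("title"));
    }
  }
}

Getting the data in is just the usual Nutch indexing job pointed at the same URL, e.g. bin/nutch solrindex http://localhost:8983/solr/ crawldb linkdb segments/*.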

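And the MapReduce side, sketched with nothing but the JDK so it stays self-contained. Everything here is illustrative: the host/port, the "crawl" bucket, and the job itself (count every object in the bucket). /mapred is Riak's HTTP MapReduce endpoint, and Riak.reduceSum is one of its built-in JavaScript reduce functions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RiakMapRed {
  public static void main(String[] args) throws Exception {
    // Riak's MapReduce endpoint; host/port and the "crawl" bucket are assumptions.
    URL url = new URL("http://localhost:8098/mapred");

    // Count the objects in the bucket: map every object to 1, then sum.
    String job =
        "{\"inputs\":\"crawl\"," +
        " \"query\":[" +
        "   {\"map\":{\"language\":\"javascript\"," +
        "             \"source\":\"function(v) { return [1]; }\"}}," +
        "   {\"reduce\":{\"language\":\"javascript\"," +
        "                \"name\":\"Riak.reduceSum\"}}]}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");

    OutputStream out = conn.getOutputStream();
    out.write(job.getBytes("UTF-8"));
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // e.g. [1234]
    }
    in.close();
  }
}

In practice you'd reach for a proper Riak client rather than raw HttpURLConnection; this is just to show the shape of a job.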
