I love relational databases, but their many features are (in my opinion) wasted on the kind of data you find in Nutch. Row-locking and transactional integrity are great for lots of applications, but they become a whole lot of overhead when they're of next-to-no value to what you're doing.
RE: counting URLs: have you looked at Solr's facets? I use them like they're going out of style, and they're very powerful (a rough SolrJ facet sketch follows the thread below). For my application, Solr *is* my database. Nutch crawls data, stores it somewhere, then picks it back up and drops it into Solr. All of my crawl data sits in Solr. I actively report on stats from Solr, and I also update the stored content. Lots of fields and boolean attributes sit in the schema. As users work through the app, their changes get pushed back into Solr, so the next time they hit "Search," results disappear or move around according to how they organized them.

Scott

On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <[email protected]> wrote:
> Hi, Scott,
>
> Thanks for your reply.
> I'm curious why using a database would be awful.
> Here is my requirement: we have two developers who want to do some
> processing and analysis on the crawled data. If the data were stored
> in a database, we could easily share it, thanks to the well-defined
> data model. What's more, the analysis results could easily be stored
> back into the database by just adding a few fields.
> For example, I need to know the average number of URLs per site. In a
> database, a single SQL query would do. But if I want to extract and
> store the main part of web pages, I can't easily modify Nutch's data
> structures. Even in Solr, it's difficult and inefficient to iterate
> over the whole data set.
> The crawled data is structured, so why not use a database?
>
> Thanks!
> Xiao
>
> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
>> Use Solr? At its core, Solr is a document database. Using a relational
>> database to warehouse your crawl data is generally an awful idea. I'd
>> go so far as to suggest that you're probably looking at things the
>> wrong way. :)
>>
>> I liken crawl data to sludge. Don't try to normalize it. Know what you
>> want to get from it, and expose that data the best way possible. If
>> you want to store it, index it, query it, transform it, or collect
>> statistics on it, Solr is a terrific tool. Amazingly so.
>>
>> That said, you also have another very good choice: take a look at Riak
>> Search. It hijacks many of Solr's core elements, which I applaud, and
>> it is compatible with Solr's HTTP interface. In effect, you can point
>> Nutch's solrindex job at a Riak Search node instead and put your data
>> there (an example invocation follows this thread).
>>
>> The other nice thing: Riak is a (self-described) "mini-Hadoop." You
>> can search across the Solr indexes it is built on top of, or you can
>> throw MapReduce jobs at Riak and perform some very detailed analytics.
>>
>> I don't know of a database that lacks a Java client, so the potential
>> for indexing plugins is limitless, regardless of where the data is
>> placed.
>>
>> Scott Gonyea
>>
>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>
>>> Hi, guys,
>>>
>>> Nutch has its own data formats for the CrawlDB and LinkDB, which are
>>> difficult to manage and share among applications.
>>> Are there any web crawlers based on relational databases?
>>> I can see that Nutch is trying to use HBase for storage, but why not
>>> use a relational database instead? We could use partitioning to
>>> solve the scalability problem.
>>>
>>> Thanks!
>>> Xiao
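For concreteness, here is a minimal SolrJ sketch of the facet approach Scott describes, applied to Xiao's "average number of URLs per site" question. It is only a sketch: the Solr URL is a placeholder, and it assumes the index has a "host" field (Nutch's stock schema.xml defines one) and a Solr 1.4-era SolrJ client on the classpath.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class AverageUrlsPerSite {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at your own Solr instance.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);             // facet counts only; skip the documents themselves
            q.setFacet(true);
            q.addFacetField("host");  // assumes the "host" field from Nutch's schema.xml
            q.setFacetLimit(-1);      // no cap: return every distinct host
            q.setFacetMinCount(1);    // skip hosts with no indexed URLs

            QueryResponse rsp = solr.query(q);
            FacetField hosts = rsp.getFacetField("host");

            long totalUrls = 0;
            for (FacetField.Count host : hosts.getValues()) {
                totalUrls += host.getCount();   // URLs indexed for this host
            }
            int siteCount = hosts.getValues().size();
            System.out.printf("%d URLs across %d sites; average %.2f per site%n",
                    totalUrls, siteCount,
                    siteCount == 0 ? 0.0 : (double) totalUrls / siteCount);
        }
    }

One HTTP round trip replaces the full-table scan Xiao was worried about; the same query also works over plain HTTP with facet=true&facet.field=host if you'd rather not use SolrJ.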
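As for redirecting the crawl into Riak Search: Scott's suggestion needs no Nutch-side code, only a different URL, since Riak Search exposes a Solr-compatible HTTP endpoint (by default under /solr/<index> on Riak's HTTP port, 8098). A hedged example, assuming Nutch 1.2's positional SolrIndexer arguments and a hypothetical index named "crawl":

    bin/nutch solrindex http://riak-node:8098/solr/crawl crawl/crawldb crawl/linkdb crawl/segments/*

The riak-node host and the crawl/* paths are placeholders; check your Riak Search release's documentation for the exact endpoint before relying on this.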

