Use Solr? At its core, Solr is a document database. Using a relational database to warehouse your crawl data is generally an awful idea. I'd go so far as to suggest that you're probably looking at things the wrong way. :)
I liken crawl data to sludge: don't try to normalize it. Know what you want to get out of it, and expose that data the best way possible. If you want to store it, index it, query it, transform it, collect statistics, etc., Solr is a terrific tool. Amazingly so.

That said, you also have another very good choice: take a look at Riak Search. They hijacked many core elements of Solr, which I applaud, and it's compatible with Solr's HTTP interface. In effect, you can point Nutch's solrindex job at a Riak Search node instead and put your data there.

The other nice thing: Riak is a (self-described) "mini-hadoop." So you can search across the Solr indexes it's built on top of, or you can throw MapReduce jobs at Riak and perform some very detailed analytics. I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed. (Rough sketches of both the querying and the MapReduce follow, below the quoted message.)

Scott Gonyea

On Oct 25, 2010, at 7:56 PM, xiao yang wrote:

> Hi, guys,
>
> Nutch has its own data format for CrawlDB and LinkDB, which are
> difficult to manage and share among applications.
> Are there any web crawlers based on relational database?
> I can see that Nutch is trying to use HBase for storage, but why not
> use a relational database instead? We can use partitioning to solve
> scalability problem.
>
> Thanks!
> Xiao
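To make the querying side concrete, here's a minimal SolrJ sketch of searching a Nutch-built index from Java. The Solr URL and the query are placeholders; url/title/content are fields from Nutch's stock schema.xml. If you go the Riak Search route, you'd swap in its Solr-compatible endpoint (something like http://riak-host:8098/solr/crawl, though the exact URL depends on your setup).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CrawlSearch {
  public static void main(String[] args) throws Exception {
    // Point at your Solr core (or a Riak Search node's
    // Solr-compatible endpoint); URL is a placeholder.
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // "content", "title" and "url" come from Nutch's stock schema.xml.
    SolrQuery query = new SolrQuery("content:hadoop");
    query.setRows(10);
    query.addField("url");
    query.addField("title");

    QueryResponse rsp = server.query(query);
    System.out.println("hits: " + rsp.getResults().getNumFound());
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("url") + " :: " + doc.getFieldValue("title"));
    }
  }
}

Getting the data in is just the usual Nutch indexing job pointed at the same URL, e.g. bin/nutch solrindex http://localhost:8983/solr/ crawldb linkdb segments/*.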

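And the MapReduce side, sketched with nothing but the JDK so it stays self-contained. Everything here is illustrative: the host/port, the "crawl" bucket, and the job itself (count every object in the bucket). /mapred is Riak's HTTP MapReduce endpoint, and Riak.reduceSum is one of its built-in JavaScript reduce functions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RiakMapRed {
  public static void main(String[] args) throws Exception {
    // Riak's MapReduce endpoint; host/port and the "crawl" bucket are assumptions.
    URL url = new URL("http://localhost:8098/mapred");

    // Count the objects in the bucket: map every object to 1, then sum.
    String job =
        "{\"inputs\":\"crawl\"," +
        " \"query\":[" +
        "   {\"map\":{\"language\":\"javascript\"," +
        "             \"source\":\"function(v) { return [1]; }\"}}," +
        "   {\"reduce\":{\"language\":\"javascript\"," +
        "                \"name\":\"Riak.reduceSum\"}}]}";

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");

    OutputStream out = conn.getOutputStream();
    out.write(job.getBytes("UTF-8"));
    out.close();

    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line); // e.g. [1234]
    }
    in.close();
  }
}

In practice you'd reach for a proper Riak client rather than raw HttpURLConnection; this is just to show the shape of a job.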
