I love relational databases, but their many features are (in my
opinion) wasted on what you find in Nutch.  Row-locking and
transactional integrity are great for lots of applications, but they
become a whole lot of overhead when they're of next to no value to
whatever you're doing.

RE: counting URLs:  Have you looked at Solr's facets, etc.?  I use
them like they're going out of style--and they're very powerful.
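
For example, counting URLs per site is a single facet query.  Here's a
minimal SolrJ sketch (the "site" field and the core URL are assumptions
about your schema, not something Nutch gives you out of the box):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UrlCountsPerSite {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Match everything, return no rows--we only want the facet counts
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("site");   // hypothetical per-site field

    QueryResponse rsp = solr.query(q);
    FacetField sites = rsp.getFacetField("site");
    for (FacetField.Count c : sites.getValues()) {
      System.out.println(c.getName() + ": " + c.getCount() + " urls");
    }
  }
}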

For my application, Solr *is* my database.  Nutch crawls data, stores
it somewhere, then picks it back up and drops it in Solr.  All of my
crawl data sits in Solr.  I actively report on stats from Solr, as
well as make updates to the content that's stored.  Lots of fields /
boolean attributes sit in the schema.
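
To sketch what that looks like (the field names here are illustrative
assumptions, not my actual schema):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexCrawlDoc {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // One crawled page as a Solr document; analysis results and
    // boolean attributes live right alongside the content
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/page");       // uniqueKey
    doc.addField("site", "example.com");
    doc.addField("content", "...extracted page text...");
    doc.addField("reviewed", false);    // hypothetical boolean flag

    solr.add(doc);
    solr.commit();
  }
}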

As the user works through the app, their changes get pushed back into
Solr.  Then the next time they hit "Search," results disappear or move
around according to how they've organized them.
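
Since Solr replaces a document that is re-added under the same
uniqueKey, "pushing a change back" is just re-indexing the document
with the new field values.  Roughly (same assumed fields as above):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PushUserEdit {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Re-add the full document; Solr swaps out the old copy that
    // shares the same "id", so the next search reflects the edit
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/page");
    doc.addField("site", "example.com");
    doc.addField("content", "...extracted page text...");
    doc.addField("reviewed", true);     // flipped by the user
    solr.add(doc);
    solr.commit();
  }
}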

Scott

On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <[email protected]> wrote:
> Hi, Scott,
>
> Thanks for your reply.
> I'm curious why you say using a database is awful.
> Here is my requirement: we have two developers who want to do some
> processing and analysis work on the crawled data. If the data is
> stored in a database, we can easily share it, thanks to the
> well-defined data models. What's more, the analysis results can also
> be stored back into the database easily by just adding a few fields.
> For example, I need to know the average number of urls per site. In
> a database, a single SQL query will do. If I want to extract and
> store the main part of web pages, I can't easily modify Nutch's data
> structures. Even in Solr, it's difficult and inefficient to iterate
> through the data set.
> The crawled data is structured, so why not use a database?
>
> Thanks!
> Xiao
>
> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
>> Use Solr?  At its core, Solr is a document database.  Using a relational
>> database to warehouse your crawl data is generally an awful idea.  I'd go
>> so far as to suggest that you're probably looking at things the wrong way. :)
>>
>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you 
>> want to get from it, and expose that data the best way possible.  If you 
>> want to store it, index it, query it, transform it, collect statistics, 
>> etc... Solr is a terrific tool.  Amazingly so.
>>
>> That said, you also have another very good choice.  Take a look at Riak
>> Search.  It hijacked many core elements of Solr, which I applaud, and it's
>> compatible with Solr's HTTP interface.  In effect, you can point Nutch's
>> solr-index job at a Riak Search node instead and put your data there.
>>
>> The other nice thing: Riak is a (self-described) "mini-Hadoop."  So you can
>> search across the Solr indexes that it's built on top of, or you can throw
>> MapReduce jobs at Riak and perform some very detailed analytics.
>>
>> I don't know of a database that lacks a Java client, so the potential for 
>> indexing plugins is limitless... regardless of where the data is placed.
>>
>> Scott Gonyea
>>
>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>
>>> Hi, guys,
>>>
>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>> difficult to manage and share among applications.
>>> Are there any web crawlers based on relational database?
>>> I can see that Nutch is trying to use HBase for storage, but why not
>>> use a relational database instead? We can use partitioning to solve
>>> the scalability problem.
>>>
>>> Thanks!
>>> Xiao
>>
>>
>
