Hi Scott,

I agree with you that the row-locking and transactional-integrity features are of little use here. But we could reduce the overhead by reading data in blocks; I mean reading many rows (say 1K or more) at a time and processing them in memory. Do you think that would work?
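Something like this rough JDBC sketch is what I have in mind (the table and column names are invented, and the exact fetch-size/streaming behavior is driver-dependent):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BlockReader {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/nutch", "user", "pass");
            conn.setAutoCommit(false);   // some drivers only stream results outside autocommit
            Statement stmt = conn.createStatement();
            stmt.setFetchSize(1000);     // hint: pull ~1K rows per round trip
            ResultSet rs = stmt.executeQuery("SELECT url, content FROM crawl_data");
            while (rs.next()) {
                process(rs.getString("url"), rs.getString("content"));
            }
            rs.close();
            stmt.close();
            conn.close();
        }

        private static void process(String url, String content) {
            // in-memory processing of each row goes here
        }
    }

That way only one block of rows is held at a time, instead of paying per-row round-trip overhead.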
Thanks!
Xiao

On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <[email protected]> wrote:
> Not that it's guaranteed to be of "next to no value," but really,
> you've probably already lost pages just crawling them. Server /
> network errors, for example, take the integrity question and make it
> a cost-benefit trade-off. Do you recrawl a bunch? At different times?
> Different geographies?
>
> Row locking is reasonably nice, but that begs other questions. It can
> easily be solved one of two ways: put your data in Solr, and persist
> your efforts in both places, Solr and an SQL backend; or, if you're
> using Riak (or Cassandra), allow document collisions to exist and
> reconcile them within your application.
>
> It sounds complex, but both are actually quite trivial to implement.
>
> Scott
>
> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <[email protected]> wrote:
>> I love relational databases, but their many features are (in my
>> opinion) wasted on what you find in Nutch. Row-locking and
>> transactional integrity are great for lots of applications, but they
>> become a whole lot of overhead when they're of next to no value to
>> whatever you're doing.
>>
>> RE: counting URLs: have you looked at Solr's facets, etc.? I use them
>> like they're going out of style, and they're very powerful.
>>
>> For my application, Solr *is* my database. Nutch crawls data, stores
>> it somewhere, then picks it back up and drops it in Solr. All of my
>> crawl data sits in Solr. I actively report on stats from Solr, as
>> well as make updates to the content that's stored. Lots of fields and
>> boolean attributes sit in the schema.
>>
>> As users work through the app, their changes get pushed back into
>> Solr. Then, when they next hit "Search," results disappear or move
>> around according to how they organized them.
>>
>> Scott
>>
>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <[email protected]> wrote:
>>> Hi, Scott,
>>>
>>> Thanks for your reply.
>>> I'm curious why you consider using a database awful.
>>> Here is my requirement: we have two developers who want to do some
>>> processing and analysis work on the crawled data. If the data is
>>> stored in a database, we can easily share it, thanks to the
>>> well-defined data models. What's more, the analysis results can also
>>> be stored back into the database easily, by just adding a few fields.
>>> For example, I need to know the average number of URLs per site; in
>>> a database, a single SQL query will do. And if I want to extract and
>>> store the main part of each web page, I can't easily modify Nutch's
>>> data structures. Even in Solr, it's difficult and inefficient to
>>> iterate through the data set.
>>> The crawled data is structured, so why not use a database?
>>>
>>> Thanks!
>>> Xiao
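For concreteness, the "single SQL" Xiao mentions above might look like the sketch below, assuming a hypothetical crawl_data table with url and site columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class AvgUrlsPerSite {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/nutch", "user", "pass");
            Statement stmt = conn.createStatement();
            // count URLs per site, then average those per-site counts
            ResultSet rs = stmt.executeQuery(
                    "SELECT AVG(cnt) FROM "
                    + "(SELECT site, COUNT(*) AS cnt FROM crawl_data GROUP BY site) AS per_site");
            if (rs.next()) {
                System.out.println("average URLs per site: " + rs.getDouble(1));
            }
            conn.close();
        }
    }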
>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
>>>> Use Solr? At its core, Solr is a document database. Using a relational
>>>> database to warehouse your crawl data is generally an awful idea. I'd
>>>> go so far as to suggest that you're probably looking at things the
>>>> wrong way. :)
>>>>
>>>> I liken crawl data to sludge. Don't try to normalize it. Know what
>>>> you want to get from it, and expose that data the best way possible.
>>>> If you want to store it, index it, query it, transform it, collect
>>>> statistics, etc., Solr is a terrific tool. Amazingly so.
>>>>
>>>> That said, you also have another very good choice. Take a look at
>>>> Riak Search. They hijacked many core elements of Solr (which I
>>>> applaud), and it is compatible with Solr's HTTP interface. In effect,
>>>> you can point Nutch's solr-index job at a Riak Search node instead
>>>> and put your data there.
>>>>
>>>> The other nice thing: Riak is a (self-described) "mini-Hadoop." So
>>>> you can search across the Solr indexes that it's built on top of, or
>>>> you can throw MapReduce jobs at Riak and perform some very detailed
>>>> analytics.
>>>>
>>>> I don't know of a database that lacks a Java client, so the potential
>>>> for indexing plugins is limitless... regardless of where the data is
>>>> placed.
>>>>
>>>> Scott Gonyea
>>>>
>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>
>>>>> Hi, guys,
>>>>>
>>>>> Nutch has its own data formats for CrawlDB and LinkDB, which are
>>>>> difficult to manage and share among applications.
>>>>> Are there any web crawlers based on a relational database?
>>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>>> use a relational database instead? We could use partitioning to
>>>>> solve the scalability problem.
>>>>>
>>>>> Thanks!
>>>>> Xiao
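P.S. If I understand the faceting suggestion correctly, counting URLs per site in Solr would look roughly like this with SolrJ (the "site" field is an assumption about the schema, and the Solr URL is a placeholder):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SiteCounts {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);               // we only want the counts, not the documents
            q.setFacet(true);
            q.addFacetField("site");    // one bucket per site value, with its URL count
            q.setFacetLimit(-1);        // return every bucket
            QueryResponse rsp = solr.query(q);
            for (FacetField.Count c : rsp.getFacetField("site").getValues()) {
                System.out.println(c.getName() + ": " + c.getCount());
            }
        }
    }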

