I want to modify the crawler's schedule to make it more real-time. Some web pages are updated frequently, while others seldom change. My idea is to classify URLs into two categories, which will affect each URL's score, so I want to add a field that stores which category a URL belongs to. The idea is simple, but I found it's not so easy to implement in Nutch.
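To make the idea concrete, here is a rough sketch in plain Java (all names are made up for illustration; none of this is actual Nutch API):

  // Hypothetical sketch only -- these classes are NOT part of Nutch.
  // It just illustrates the idea: tag each URL with a change-frequency
  // category and let that category boost or dampen its fetch score.
  public class CategoryScoring {

      enum UrlCategory { FREQUENTLY_UPDATED, RARELY_UPDATED }

      // Stand-in for the per-URL field that would have to be persisted in the CrawlDb.
      static UrlCategory categoryOf(String url) {
          // e.g. news front pages change often, static docs rarely
          return url.contains("/news/") ? UrlCategory.FREQUENTLY_UPDATED
                                        : UrlCategory.RARELY_UPDATED;
      }

      // Adjust the generate-time score so frequently updated URLs are selected sooner.
      static float adjustScore(String url, float baseScore) {
          switch (categoryOf(url)) {
              case FREQUENTLY_UPDATED: return baseScore * 2.0f;
              default:                 return baseScore * 0.5f;
          }
      }

      public static void main(String[] args) {
          System.out.println(adjustScore("http://example.com/news/today", 1.0f));
          System.out.println(adjustScore("http://example.com/about", 1.0f));
      }
  }
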
Thanks!
Xiao

On Wed, Oct 27, 2010 at 2:04 PM, Scott Gonyea <[email protected]> wrote:
> Lots of things will "work"; the question is all about what you're doing, specifically. I avoid trolling with phrases like "MySQL can't scale" (unless I know I can get a funny response). MySQL works and scales wonderfully for a specific set of problems, is 'more than good enough' for most problems, and will make your life needlessly difficult for some others.
>
> If you post some larger insights into what you want to warehouse from your crawl data, and what you plan to do with it, I can try to give some deeper feedback on how to approach it. But really, nothing too awful can come from putting it into SQL and picking up your own set of lessons. It may well be good enough and have just the right level of convenience for whomever is using it.
>
> There's no real "right" or "wrong" answer, which is what makes some of this stuff a real PITA. Sometimes it'd be nice if someone told me what tool to use--so I could move on with my life and solve the nonsense I was supposed to. It's all still very new right now--but Solr (and thus Lucene) has a fairly established track record in indexing/cataloguing heavily de-normalized internet sludge.
>
> Scott Gonyea
>
> On Tue, Oct 26, 2010 at 10:14 PM, xiao yang <[email protected]> wrote:
>> Hi, Scott,
>>
>> I agree with you on the uselessness of the row-locking and transactional integrity features. But we can reduce the overhead by reading data in blocks: read many rows (1K or more) at a time and process them in memory. Do you think that would work?
>>
>> Thanks!
>> Xiao
>>
>> On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <[email protected]> wrote:
>>> Not that it's guaranteed to be of "next to no value", but really, you've probably already lost pages just crawling them. Server / network errors, for example, take the integrity question and make it a cost-benefit one. Do you recrawl a bunch? At different times? Different geographies?
>>>
>>> Row locking is reasonably nice, but that begs other questions. It can easily be solved in one of two ways: put your data in Solr, and persist your efforts in both places, Solr and an SQL backend; or, if you're using Riak (or Cassandra), allow document collisions to exist and reconcile them within your application.
>>>
>>> It sounds complex, but both are actually quite trivial to implement.
>>>
>>> Scott
>>>
>>> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <[email protected]> wrote:
>>>> I love relational databases, but their many features are (in my opinion) wasted on what you find in Nutch. Row-locking and transactional integrity are great for lots of applications, but become a whole lot of overhead when they're of next-to-no value to whatever you're doing.
>>>>
>>>> RE: counting URLs: have you looked at Solr's facets, etc.? I use them like they're going out of style--and they're very powerful.
>>>>
>>>> For my application, Solr *is* my database. Nutch crawls data, stores it somewhere, then picks it back up and drops it in Solr. All of my crawl data sits in Solr. I actively report on stats from Solr, as well as make updates to the content that's stored. Lots of fields / boolean attributes sit in the schema.
>>>>
>>>> As the user works through the app, their changes get pushed back into Solr. Then, when they next hit "Search", results disappear / move around as they have organized them.
>>>>
>>>> Scott
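(For reference, the per-site counting with facets that Scott describes might look roughly like this with SolrJ; the "host" field, the core name, and the URL below are assumptions about the schema, not anything Nutch sets up for you.)

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class HostFacetCount {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr =
              new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build();

          SolrQuery q = new SolrQuery("*:*");
          q.setRows(0);              // only the facet counts, not the documents
          q.setFacet(true);
          q.addFacetField("host");   // one bucket per site

          QueryResponse rsp = solr.query(q);
          for (FacetField.Count c : rsp.getFacetField("host").getValues()) {
              System.out.println(c.getName() + " -> " + c.getCount() + " urls");
          }
          solr.close();
      }
  }
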
>>>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <[email protected]> wrote:
>>>>> Hi, Scott,
>>>>>
>>>>> Thanks for your reply.
>>>>> I'm curious about the reason why using a database is awful.
>>>>> Here is my requirement: we have two developers who want to do some processing and analysis work on the crawled data. If the data is stored in a database, we can easily share it, thanks to the well-defined data models. What's more, the analysis results can also be stored back into the database easily, by just adding a few fields.
>>>>> For example, I need to know the average number of URLs per site. In a database, a single SQL query will do. If I want to extract and store the main part of web pages, I can't easily modify Nutch's data structures. Even in Solr, it's difficult and inefficient to iterate through the data set.
>>>>> The crawled data is structured, so why not use a database?
>>>>>
>>>>> Thanks!
>>>>> Xiao
>>>>>
>>>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
>>>>>> Use Solr? At its core, Solr is a document database. Using a relational database to warehouse your crawl data is generally an awful idea. I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>>>
>>>>>> I liken crawl data to sludge. Don't try to normalize it. Know what you want to get from it, and expose that data the best way possible. If you want to store it, index it, query it, transform it, collect statistics, etc., Solr is a terrific tool. Amazingly so.
>>>>>>
>>>>>> That said, you also have another very good choice. Take a look at Riak Search. It hijacked many core elements of Solr, which I applaud, and it is compatible with Solr's HTTP interface. In effect, you can point Nutch's solr-index job at a Riak Search node instead and put your data there.
>>>>>>
>>>>>> The other nice thing: Riak is a (self-described) "mini-hadoop". So you can search across the Solr indexes it's built on top of, or you can throw MapReduce jobs at Riak and perform some very detailed analytics.
>>>>>>
>>>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>>>
>>>>>> Scott Gonyea
>>>>>>
>>>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>>>
>>>>>>> Hi, guys,
>>>>>>>
>>>>>>> Nutch has its own data formats for the CrawlDB and LinkDB, which are difficult to manage and share among applications.
>>>>>>> Are there any web crawlers based on a relational database?
>>>>>>> I can see that Nutch is trying to use HBase for storage, but why not use a relational database instead? We could use partitioning to solve the scalability problem.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Xiao
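(For reference, the "a single SQL query will do" example from xiao's message above could look roughly like this over JDBC; the table name "urls", the "host" column, and the connection string are all hypothetical.)

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class AvgUrlsPerSite {
      public static void main(String[] args) throws Exception {
          // Assumes the crawl data was loaded into a relational table "urls(host, url, ...)".
          try (Connection con =
                   DriverManager.getConnection("jdbc:mysql://localhost/crawl", "user", "pass");
               Statement st = con.createStatement();
               ResultSet rs = st.executeQuery(
                   "SELECT AVG(cnt) FROM " +
                   "  (SELECT host, COUNT(*) AS cnt FROM urls GROUP BY host) t")) {
              if (rs.next()) {
                  System.out.println("average urls per site: " + rs.getDouble(1));
              }
          }
      }
  }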

