Hi Scott,

I agree with you that the row-locking and transactional-integrity features are of little use here. But we could reduce the overhead by reading data in blocks; I mean reading many rows (say 1K or more) at a time and processing them in memory. Do you think that would work?
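Something like this rough JDBC sketch is what I have in mind (the table and column names are invented, and the exact fetch-size/streaming behavior is driver-dependent):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class BlockReader {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/nutch", "user", "pass");
            conn.setAutoCommit(false);   // some drivers only stream results outside autocommit
            Statement stmt = conn.createStatement();
            stmt.setFetchSize(1000);     // hint: pull ~1K rows per round trip
            ResultSet rs = stmt.executeQuery("SELECT url, content FROM crawl_data");
            while (rs.next()) {
                process(rs.getString("url"), rs.getString("content"));
            }
            rs.close();
            stmt.close();
            conn.close();
        }

        private static void process(String url, String content) {
            // in-memory processing of each row goes here
        }
    }

That way only one block of rows is held at a time, instead of paying per-row round-trip overhead.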
Thanks!
Xiao

On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <[email protected]> wrote:
> Not that it's guaranteed to be of "next to no value," but really,
> you've probably already lost pages just crawling them. Server /
> network errors, for example, take the integrity question and make it
> a cost-benefit trade-off. Do you recrawl a bunch? At different times?
> Different geographies?
>
> Row locking is reasonably nice, but that begs other questions. It can
> easily be solved one of two ways: put your data in Solr, and persist
> your efforts in both places, Solr and an SQL backend; or, if you're
> using Riak (or Cassandra), allow document collisions to exist and
> reconcile them within your application.
>
> It sounds complex, but both are actually quite trivial to implement.
>
> Scott
>
> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <[email protected]> wrote:
>> I love relational databases, but their many features are (in my
>> opinion) wasted on what you find in Nutch. Row-locking and
>> transactional integrity are great for lots of applications, but they
>> become a whole lot of overhead when they're of next to no value to
>> whatever you're doing.
>>
>> RE: counting URLs: have you looked at Solr's facets, etc.? I use them
>> like they're going out of style, and they're very powerful.
>>
>> For my application, Solr *is* my database. Nutch crawls data, stores
>> it somewhere, then picks it back up and drops it in Solr. All of my
>> crawl data sits in Solr. I actively report on stats from Solr, as
>> well as make updates to the content that's stored. Lots of fields and
>> boolean attributes sit in the schema.
>>
>> As users work through the app, their changes get pushed back into
>> Solr. Then, when they next hit "Search," results disappear or move
>> around according to how they organized them.
>>
>> Scott
>>
>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <[email protected]> wrote:
>>> Hi, Scott,
>>>
>>> Thanks for your reply.
>>> I'm curious why you consider using a database awful.
>>> Here is my requirement: we have two developers who want to do some
>>> processing and analysis work on the crawled data. If the data is
>>> stored in a database, we can easily share it, thanks to the
>>> well-defined data models. What's more, the analysis results can also
>>> be stored back into the database easily, by just adding a few fields.
>>> For example, I need to know the average number of URLs per site; in
>>> a database, a single SQL query will do. And if I want to extract and
>>> store the main part of each web page, I can't easily modify Nutch's
>>> data structures. Even in Solr, it's difficult and inefficient to
>>> iterate through the data set.
>>> The crawled data is structured, so why not use a database?
>>>
>>> Thanks!
>>> Xiao
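For concreteness, the "single SQL" Xiao mentions above might look like the sketch below, assuming a hypothetical crawl_data table with url and site columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class AvgUrlsPerSite {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/nutch", "user", "pass");
            Statement stmt = conn.createStatement();
            // count URLs per site, then average those per-site counts
            ResultSet rs = stmt.executeQuery(
                    "SELECT AVG(cnt) FROM "
                    + "(SELECT site, COUNT(*) AS cnt FROM crawl_data GROUP BY site) AS per_site");
            if (rs.next()) {
                System.out.println("average URLs per site: " + rs.getDouble(1));
            }
            conn.close();
        }
    }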
>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
>>>> Use Solr? At its core, Solr is a document database. Using a relational
>>>> database to warehouse your crawl data is generally an awful idea. I'd
>>>> go so far as to suggest that you're probably looking at things the
>>>> wrong way. :)
>>>>
>>>> I liken crawl data to sludge. Don't try to normalize it. Know what
>>>> you want to get from it, and expose that data the best way possible.
>>>> If you want to store it, index it, query it, transform it, collect
>>>> statistics, etc., Solr is a terrific tool. Amazingly so.
>>>>
>>>> That said, you also have another very good choice. Take a look at
>>>> Riak Search. They hijacked many core elements of Solr (which I
>>>> applaud), and it is compatible with Solr's HTTP interface. In effect,
>>>> you can point Nutch's solr-index job at a Riak Search node instead
>>>> and put your data there.
>>>>
>>>> The other nice thing: Riak is a (self-described) "mini-Hadoop." So
>>>> you can search across the Solr indexes that it's built on top of, or
>>>> you can throw MapReduce jobs at Riak and perform some very detailed
>>>> analytics.
>>>>
>>>> I don't know of a database that lacks a Java client, so the potential
>>>> for indexing plugins is limitless... regardless of where the data is
>>>> placed.
>>>>
>>>> Scott Gonyea
>>>>
>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>
>>>>> Hi, guys,
>>>>>
>>>>> Nutch has its own data formats for CrawlDB and LinkDB, which are
>>>>> difficult to manage and share among applications.
>>>>> Are there any web crawlers based on a relational database?
>>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>>> use a relational database instead? We could use partitioning to
>>>>> solve the scalability problem.
>>>>>
>>>>> Thanks!
>>>>> Xiao
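P.S. If I understand the faceting suggestion correctly, counting URLs per site in Solr would look roughly like this with SolrJ (the "site" field is an assumption about the schema, and the Solr URL is a placeholder):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SiteCounts {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);               // we only want the counts, not the documents
            q.setFacet(true);
            q.addFacetField("site");    // one bucket per site value, with its URL count
            q.setFacetLimit(-1);        // return every bucket
            QueryResponse rsp = solr.query(q);
            for (FacetField.Count c : rsp.getFacetField("site").getValues()) {
                System.out.println(c.getName() + ": " + c.getCount());
            }
        }
    }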

