I want to modify the crawler's schedule to make it more real-time. Some web pages are updated frequently, while others seldom change. My idea is to classify URLs into two categories, which will affect each URL's score, so I want to add a field that stores which category a URL belongs to. The idea is simple, but I found it's not so easy to implement in Nutch.
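To make the idea concrete, here is a rough sketch in plain Java (all names are made up for illustration; none of this is actual Nutch API):

  // Hypothetical sketch only -- these classes are NOT part of Nutch.
  // It just illustrates the idea: tag each URL with a change-frequency
  // category and let that category boost or dampen its fetch score.
  public class CategoryScoring {

      enum UrlCategory { FREQUENTLY_UPDATED, RARELY_UPDATED }

      // Stand-in for the per-URL field that would have to be persisted in the CrawlDb.
      static UrlCategory categoryOf(String url) {
          // e.g. news front pages change often, static docs rarely
          return url.contains("/news/") ? UrlCategory.FREQUENTLY_UPDATED
                                        : UrlCategory.RARELY_UPDATED;
      }

      // Adjust the generate-time score so frequently updated URLs are selected sooner.
      static float adjustScore(String url, float baseScore) {
          switch (categoryOf(url)) {
              case FREQUENTLY_UPDATED: return baseScore * 2.0f;
              default:                 return baseScore * 0.5f;
          }
      }

      public static void main(String[] args) {
          System.out.println(adjustScore("http://example.com/news/today", 1.0f));
          System.out.println(adjustScore("http://example.com/about", 1.0f));
      }
  }
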
Thanks!
Xiao

On Wed, Oct 27, 2010 at 2:04 PM, Scott Gonyea <[email protected]> wrote:
> Lots of things will "work"; the question is all about what you're doing, specifically. I avoid trolling with phrases like "MySQL can't scale" (unless I know I can get a funny response). MySQL works and scales wonderfully for a specific set of problems, is 'more than good enough' for most problems, and will make your life needlessly difficult for some others.
>
> If you post some larger insights into what you want to warehouse from your crawl data, and what you plan to do with it, I can try to give some deeper feedback on how to approach it. But really, nothing too awful can come from putting it into SQL and picking up your own set of lessons. It may well be good enough and have just the right level of convenience for whomever is using it.
>
> There's no real "right" or "wrong" answer, which is what makes some of this stuff a real PITA. Sometimes it'd be nice if someone told me what tool to use--so I could move on with my life and solve the nonsense I was supposed to. It's all still very new right now--but Solr (and thus Lucene) has a fairly established track record in indexing/cataloguing heavily de-normalized internet sludge.
>
> Scott Gonyea
>
> On Tue, Oct 26, 2010 at 10:14 PM, xiao yang <[email protected]> wrote:
>> Hi, Scott,
>>
>> I agree with you on the uselessness of the row-locking and transactional integrity features. But we can reduce the overhead by reading data in blocks: read many rows (1K or more) at a time and process them in memory. Do you think that would work?
>>
>> Thanks!
>> Xiao
>>
>> On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <[email protected]> wrote:
>>> Not that it's guaranteed to be of "next to no value", but really, you've probably already lost pages just crawling them. Server / network errors, for example, take the integrity question and make it a cost-benefit one. Do you recrawl a bunch? At different times? Different geographies?
>>>
>>> Row locking is reasonably nice, but that begs other questions. It can easily be solved in one of two ways: put your data in Solr, and persist your efforts in both places, Solr and an SQL backend; or, if you're using Riak (or Cassandra), allow document collisions to exist and reconcile them within your application.
>>>
>>> It sounds complex, but both are actually quite trivial to implement.
>>>
>>> Scott
>>>
>>> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <[email protected]> wrote:
>>>> I love relational databases, but their many features are (in my opinion) wasted on what you find in Nutch. Row-locking and transactional integrity are great for lots of applications, but become a whole lot of overhead when they're of next-to-no value to whatever you're doing.
>>>>
>>>> RE: counting URLs: have you looked at Solr's facets, etc.? I use them like they're going out of style--and they're very powerful.
>>>>
>>>> For my application, Solr *is* my database. Nutch crawls data, stores it somewhere, then picks it back up and drops it in Solr. All of my crawl data sits in Solr. I actively report on stats from Solr, as well as make updates to the content that's stored. Lots of fields / boolean attributes sit in the schema.
>>>>
>>>> As the user works through the app, their changes get pushed back into Solr. Then, when they next hit "Search", results disappear / move around as they have organized them.
>>>>
>>>> Scott
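(For reference, the per-site counting with facets that Scott describes might look roughly like this with SolrJ; the "host" field, the core name, and the URL below are assumptions about the schema, not anything Nutch sets up for you.)

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.FacetField;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class HostFacetCount {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr =
              new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build();

          SolrQuery q = new SolrQuery("*:*");
          q.setRows(0);              // only the facet counts, not the documents
          q.setFacet(true);
          q.addFacetField("host");   // one bucket per site

          QueryResponse rsp = solr.query(q);
          for (FacetField.Count c : rsp.getFacetField("host").getValues()) {
              System.out.println(c.getName() + " -> " + c.getCount() + " urls");
          }
          solr.close();
      }
  }
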
>>>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <[email protected]> wrote:
>>>>> Hi, Scott,
>>>>>
>>>>> Thanks for your reply.
>>>>> I'm curious about the reason why using a database is awful.
>>>>> Here is my requirement: we have two developers who want to do some processing and analysis work on the crawled data. If the data is stored in a database, we can easily share it, thanks to the well-defined data models. What's more, the analysis results can also be stored back into the database easily, by just adding a few fields.
>>>>> For example, I need to know the average number of URLs per site. In a database, a single SQL query will do. If I want to extract and store the main part of web pages, I can't easily modify Nutch's data structures. Even in Solr, it's difficult and inefficient to iterate through the data set.
>>>>> The crawled data is structured, so why not use a database?
>>>>>
>>>>> Thanks!
>>>>> Xiao
>>>>>
>>>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <[email protected]> wrote:
>>>>>> Use Solr? At its core, Solr is a document database. Using a relational database to warehouse your crawl data is generally an awful idea. I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>>>
>>>>>> I liken crawl data to sludge. Don't try to normalize it. Know what you want to get from it, and expose that data the best way possible. If you want to store it, index it, query it, transform it, collect statistics, etc., Solr is a terrific tool. Amazingly so.
>>>>>>
>>>>>> That said, you also have another very good choice. Take a look at Riak Search. It hijacked many core elements of Solr, which I applaud, and it is compatible with Solr's HTTP interface. In effect, you can point Nutch's solr-index job at a Riak Search node instead and put your data there.
>>>>>>
>>>>>> The other nice thing: Riak is a (self-described) "mini-hadoop". So you can search across the Solr indexes it's built on top of, or you can throw MapReduce jobs at Riak and perform some very detailed analytics.
>>>>>>
>>>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>>>
>>>>>> Scott Gonyea
>>>>>>
>>>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>>>
>>>>>>> Hi, guys,
>>>>>>>
>>>>>>> Nutch has its own data formats for the CrawlDB and LinkDB, which are difficult to manage and share among applications.
>>>>>>> Are there any web crawlers based on a relational database?
>>>>>>> I can see that Nutch is trying to use HBase for storage, but why not use a relational database instead? We could use partitioning to solve the scalability problem.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Xiao
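(For reference, the "a single SQL query will do" example from xiao's message above could look roughly like this over JDBC; the table name "urls", the "host" column, and the connection string are all hypothetical.)

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class AvgUrlsPerSite {
      public static void main(String[] args) throws Exception {
          // Assumes the crawl data was loaded into a relational table "urls(host, url, ...)".
          try (Connection con =
                   DriverManager.getConnection("jdbc:mysql://localhost/crawl", "user", "pass");
               Statement st = con.createStatement();
               ResultSet rs = st.executeQuery(
                   "SELECT AVG(cnt) FROM " +
                   "  (SELECT host, COUNT(*) AS cnt FROM urls GROUP BY host) t")) {
              if (rs.next()) {
                  System.out.println("average urls per site: " + rs.getDouble(1));
              }
          }
      }
  }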

