BTW: if Nutch is going to support vertical searching, I think page URLs should be grouped into three types: fetchable URLs (just fetch them), extractable URLs (fetch them and extract information from the page), and pagination URLs.
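
To make that grouping concrete, here is a rough Java sketch of how it might be expressed; UrlTypeClassifier and PageUrlType are made-up names for illustration, not existing Nutch classes:

    /** Illustrative sketch only -- these names are not existing Nutch types. */
    public interface UrlTypeClassifier {

      /** How a vertical crawl should treat a given URL. */
      enum PageUrlType {
        FETCHABLE,    // just fetch the page, e.g. to discover more outlinks
        EXTRACTABLE,  // fetch the page and run information extraction on it
        PAGINATION    // "next page" links that enumerate further listing pages
      }

      /** Decide which of the three groups a URL belongs to. */
      PageUrlType classify(String url);
    }

A site- or vertical-specific plugin could then implement classify() with whatever URL patterns distinguish listing, detail and pagination pages for that source.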
/Jack

On 1/5/06, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Andrzej
>
> The idea brings vertical search into Nutch and it is definitely great :)
> I think Nutch should add an information retrieving layer into the whole
> architecture, and export some abstract interfaces, say
> UrlBasedInformationRetrieve (you can implement your url grouping idea
> here?), TextBasedInformationRetrieve and DomBasedInformationRetrieve.
> The user can then implement these on their own in their vertical search.
>
> /Jack
>
> On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I've been toying with the following idea, which is an extension of the
> > existing URLFilter mechanism and the concept of a "crawl frontier".
> >
> > Let's suppose we have several initial seed urls, each with a different
> > subjective quality. We would like to crawl these, and expand the
> > "crawling frontier" using outlinks. However, we don't want to do it
> > uniformly for every initial url, but rather propagate a certain
> > "crawling policy" through the expanding trees of linked pages. This
> > "crawling policy" could consist of url filters, scoring methods, etc. -
> > basically anything configurable in Nutch could be included in this
> > "policy". Perhaps it could even be the new version of non-static
> > NutchConf ;-)
> >
> > Then, if a given initial url is a known high-quality source, we would
> > like to apply a "favor" policy, where we e.g. add pages linked from
> > that url and, in doing so, give them a higher score. Recursively, we
> > could apply the same policy to the next generation of pages, or perhaps
> > only to pages belonging to the same domain. So, in a sense, the
> > original notion of high quality would cascade down to other linked
> > pages. The important aspect to note is that all newly discovered pages
> > would be subject to the same policy - unless we have compelling reasons
> > to switch the policy (from "favor" to "default" or to "distrust"),
> > which at that point would essentially change the shape of the expanding
> > frontier.
> >
> > If a given initial url is a known spammer, we would like to apply a
> > "distrust" policy for adding pages linked from that url (e.g. adding or
> > not adding; if adding, then lowering their score, or applying a
> > different score calculation). And recursively we could apply a similar
> > policy of "distrust" to any pages discovered this way. We could also
> > change the policy along the way, if there are compelling reasons to do
> > so. This means that we could follow some high-quality links from
> > low-quality pages, without drilling down into sites which are known to
> > be of low quality.
> >
> > Special care needs to be taken if the same page is discovered from
> > pages with different policies; I haven't thought about this aspect
> > yet... ;-)
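
To make the "policy" idea a bit more concrete, here is a very rough sketch of what such a per-frontier policy could look like. CrawlPolicy and all of the method names below are hypothetical - only the URLFilter-style accept/reject decision and the score calculation correspond to things Nutch already does:

    // Hypothetical sketch -- CrawlPolicy and these methods do not exist in
    // Nutch; they only illustrate how a "crawling policy" could bundle
    // per-frontier filtering and scoring.
    public interface CrawlPolicy {

      /** Accept or reject an outlink discovered under this policy
       *  (this is where a policy-specific set of URLFilters would run). */
      boolean accept(String url);

      /** Score assigned to a newly discovered page, e.g. boosted by a
       *  "favor" policy and damped by a "distrust" policy. */
      float initialScore(String url, float parentScore);

      /** Id of the policy that pages linked from fromUrl should inherit;
       *  returning a different id here is how the frontier could switch,
       *  e.g. from "favor" to "distrust", when it runs into known spam. */
      String policyIdForOutlinks(String fromUrl, String toUrl);
    }

A "favor" implementation would boost initialScore() and keep handing out its own id; a "distrust" implementation would reject most outlinks or damp their scores; and policyIdForOutlinks() is the hook for changing the policy on the fly.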
> > What would be the benefits of such an approach?
> >
> > * The initial page + policy would together control the expanding
> > crawling frontier, and they could be defined differently for different
> > starting pages. I.e. in a single web database we could keep different
> > "collections" or "areas of interest" with differently specified
> > policies, but still reap the benefits of a single web db, namely the
> > link information.
> >
> > * URLFilters could be grouped into several policies, and it would be
> > easy to switch between them, or to edit them.
> >
> > * If the crawl process realizes it has ended up on a spam page, it can
> > switch the page's policy to "distrust" (or the other way around) and
> > stop crawling unwanted content. From now on, the pages linked from that
> > page will follow the new policy. In other words, if a crawling frontier
> > reaches pages with known quality problems, it would be easy to change
> > the policy on the fly to avoid them, or pages linked from them, without
> > resorting to modifications of URLFilters.
> >
> > Some of the above you can do even now with URLFilters, but any change
> > you make now has global consequences. You may also end up with awfully
> > complicated rules if you try to cover all cases in one rule set.
> >
> > How to implement it? Surprisingly, I think it's very simple - just
> > adding a CrawlDatum.policyId field would suffice, assuming we have a
> > means to store and retrieve these policies by ID; we would then
> > instantiate the policy and call its methods wherever we use the
> > URLFilters and do the score calculations today.
> >
> > Any comments?
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
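
Tying this back to the proposed CrawlDatum.policyId field: a minimal sketch, reusing the hypothetical CrawlPolicy interface above, of how policies could be stored by id and consulted at the points where Nutch runs URLFilters and score calculations today. PolicyRepository and PolicyDecision are likewise made-up names, not part of Nutch:

    // Hypothetical sketch of resolving the proposed per-page policyId and
    // applying it to an outlink. Only the idea of a policyId stored in
    // CrawlDatum comes from the proposal above; the rest is illustrative.
    public class PolicyRepository {

      private final java.util.Map<String, CrawlPolicy> policies =
          new java.util.HashMap<String, CrawlPolicy>();

      public void register(String id, CrawlPolicy policy) {
        policies.put(id, policy);
      }

      /** Fall back to a "default" policy if the id is unknown. */
      public CrawlPolicy get(String id) {
        CrawlPolicy p = policies.get(id);
        return p != null ? p : policies.get("default");
      }

      /** Called for each outlink instead of the global URLFilters: the
       *  parent's policy decides whether to add the link, what score to
       *  give it, and which policy the new page inherits. */
      public PolicyDecision processOutlink(String parentPolicyId,
                                           String fromUrl, String toUrl,
                                           float parentScore) {
        CrawlPolicy policy = get(parentPolicyId);
        if (!policy.accept(toUrl)) {
          return null;  // link rejected under this policy
        }
        float score = policy.initialScore(toUrl, parentScore);
        String childPolicyId = policy.policyIdForOutlinks(fromUrl, toUrl);
        return new PolicyDecision(toUrl, score, childPolicyId);
      }

      /** Simple value object for the result; also hypothetical. */
      public static class PolicyDecision {
        public final String url;
        public final float score;
        public final String policyId;

        public PolicyDecision(String url, float score, String policyId) {
          this.url = url;
          this.score = score;
          this.policyId = policyId;
        }
      }
    }

The open question above - the same page being discovered from pages with different policies - would show up here as two PolicyDecisions for the same URL, which the db update step would still have to reconcile somehow.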
