BTW: if Nutch is going to support vertical searching, I think page URLs should be grouped into three types: fetchable URLs (just fetch them), extractable URLs (fetch them and extract information from the page), and pagination URLs.
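
To make that grouping concrete, here is a rough Java sketch of how it might be expressed; UrlTypeClassifier and PageUrlType are made-up names for illustration, not existing Nutch classes:

    /** Illustrative sketch only -- these names are not existing Nutch types. */
    public interface UrlTypeClassifier {

      /** How a vertical crawl should treat a given URL. */
      enum PageUrlType {
        FETCHABLE,    // just fetch the page, e.g. to discover more outlinks
        EXTRACTABLE,  // fetch the page and run information extraction on it
        PAGINATION    // "next page" links that enumerate further listing pages
      }

      /** Decide which of the three groups a URL belongs to. */
      PageUrlType classify(String url);
    }

A site- or vertical-specific plugin could then implement classify() with whatever URL patterns distinguish listing, detail and pagination pages for that source.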
/Jack

On 1/5/06, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Andrzej
>
> The idea brings vertical search into Nutch and it is definitely great :)
> I think Nutch should add an information retrieving layer into the whole
> architecture, and export some abstract interfaces, say
> UrlBasedInformationRetrieve (you can implement your url grouping idea
> here?), TextBasedInformationRetrieve and DomBasedInformationRetrieve.
> The user can then implement these on their own in their vertical search.
>
> /Jack
>
> On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I've been toying with the following idea, which is an extension of the
> > existing URLFilter mechanism and the concept of a "crawl frontier".
> >
> > Let's suppose we have several initial seed urls, each with a different
> > subjective quality. We would like to crawl these, and expand the
> > "crawling frontier" using outlinks. However, we don't want to do it
> > uniformly for every initial url, but rather propagate a certain
> > "crawling policy" through the expanding trees of linked pages. This
> > "crawling policy" could consist of url filters, scoring methods, etc. -
> > basically anything configurable in Nutch could be included in this
> > "policy". Perhaps it could even be the new version of non-static
> > NutchConf ;-)
> >
> > Then, if a given initial url is a known high-quality source, we would
> > like to apply a "favor" policy, where we e.g. add pages linked from
> > that url and, in doing so, give them a higher score. Recursively, we
> > could apply the same policy to the next generation of pages, or perhaps
> > only to pages belonging to the same domain. So, in a sense, the
> > original notion of high quality would cascade down to other linked
> > pages. The important aspect to note is that all newly discovered pages
> > would be subject to the same policy - unless we have compelling reasons
> > to switch the policy (from "favor" to "default" or to "distrust"),
> > which at that point would essentially change the shape of the expanding
> > frontier.
> >
> > If a given initial url is a known spammer, we would like to apply a
> > "distrust" policy for adding pages linked from that url (e.g. adding or
> > not adding; if adding, then lowering their score, or applying a
> > different score calculation). And recursively we could apply a similar
> > policy of "distrust" to any pages discovered this way. We could also
> > change the policy along the way, if there are compelling reasons to do
> > so. This means that we could follow some high-quality links from
> > low-quality pages, without drilling down into sites which are known to
> > be of low quality.
> >
> > Special care needs to be taken if the same page is discovered from
> > pages with different policies; I haven't thought about this aspect
> > yet... ;-)
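
To make the "policy" idea a bit more concrete, here is a very rough sketch of what such a per-frontier policy could look like. CrawlPolicy and all of the method names below are hypothetical - only the URLFilter-style accept/reject decision and the score calculation correspond to things Nutch already does:

    // Hypothetical sketch -- CrawlPolicy and these methods do not exist in
    // Nutch; they only illustrate how a "crawling policy" could bundle
    // per-frontier filtering and scoring.
    public interface CrawlPolicy {

      /** Accept or reject an outlink discovered under this policy
       *  (this is where a policy-specific set of URLFilters would run). */
      boolean accept(String url);

      /** Score assigned to a newly discovered page, e.g. boosted by a
       *  "favor" policy and damped by a "distrust" policy. */
      float initialScore(String url, float parentScore);

      /** Id of the policy that pages linked from fromUrl should inherit;
       *  returning a different id here is how the frontier could switch,
       *  e.g. from "favor" to "distrust", when it runs into known spam. */
      String policyIdForOutlinks(String fromUrl, String toUrl);
    }

A "favor" implementation would boost initialScore() and keep handing out its own id; a "distrust" implementation would reject most outlinks or damp their scores; and policyIdForOutlinks() is the hook for changing the policy on the fly.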
> > What would be the benefits of such an approach?
> >
> > * The initial page + policy would together control the expanding
> > crawling frontier, and they could be defined differently for different
> > starting pages. I.e. in a single web database we could keep different
> > "collections" or "areas of interest" with differently specified
> > policies, but still reap the benefits of a single web db, namely the
> > link information.
> >
> > * URLFilters could be grouped into several policies, and it would be
> > easy to switch between them, or to edit them.
> >
> > * If the crawl process realizes it has ended up on a spam page, it can
> > switch the page's policy to "distrust" (or the other way around) and
> > stop crawling unwanted content. From now on, the pages linked from that
> > page will follow the new policy. In other words, if a crawling frontier
> > reaches pages with known quality problems, it would be easy to change
> > the policy on the fly to avoid them, or pages linked from them, without
> > resorting to modifications of URLFilters.
> >
> > Some of the above you can do even now with URLFilters, but any change
> > you make now has global consequences. You may also end up with awfully
> > complicated rules if you try to cover all cases in one rule set.
> >
> > How to implement it? Surprisingly, I think it's very simple - just
> > adding a CrawlDatum.policyId field would suffice, assuming we have a
> > means to store and retrieve these policies by ID; we would then
> > instantiate the policy and call its methods wherever we use the
> > URLFilters and do the score calculations today.
> >
> > Any comments?
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
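
Tying this back to the proposed CrawlDatum.policyId field: a minimal sketch, reusing the hypothetical CrawlPolicy interface above, of how policies could be stored by id and consulted at the points where Nutch runs URLFilters and score calculations today. PolicyRepository and PolicyDecision are likewise made-up names, not part of Nutch:

    // Hypothetical sketch of resolving the proposed per-page policyId and
    // applying it to an outlink. Only the idea of a policyId stored in
    // CrawlDatum comes from the proposal above; the rest is illustrative.
    public class PolicyRepository {

      private final java.util.Map<String, CrawlPolicy> policies =
          new java.util.HashMap<String, CrawlPolicy>();

      public void register(String id, CrawlPolicy policy) {
        policies.put(id, policy);
      }

      /** Fall back to a "default" policy if the id is unknown. */
      public CrawlPolicy get(String id) {
        CrawlPolicy p = policies.get(id);
        return p != null ? p : policies.get("default");
      }

      /** Called for each outlink instead of the global URLFilters: the
       *  parent's policy decides whether to add the link, what score to
       *  give it, and which policy the new page inherits. */
      public PolicyDecision processOutlink(String parentPolicyId,
                                           String fromUrl, String toUrl,
                                           float parentScore) {
        CrawlPolicy policy = get(parentPolicyId);
        if (!policy.accept(toUrl)) {
          return null;  // link rejected under this policy
        }
        float score = policy.initialScore(toUrl, parentScore);
        String childPolicyId = policy.policyIdForOutlinks(fromUrl, toUrl);
        return new PolicyDecision(toUrl, score, childPolicyId);
      }

      /** Simple value object for the result; also hypothetical. */
      public static class PolicyDecision {
        public final String url;
        public final float score;
        public final String policyId;

        public PolicyDecision(String url, float score, String policyId) {
          this.url = url;
          this.score = score;
          this.policyId = policyId;
        }
      }
    }

The open question above - the same page being discovered from pages with different policies - would show up here as two PolicyDecisions for the same URL, which the db update step would still have to reconcile somehow.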
