For others working on a vertical search scenario, I've been having good luck with the following steps.

It starts with a bit of a manual process to obtain a good seed list. For my current business I already had a basic list of about 7,500 unique links to the home pages of companies in my industry, but I then wanted to widen the scope to include any sites in my business category, not just a list of companies.

So now what:

1.) I found a nice desktop spider application to automate the process. Visual Web Spider (http://www.newprosoft.com) has turned out to be a very good tool for setting up very focused crawls. This spider app has some "fantastic" filters (that I wish Nutch had) that let me configure a crawl: depth of crawl, a per-domain page limit, url stop words, url include words and more...

a.) Custom List Crawling: First I crawled the sites in my current list of urls to a depth of x, telling the spider to stay within each domain and not spider external sites. This widened my existing seed list.

b.) Search Engine Crawling: Visual Web Spider also has an easy-to-use function that lets me crawl Google, Yahoo, All The Web and AltaVista. So I created some URL fetch queries to pull industry-specific result pages from these search engines. Again, I could configure depth, max pages, etc. from these starting points.

Example: http://www.google.com/search?q=ExampleSearchTerm&hl=en&num=40&start=0 (the crawler will grab the first 40 results for my search term). I could then tell the spider how deep to crawl each result and how many pages to grab, etc.
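
If you want a whole batch of these query urls (several terms, several result pages), a small throwaway script does the trick. A rough Java sketch, with placeholder terms and page counts:

    import java.io.PrintWriter;
    import java.net.URLEncoder;

    // Rough sketch: write paginated Google query urls for a few search
    // terms so they can be fed to the spider as starting points.
    // The terms, page size and page count are placeholders.
    public class SeedQueryList {
        public static void main(String[] args) throws Exception {
            String[] terms = { "ExampleSearchTerm", "AnotherTerm" };
            int pageSize = 40;  // results per page (num=)
            int pages = 3;      // how many result pages to request per term
            PrintWriter out = new PrintWriter("search-seeds.txt");
            for (String term : terms) {
                String q = URLEncoder.encode(term, "UTF-8");
                for (int p = 0; p < pages; p++) {
                    out.println("http://www.google.com/search?q=" + q
                        + "&hl=en&num=" + pageSize
                        + "&start=" + (p * pageSize));
                }
            }
            out.close();
        }
    }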

c.) Site Crawling: There are a few dozen other "Hub Sites" in my industry that offer excellent content, and I wanted to index the majority of their content. So I set up a new task and told the spider to grab all the pages in those domains only - but ran it against a filter to exclude forums and some other content areas that I did not want to spider.


2.) In my nutch-site.xml conf file I then set "db.max.outlinks.per.page" to 10 (the default was 100). This keeps the Nutch fetch lists smaller and a bit more focused.
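
For anyone doing the same, the override in nutch-site.xml is just the usual property block, roughly:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>10</value>
      <description>Cap on the number of outlinks taken from each page
      (the stock default is 100).</description>
    </property>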

In a short period of time I had tens of thousands of focused seed urls to inject into Nutch.
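
(The injection itself is just the standard step; if I remember the usage right it's something like "bin/nutch inject db -urlfile seeds.txt" with the 0.7 tools, or "bin/nutch inject crawldb seeds" with the mapred code - but check the usage printed by bin/nutch inject for your build, since the arguments differ between versions.)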

These steps have allowed me to start a vertical search engine without letting Nutch go too far in its own fetches/crawls, thus limiting the number of urls that get in. I still get a fair number of off-topic sites in the database, but for the most part it's not a problem because Nutch indexes the database so well. I love Nutch!


Neal


At 01:38 PM 1/5/2006, you wrote:
Andrzej,

This sounds like another great way to build more of a vertical search application. By defining trusted seed sources we can limit the scope of the crawl to a more suitable set of links.

Also, being able to apply additional rules by domain/host or by trusted source would be great as well. E.g. if "trusted", allow crawling of dynamic content and allow up to N pages of "?" urls. Or even have a trusted URL list where specific hosts would be crawled for dynamic content. This may be similar to the "Hub" concept of Google, where certain sites carry a heavier weight - though perhaps being able to apply this manually would be suitable in a Nutch vertical implementation.
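
Just to make the trusted-host part concrete with what's already there: the regex URLFilter applies its rules in order and the first match wins, so dynamic urls can be whitelisted for specific hosts ahead of the global "?" exclusion (the host below is made up). The "up to N pages of '?' urls" part is exactly what can't be expressed this way today.

    # allow dynamic ("?") urls on a trusted hub (placeholder host)
    +^http://([a-z0-9]*\.)*trusted-hub-example.com/
    # skip urls containing these characters everywhere else (stock rule)
    -[?*!@=]
    # accept everything else
    +.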

A "quality" score could also be calculated using a set of "core keywords" that apply to a vertical. So a list of several hundred core words could try to match words on the page that is being crawled. When the crawler finds sites with these words it gives it a bump in it's quality score - hence allowing for example a deeper crawl of that site and extended crawling of outlinks.

I imagine extensive rule lists/filters like this might put a strain on a full web crawl. But for those of us who are only going to crawl a certain segment of the net (say 500,000 urls or so), this would not slow things down too badly.

Neal Whitley


At 08:58 AM 1/5/2006, you wrote:

Hi,

I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a "crawl frontier".

Let's suppose we have several initial seed urls, each with a different subjective quality. We would like to crawl these, and expand the "crawling frontier" using outlinks. However, we don't want to do it uniformly for every initial url, but rather propagate certain "crawling policy" through the expanding trees of linked pages. This "crawling policy" could consist of url filters, scoring methods, etc - basically anything configurable in Nutch could be included in this "policy". Perhaps it could even be the new version of non-static NutchConf ;-)

Then, if a given initial url is a known high-quality source, we would like to apply a "favor" policy, where we e.g. add pages linked from that url, and in doing so we give them a higher score. Recursively, we could apply the same policy for the next generation pages, or perhaps only for pages belonging to the same domain. So, in a sense the original notion of high-quality would cascade down to other linked pages. The important aspect of this to note is that all newly discovered pages would be subject to the same policy - unless we have compelling reasons to switch the policy (from "favor" to "default" or to "distrust"), which at that point would essentially change the shape of the expanding frontier.

If a given initial url is a known spammer, we would like to apply a "distrust" policy for adding pages linked from that url (e.g. adding or not adding, if adding then lowering their score, or applying different score calculation). And recursively we could apply a similar policy of "distrust" to any pages discovered this way. We could also change the policy on the way, if there are compelling reasons to do so. This means that we could follow some high-quality links from low-quality pages, without drilling down the sites which are known to be of low quality.

Special care needs to be taken if the same page is discovered from pages with different policies; I haven't thought about this aspect yet... ;-)

What would be the benefits of such an approach?

* the initial page + policy would together control the expanding crawling frontier, and the policy could be defined differently for different starting pages. I.e. in a single web database we could keep different "collections" or "areas of interest" with differently specified policies, but still reap the benefits of a single web db, namely the link information.

* URLFilters could be grouped into several policies, and it would be easy to switch between them, or edit them.

* if the crawl process realizes it ended up on a spam page, it can switch the page policy to "distrust", or the other way around, and stop crawling unwanted content. From now on the pages linked from that page will follow the new policy. In other words, if a crawling frontier reaches pages with known quality problems, it would be easy to change the policy on-the-fly to avoid them or pages linked from them, without resorting to modifications of URLFilters.

Some of the above you can do even now with URLFilters, but any change you make has global consequences. You may also end up with awfully complicated rules if you try to cover all cases in one rule set.

How to implement it? Surprisingly, I think that it's very simple - just adding a CrawlDatum.policyId field would suffice, assuming we have a means to store and retrieve these policies by ID; and then instantiate it and call appropriate methods whenever we use today the URLFilters and do the score calculations.
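
A very rough sketch of the shape I have in mind (the names are invented just for illustration, nothing like this exists in the code today):

    import java.util.HashMap;
    import java.util.Map;

    // A policy is looked up by the id stored on each CrawlDatum and plays
    // the role of URLFilters plus the score calculation for that part of
    // the crawl frontier. Names are made up for illustration only.
    interface CrawlPolicy {
        /** Accept or reject a newly discovered url (the URLFilter role). */
        boolean accept(String url);

        /** Score to assign to an outlink discovered under this policy. */
        float scoreOutlink(String fromUrl, String toUrl, float initialScore);

        /** Which policy the outlink inherits: usually this one, but it may
            switch, e.g. from "favor" to "distrust". */
        String outlinkPolicyId(String fromUrl, String toUrl);
    }

    // Policies stored and retrieved by id, e.g. loaded from configuration.
    class CrawlPolicyRegistry {
        private final Map<String, CrawlPolicy> policies =
            new HashMap<String, CrawlPolicy>();

        void register(String id, CrawlPolicy policy) {
            policies.put(id, policy);
        }

        CrawlPolicy get(String id) {
            CrawlPolicy p = policies.get(id);
            return (p != null) ? p : policies.get("default");
        }
    }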

Any comments?

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




