Andrzej,
This sounds like another great way to build more of a vertical
search application. By defining trusted seed sources we can
limit the scope of the crawl to a more suitable set of links.
Also, being able to apply additional rules by domain/host or by
trusted source would be great. E.g., if a source is "trusted", allow
crawling of dynamic content and allow up to N pages of "?" URLs. Or
even have a trusted URL list where specific hosts would be crawled
for dynamic content. This may be similar to the "Hub" concept of
Google where certain sites carry a heavier weight, though perhaps
being able to apply this manually would be suitable in a Nutch
vertical implementation.
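A rough sketch of the per-host rule described above, in Python for brevity. The class name TrustedHostFilter, its parameters, and the example hosts are all made up for illustration; this is not Nutch's URLFilter API, only one way the "trusted hosts may have up to N dynamic URLs" idea could look.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrustedHostFilter:
    """Hypothetical rule: only trusted hosts may have "?" URLs crawled,
    and only up to max_dynamic_per_host of them."""

    def __init__(self, trusted_hosts, max_dynamic_per_host):
        self.trusted_hosts = set(trusted_hosts)
        self.max_dynamic_per_host = max_dynamic_per_host
        self.dynamic_seen = Counter()  # dynamic URLs accepted so far, per host

    def accept(self, url):
        parts = urlsplit(url)
        host = parts.hostname or ""
        if not parts.query:
            return True  # static URL: no extra rule applies in this sketch
        if host not in self.trusted_hosts:
            return False  # dynamic content only from trusted hosts
        if self.dynamic_seen[host] >= self.max_dynamic_per_host:
            return False  # the N-page budget for this host is used up
        self.dynamic_seen[host] += 1
        return True
```

With N = 2, a filter built as TrustedHostFilter({"trusted.example"}, 2) would accept the first two "?" URLs from trusted.example, reject the third, and reject every "?" URL from any other host.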
A "quality" score could also be calculated using a set of "core
keywords" that apply to a vertical. So a list of several hundred
core words could be matched against the words on the page that is
being crawled. When the crawler finds a site with these words, it
gives the site a bump in its quality score, allowing, for example, a
deeper crawl of that site and extended crawling of its outlinks.
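One possible reading of the keyword idea above, as a minimal Python sketch. The scoring formula (fraction of core keywords found on the page), the function names, and the depth and threshold numbers are all assumptions chosen for illustration; nothing here is an existing Nutch scoring method.

```python
import re


def quality_score(page_text, core_keywords):
    """Fraction of the vertical's core keywords that occur on the page."""
    words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    keywords = {k.lower() for k in core_keywords}
    if not keywords:
        return 0.0
    return len(words & keywords) / len(keywords)


def crawl_depth(score, base_depth=2, bonus_depth=3, threshold=0.5):
    """Allow a deeper crawl for pages whose score reaches the threshold."""
    return base_depth + bonus_depth if score >= threshold else base_depth
```

A page matching half of the core list would get the deeper crawl under these made-up defaults; a real list of several hundred words would probably want a lower threshold or a weighted score.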
I imagine extensive rule lists/filters like this might put a strain
on a full web crawl. But for those of us who are only going to be
crawling a certain segment of the net (say 500,000 URLs or so), this
would not slow things down too badly.
Neal Whitley
At 08:58 AM 1/5/2006, you wrote:
Hi,
I've been toying with the following idea, which is an extension of
the existing URLFilter mechanism and the concept of a "crawl frontier".
Let's suppose we have several initial seed urls, each with a
different subjective quality. We would like to crawl these, and
expand the "crawling frontier" using outlinks. However, we don't
want to do it uniformly for every initial url, but rather propagate
certain "crawling policy" through the expanding trees of linked
pages. This "crawling policy" could consist of url filters, scoring
methods, etc - basically anything configurable in Nutch could be
included in this "policy". Perhaps it could even be the new version
of non-static NutchConf ;-)
Then, if a given initial url is a known high-quality source, we
would like to apply a "favor" policy, where we e.g. add pages linked
from that url, and in doing so we give them a higher score.
Recursively, we could apply the same policy for the next generation
pages, or perhaps only for pages belonging to the same domain. So,
in a sense the original notion of high-quality would cascade down to
other linked pages. The important aspect of this to note is that all
newly discovered pages would be subject to the same policy - unless
we have compelling reasons to switch the policy (from "favor" to
"default" or to "distrust"), which at that point would essentially
change the shape of the expanding frontier.
If a given initial url is a known spammer, we would like to apply a
"distrust" policy for adding pages linked from that url (e.g. adding
or not adding, if adding then lowering their score, or applying a
different score calculation). And recursively we could apply a
similar policy of "distrust" to any pages discovered this way. We
could also change the policy on the way, if there are compelling
reasons to do so. This means that we could follow some high-quality
links from low-quality pages, without drilling down the sites which
are known to be of low quality.
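The "favor" / "distrust" propagation described above could be sketched roughly as follows, in Python. The Policy fields, the score factors, and the overrides argument (standing in for "compelling reasons to switch the policy") are assumptions for illustration only; the message does not specify any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    score_factor: float   # multiplier applied to the parent's score
    add_outlinks: bool    # whether links under this policy join the frontier


# Hypothetical policy table; the numbers are arbitrary.
POLICIES = {
    "favor": Policy(score_factor=2.0, add_outlinks=True),
    "default": Policy(score_factor=1.0, add_outlinks=True),
    "distrust": Policy(score_factor=0.5, add_outlinks=False),
}


def expand(parent_policy, parent_score, outlinks, overrides=None):
    """Return (url, policy_name, score) for each outlink added to the frontier.

    Each outlink inherits the parent's policy unless `overrides` names a
    different one for that URL.
    """
    overrides = overrides or {}
    frontier = []
    for link in outlinks:
        name = overrides.get(link, parent_policy)
        policy = POLICIES[name]
        if not policy.add_outlinks:
            continue  # e.g. "distrust" in this sketch drops the link
        frontier.append((link, name, parent_score * policy.score_factor))
    return frontier
```

Under these assumptions, a distrusted page contributes nothing to the frontier except links that were explicitly switched to another policy, which matches the idea of following some high-quality links from low-quality pages.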
Special care needs to be taken if the same page is discovered from
pages with different policies. I haven't thought about this aspect yet... ;-)
What would be the benefits of such an approach?
* the initial page + policy would both control the expanding
crawling frontier, and it could be differently defined for different
starting pages. I.e. in a single web database we could keep
different "collections" or "areas of interest" with differently
specified policies. But still we could reap the benefits of a single
web db, namely the link information.
* URLFilters could be grouped into several policies, and it would be
easy to switch between them, or edit them.
* if the crawl process realizes it ended up on a spam page, it can
switch the page policy to "distrust", or the other way around, and
stop crawling unwanted content. From now on the pages linked from
that page will follow the new policy. In other words, if a crawling
frontier reaches pages with known quality problems, it would be easy
to change the policy on-the-fly to avoid them or pages linked from
them, without resorting to modifications of URLFilters.
Some of the above you can do even now with URLFilters, but any
change you make now has global consequences. You may also end up with
awfully complicated rules if you try to cover all cases in one rule set.
How to implement it? Surprisingly, I think that it's very simple:
just adding a CrawlDatum.policyId field would suffice, assuming we
have a means to store and retrieve these policies by ID. We would
then instantiate the policy and call the appropriate methods wherever
we use URLFilters and do the score calculations today.
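To make the policyId lookup concrete, here is a small Python sketch. The CrawlDatum below is a simplified stand-in, not Nutch's class, and POLICY_STORE, CrawlPolicy, and process_outlink are hypothetical names; the only detail taken from the proposal is that each datum carries a policy ID which is resolved to a filter and a scoring method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CrawlPolicy:
    url_filter: Callable[[str], bool]   # plays the role of the URLFilters
    score: Callable[[float], float]     # plays the role of score calculation


@dataclass
class CrawlDatum:
    # Simplified stand-in for illustration; policy_id is the proposed field.
    url: str
    score: float
    policy_id: int


# Hypothetical store of policies, retrievable by ID.
POLICY_STORE: Dict[int, CrawlPolicy] = {
    0: CrawlPolicy(lambda url: "?" not in url, lambda s: s),   # "default"
    1: CrawlPolicy(lambda url: True, lambda s: s * 2.0),       # "favor"
    2: CrawlPolicy(lambda url: False, lambda s: s * 0.5),      # "distrust"
}


def process_outlink(parent: CrawlDatum, outlink: str) -> Optional[CrawlDatum]:
    """Apply the parent's policy to an outlink; the child inherits the ID."""
    policy = POLICY_STORE[parent.policy_id]
    if not policy.url_filter(outlink):
        return None
    return CrawlDatum(outlink, policy.score(parent.score), parent.policy_id)
```

Because the child copies the parent's policy_id, the policy cascades through the frontier without any global change to the filters; switching a page's policy is just a matter of rewriting that one field.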
Any comments?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers