Matt,
This is great! It would be very useful to Nutch developers if your code
could be shared. I'm sure quite a few applications will benefit from it,
because it fills a gap between whole-web crawling and single-site (or a
handful of sites) crawling. I'll be interested in adapting your plugin
to Nutch conventions.
Thanks,
-AJ
Matt Kangas wrote:
AJ and Earl,
I've implemented URLFilters before. In fact, I have a
WhitelistURLFilter that implements just what you describe: a
hashtable of regex-lists. We implemented it specifically because we
want to be able to crawl a large number of known-good paths through
sites, including paths through CGIs. The hash is a Nutch ArrayFile,
which provides low runtime overhead. We've tested it on 200+ sites
thus far, and haven't seen any indication that it will have problems
scaling further.
The filter and its supporting WhitelistWriter currently rely on a few
custom classes, but it should be straightforward to adapt to Nutch
naming conventions, etc. If you're interested in doing this work, I
can see if it's ok to publish our code.
BTW, we're currently alpha-testing the site that uses this plugin,
and preparing for a public beta. I'll be sure to post here when we're
finally open for business. :)
--Matt
On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
it seems that a new urlfilter is a good place to extend the
inclusion-regex capability. The new urlfilter will be defined by the
urlfilter.class property, which gets loaded by the URLFilterFactory.
Regexes are necessary because you want to include URLs matching certain
patterns.
Can anybody who has implemented a URLFilter plugin before share some
thoughts about this approach? I expect the new filter must have all the
capabilities that the current RegexURLFilter.java has, so that it
won't require changes in any other classes. The difference is that
the new filter uses a hash table to efficiently look up the regexes
for the included domains (a large number!).
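For what it's worth, the reason a drop-in filter shouldn't require
changes elsewhere is that whoever loads the filters just applies them in
sequence, and any filter returning null drops the URL. Roughly (the
class and method names below are illustrative, not the actual Nutch
classes):

  // Illustrative only; names are made up, not Nutch's own.
  interface UrlFilterSketch {
    // Return the (possibly rewritten) URL to keep it, or null to reject it.
    String filter(String urlString);
  }

  class FilterChainSketch {
    private final UrlFilterSketch[] filters;

    FilterChainSketch(UrlFilterSketch... filters) {
      this.filters = filters;
    }

    // Applies every configured filter; any null short-circuits to rejection.
    String apply(String url) {
      for (UrlFilterSketch f : filters) {
        url = f.filter(url);
        if (url == null) {
          return null;
        }
      }
      return url;
    }
  }

As long as the new filter honors that contract, it can sit alongside (or
replace) RegexURLFilter without touching anything else.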
BTW, I can't find the urlfilter.class property in any of the
configuration files in Nutch 0.7. Does version 0.7 still support the
urlfilter extension? Is there any difference relative to what's
described in the DissectingTheNutchCrawler doc cited above?
Thanks,
AJ
Earl Cahill wrote:
The goal is to avoid entering 100,000 regexes in
crawl-urlfilter.txt and checking ALL these regexes for each URL. Any
comment?
Sure seems like just some hash lookup table could
handle it. I am having a hard time seeing when you
really need a regex and a fixed list wouldn't do. Especially if you
have a forward and maybe a backward
lookup as well in a multi-level hash, to perhaps
include/exclude at a certain subdomain level, like
include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)
and kind of walk backwards, kind of like DNS. Then
you could just do a few hash lookups instead of
100,000 regexes.
I realize I am talking about host-level and not page-level
filtering, but if you want to include everything from
your 100,000 sites, I think such a strategy could
work.
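Something like this could express that reversed, DNS-style walk (purely
a sketch of the idea; all names are made up for illustration):

  import java.util.HashMap;
  import java.util.Map;

  public class HostSuffixFilterSketch {

    // Rules keyed by reversed host, e.g. "com.site.good" -> include.
    private final Map<String, Boolean> rules = new HashMap<String, Boolean>();

    public void addRule(String reversedHost, boolean include) {
      rules.put(reversedHost, include);
    }

    // Walk the labels from the TLD inward (com -> com.site -> com.site.good ...)
    // and let the most specific matching rule win.
    public boolean accepts(String host) {
      String[] labels = host.toLowerCase().split("\\.");
      StringBuilder key = new StringBuilder();
      boolean decision = false;               // default: exclude unknown hosts
      for (int i = labels.length - 1; i >= 0; i--) {
        if (key.length() > 0) {
          key.append('.');
        }
        key.append(labels[i]);
        Boolean rule = rules.get(key.toString());
        if (rule != null) {
          decision = rule.booleanValue();
        }
      }
      return decision;
    }

    public static void main(String[] args) {
      HostSuffixFilterSketch f = new HostSuffixFilterSketch();
      f.addRule("com.site.good", true);        // include good.site.com
      f.addRule("com.site.bad", false);        // exclude bad.site.com
      System.out.println(f.accepts("good.site.com"));  // true
      System.out.println(f.accepts("bad.site.com"));   // false
    }
  }

Each URL then costs only as many hash lookups as its host has labels,
instead of 100,000 regex evaluations.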
Hope this makes sense. Maybe I could write some code
and see if it works in practice. If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.
Earl
--
Matt Kangas / [EMAIL PROTECTED]
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---------------------------------------------------