Great! Is there a ticket in JIRA requesting this feature? If not, we
should file one and get a few votes in favor of it. AFAIK, that's the
process for getting new features into Nutch.
On Sep 2, 2005, at 1:30 PM, AJ Chen wrote:
Matt,
This is great! It would be very useful to Nutch developers if your
code could be shared. I'm sure quite a few applications will benefit
from it, because it fills the gap between whole-web crawling and
single-site (or a handful of sites) crawling. I'd be interested
in adapting your plugin to Nutch conventions.
Thanks,
-AJ
Matt Kangas wrote:
AJ and Earl,
I've implemented URLFilters before. In fact, I have a
WhitelistURLFilter that implements just what you describe: a
hashtable of regex-lists. We implemented it specifically because
we want to be able to crawl a large number of known-good paths
through sites, including paths through CGIs. The hash is a Nutch
ArrayFile, which provides low runtime overhead. We've tested it
on 200+ sites thus far, and haven't seen any indication that it
will have problems scaling further.
The filter and its supporting WhitelistWriter currently rely on a
few custom classes, but it should be straightforward to adapt to
Nutch naming conventions, etc. If you're interested in doing this
work, I can see if it's ok to publish our code.
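(For illustration only, not Matt's actual code: a minimal in-memory
version of such a filter might look like the sketch below. The class
and method names are made up, and a real version would presumably keep
the table on disk, e.g. in the ArrayFile Matt mentions, rather than in
a Java HashMap.)

  import java.net.URL;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Pattern;

  // Hypothetical whitelist: each host maps to the regexes for its known-good
  // paths, so a URL is only tested against the patterns for its own site.
  public class WhitelistLookup {

    private final Map<String, List<Pattern>> byHost =
        new HashMap<String, List<Pattern>>();

    public void allow(String host, String pathRegex) {
      List<Pattern> patterns = byHost.get(host.toLowerCase());
      if (patterns == null) {
        patterns = new ArrayList<Pattern>();
        byHost.put(host.toLowerCase(), patterns);
      }
      patterns.add(Pattern.compile(pathRegex));
    }

    // Same contract as a Nutch URLFilter: return the URL to keep it, null to drop it.
    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost().toLowerCase();
        List<Pattern> patterns = byHost.get(host);
        if (patterns == null) {
          return null;                      // host not whitelisted at all
        }
        for (Pattern p : patterns) {
          if (p.matcher(urlString).find()) {
            return urlString;               // matches a known-good path
          }
        }
      } catch (Exception e) {
        // malformed URL: fall through and reject
      }
      return null;
    }
  }

The win is that the full pattern list is never scanned per URL: each
URL costs one hash lookup plus a few regex tests for that one host.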
BTW, we're currently alpha-testing the site that uses this
plugin, and preparing for a public beta. I'll be sure to post
here when we're finally open for business. :)
--Matt
On Sep 2, 2005, at 11:43 AM, AJ Chen wrote:
From reading http://wiki.apache.org/nutch/DissectingTheNutchCrawler,
it seems that a new urlfilter is a good place to extend the inclusion
regex capability. The new urlfilter will be defined by the
urlfilter.class property, which gets loaded by the URLFilterFactory.
Regexes are necessary because you want to include URLs matching
certain patterns.
Can anybody who has implemented a URLFilter plugin before share some
thoughts about this approach? I expect the new filter must have all
the capabilities that the current RegexURLFilter.java has, so that it
won't require changes in any other classes. The difference is that
the new filter uses a hash table to efficiently look up the regexes
for the included domains (a large number!).
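(A rough sketch of the table-building side, purely illustrative: the
names and the "domain<TAB>regex" input format below are invented, not
an existing Nutch format. The point is that the 100,000 patterns get
grouped by domain once at load time, so each URL is later tested only
against its own domain's patterns instead of the whole list, which is
effectively what RegexURLFilter does today.)

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.regex.Pattern;

  // Hypothetical loader: reads lines of the form "domain<TAB>regex" and groups
  // the compiled patterns by domain, so a lookup touches only one domain's entries.
  public class DomainPatternTable {

    private final Map<String, List<Pattern>> byDomain =
        new HashMap<String, List<Pattern>>();

    public void load(BufferedReader in) throws IOException {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0 || line.startsWith("#")) {
          continue;                              // skip blanks and comments
        }
        int tab = line.indexOf('\t');
        if (tab < 0) {
          continue;                              // malformed line, ignore
        }
        String domain = line.substring(0, tab).toLowerCase();
        String regex = line.substring(tab + 1);
        List<Pattern> patterns = byDomain.get(domain);
        if (patterns == null) {
          patterns = new ArrayList<Pattern>();
          byDomain.put(domain, patterns);
        }
        patterns.add(Pattern.compile(regex));
      }
    }

    // One hash lookup; an unknown domain gets an empty list (i.e. excluded).
    public List<Pattern> patternsFor(String domain) {
      List<Pattern> patterns = byDomain.get(domain.toLowerCase());
      return (patterns != null) ? patterns : Collections.<Pattern>emptyList();
    }
  }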
BTW, I can't find the urlfilter.class property in any of the
configuration files in Nutch-0.7. Does version 0.7 still support the
urlfilter extension? Is there any difference relative to what's
described in the DissectingTheNutchCrawler doc cited above?
Thanks,
AJ
Earl Cahill wrote:
The goal is to avoid entering 100,000 regexes in
crawl-urlfilter.txt and checking ALL of these regexes against each URL.
Any comment?
Sure seems like just a hash lookup table could
handle it. I'm having a hard time seeing when you
really need a regex and a fixed list wouldn't do. Especially if
you have a forward and maybe a backwards
lookup as well in a multi-level hash, to
include/exclude at a certain subdomain level, like
include: com->site->good (for good.site.com stuff)
exclude: com->site->bad (for bad.site.com)
and kind of walk backwards, kind of like DNS. Then
you could just do a few hash lookups instead of
100,000 regexes.
I realize I'm talking about host-level and not page-level
filtering, but if you want to include everything from
your 100,000 sites, I think such a strategy could
work.
Hope this makes sense. Maybe I could write some code
and see if it works in practice. If nothing else,
maybe the hash stuff could just be another filter
option in conf/crawl-urlfilter.txt.
Earl
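(Illustrative only, not Earl's code: a minimal sketch of the
reversed-label walk described above, with made-up names. Domain labels
go into a nested hash and are walked from the TLD inward, DNS-style, so
the most specific include/exclude rule wins and deciding a host costs a
few map lookups rather than 100,000 regex tests.)

  import java.util.HashMap;
  import java.util.Map;

  // Nested hash of domain labels, walked in reverse (com -> site -> good).
  public class DomainTree {

    private static class Node {
      Map<String, Node> children = new HashMap<String, Node>();
      Boolean include;                      // null = no rule at this level
    }

    private final Node root = new Node();

    // e.g. addRule("good.site.com", true); addRule("bad.site.com", false);
    public void addRule(String domain, boolean include) {
      String[] labels = domain.toLowerCase().split("\\.");
      Node node = root;
      for (int i = labels.length - 1; i >= 0; i--) {   // com, then site, then good
        Node child = node.children.get(labels[i]);
        if (child == null) {
          child = new Node();
          node.children.put(labels[i], child);
        }
        node = child;
      }
      node.include = Boolean.valueOf(include);
    }

    // The most specific matching rule wins; hosts with no rule are excluded.
    public boolean accepts(String host) {
      String[] labels = host.toLowerCase().split("\\.");
      Node node = root;
      boolean decision = false;
      for (int i = labels.length - 1; i >= 0; i--) {
        node = node.children.get(labels[i]);
        if (node == null) {
          break;
        }
        if (node.include != null) {
          decision = node.include.booleanValue();
        }
      }
      return decision;
    }
  }

With the two example rules above, accepts("good.site.com") is true,
accepts("bad.site.com") is false, and anything else under site.com is
excluded by default, since no rule was added at the com or site levels.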
--
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, [EMAIL PROTECTED]
---------------------------------------------------
--
Matt Kangas / [EMAIL PROTECTED]