Hi All,
I'm fairly new to Nutch/Hadoop, but I'm starting to get the hang of it. I
followed the Nutch/Hadoop tutorial and, fairly successfully, got my
modified version of Nutch (which downloads Atom feed URLs) up and running.
The system works perfectly in unit tests in Eclipse, but shows the
following strange behavior when I run it on DFS/Hadoop on my Linux
deployment machine.
I have a particular URL family (same host/path structure with a different
parameter) that points to my company's intranet blog entries. When I
bootstrap my crawler with a URL file containing *only* these URLs, the
generator running on DFS/Hadoop can't find any URLs to generate. However,
if I put a single URL from a different host, an external Atom feed, at the
top of the list, the generator quite happily passes all of the feeds on to
the fetcher.
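
For what it's worth, the hostnames below are made up, but structurally my
seed file looks like this:

    http://blogs.intranet.example.com/feed/entries?id=101
    http://blogs.intranet.example.com/feed/entries?id=102
    http://blogs.intranet.example.com/feed/entries?id=103

With only those lines, the generator emits nothing. If I prepend a single
external feed, something like

    http://feeds.external-example.org/blog.atom

then everything, including the intranet URLs, makes it through. I'm
injecting and generating in the usual way (paths here are just examples):

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
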
I've played around quite extensively with the various conf files that
contain URL patterns and tried to make them as accepting as possible; in
particular, I commented out all of the (-) patterns and added a (+)
catch-all at the end (a snippet of what the filter file ends up looking
like is below). However, with the same configuration I don't see this
behavior in the unit tests, so I hesitate to assume that the configuration
files themselves are the problem.
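
In case it helps, here is roughly what the filter file (regex-urlfilter.txt
in my case) ends up looking like after my edits; I'm paraphrasing the stock
(-) rules from memory:

    # skip file:, ftp:, and mailto: urls
    # -^(file|ftp|mailto):

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]

    # accept anything else
    +.
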
Thanks in advance for any help.
- Ben