Hi All,
I'm fairly new to Nutch/Hadoop, but I'm starting to get the hang of it. I
followed the Nutch/Hadoop tutorial and, fairly successfully, got my
modified version of Nutch (which downloads Atom feed URLs) up and running.
The system works perfectly in unit tests in Eclipse, but shows the
following strange behavior when I run it on DFS/Hadoop on my Linux
deployment machine.
I have a particular URL family (same host/path structure with a different
parameter) that points to my company's intranet blog entries. When I
bootstrap my crawler with a URL file containing *only* these URLs, the
generator running on DFS/Hadoop can't find any URLs to generate. However,
if I put a single URL from a different host, an external Atom feed, at the
top of the list, the generator quite happily passes all of the feeds on to
the fetcher.
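
For what it's worth, the hostnames below are made up, but structurally my
seed file looks like this:

    http://blogs.intranet.example.com/feed/entries?id=101
    http://blogs.intranet.example.com/feed/entries?id=102
    http://blogs.intranet.example.com/feed/entries?id=103

With only those lines, the generator emits nothing. If I prepend a single
external feed, something like

    http://feeds.external-example.org/blog.atom

then everything, including the intranet URLs, makes it through. I'm
injecting and generating in the usual way (paths here are just examples):

    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
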
I've played around quite extensively with the various conf files that
contain URL patterns and tried to make them as accepting as possible; in
particular, I commented out all of the (-) patterns and added a (+)
catch-all at the end (a snippet of what the filter file ends up looking
like is below). However, with the same configuration I don't see this
behavior in the unit tests, so I hesitate to assume that the configuration
files themselves are the problem.
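
In case it helps, here is roughly what the filter file (regex-urlfilter.txt
in my case) ends up looking like after my edits; I'm paraphrasing the stock
(-) rules from memory:

    # skip file:, ftp:, and mailto: urls
    # -^(file|ftp|mailto):

    # skip URLs containing certain characters as probable queries, etc.
    # -[?*!@=]

    # accept anything else
    +.
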
Thanks in advance for any help.
- Ben