On Sep 22, 2007, at 3:37 PM, Brian Whitman wrote:
which worked fine and there were many .png files in the dump. However, with trunk, this doesn't seem to be the case anymore. There are no .png files in the linkdb dump, only html (pretty much the only filetype we allow nutch to download.)
More info on this... I noticed that the two readlinkdb outputs, with -noFilter on and off, were identical (diff returned nothing).
I dumped the segment with readseg, and every URL and outlink: line in it is for something that would pass my URL filters.
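For anyone who wants to repeat that check on their own dumps, here is a rough sketch in Python (not part of nutch) that counts the file extensions of the URLs found in a text dump. The function name and the sample lines are made up for illustration. It assumes only that URLs show up as plain http(s) strings somewhere in each line, and does not depend on the exact readseg or readlinkdb output format.

```python
import posixpath
import re
from collections import Counter
from urllib.parse import urlparse

# Matches any run of non-whitespace characters starting with http:// or https://
URL_RE = re.compile(r"https?://\S+")


def count_url_extensions(lines):
    """Count the file extensions of every http(s) URL found in the given lines."""
    counts = Counter()
    for line in lines:
        for url in URL_RE.findall(line):
            # Look only at the path, so query strings do not confuse the extension.
            path = urlparse(url).path
            ext = posixpath.splitext(path)[1].lower()
            counts[ext or "(none)"] += 1
    return counts


# Hypothetical sample lines, NOT real nutch output.
sample = [
    "http://example.com/index.html Inlinks:",
    " fromUrl: http://example.com/about.html anchor: home",
    "outlink: toUrl: http://example.com/logo.png anchor: logo",
]
counts = count_url_extensions(sample)
```

Feeding it the lines of a dump file (for example via `open(path)`) and looking at `counts[".png"]` would show at a glance whether any png links survived.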
It's not an issue with readseg or readlinkdb themselves, because a segment fetched with the older nutch (using the exact same configuration) still yields png links in trunk's readlinkdb. It appears that the fetcher in trunk now only writes URLs that pass the filters into the segment.
I assume this behavior is incorrect because otherwise why would readlinkdb need a -noFilter? Also, this makes it tough to do what I'm trying to do -- have nutch index text but have other things grab the binary files.
Any ideas?
