(Copied from nutch-user; this is more of a dev topic now.)
It's not an issue with readseg or readlinkdb themselves, because a segment fetched with the older nutch (using the exact same configuration) does show png links when read with trunk's readlinkdb. It appears that the fetcher now writes only those outlink URLs that pass the filters into the segment.


I checked the diffs between my old version (mid-December 06) and trunk for ParseOutputFormat. It appears that the parse step now puts the outlink URLs through the URLFilters. I confirmed this by taking .png out of my URLFilters and re-running a crawl: pngs now appear in the readlinkdb output.

1) Was it a bug that URLs that would not pass URLFilters got into the linkdb for analysis?

2) If so, why is there a -noFilter option for readlinkdb? The linkdb has already been filtered whether you like it or not. -noFilter will never have any effect.

There needs to be a way to have the linkdb reflect all URLs (unfiltered) for further analysis. I suggest a -noFilterOutlinks option (default off) for the fetch command, since the default behavior of fetch is to parse. If my theory is correct, this would simply skip the filter call in ParseOutputFormat.
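
The suggested switch can be sketched in a few lines of standalone Java. This is only an illustration of the idea, not Nutch code: the class OutlinkFilterSketch, the methods rejectPng and collectOutlinks, and the noFilterOutlinks parameter are all hypothetical names, and the filter is assumed to return null for a rejected URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class OutlinkFilterSketch {

    // Hypothetical stand-in for a configured URL filter chain:
    // returns the URL if it is accepted, or null if it is rejected.
    public static String rejectPng(String url) {
        return url.toLowerCase().endsWith(".png") ? null : url;
    }

    // Collects the outlinks that would be written to the segment.
    // When noFilterOutlinks is true the filter is never called,
    // so every outlink is kept for later analysis.
    public static List<String> collectOutlinks(List<String> outlinks,
                                               Function<String, String> filter,
                                               boolean noFilterOutlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            String result = noFilterOutlinks ? url : filter.apply(url);
            if (result != null) {
                kept.add(result);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.com/page.html",
                "http://example.com/logo.png");

        // Default behavior: the png outlink is dropped by the filter.
        System.out.println(collectOutlinks(outlinks, OutlinkFilterSketch::rejectPng, false));

        // With the proposed switch on: both outlinks survive.
        System.out.println(collectOutlinks(outlinks, OutlinkFilterSketch::rejectPng, true));
    }
}
```

With the flag off, only the .html link is kept; with it on, the png link is kept as well, which is the behavior the linkdb would need for unfiltered analysis.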



