Jérôme Charron wrote:
For consistency, and for ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file extensions associated with each content-type, we can build a list of file extensions to include in the fetch process (all others will be excluded). No?
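
A minimal sketch of the idea proposed above, to make the discussion concrete. It does not use any Nutch API: the two maps are hypothetical stand-ins for the mime-type to plugin mapping in parse-plugins.xml and for a mime-type to extension table, and the class and method names are invented for illustration. Accepting URLs that have no extension at all is also an assumption, since the proposal does not say how to treat them.

```java
import java.net.URI;
import java.util.*;

// Illustrative only: none of these names come from Nutch.
class ExtensionFilterSketch {

    // Hypothetical stand-in for parse-plugins.xml: content type -> plugin ids.
    static final Map<String, List<String>> PARSE_PLUGINS = Map.of(
        "text/html", List.of("parse-html"),
        "application/pdf", List.of("parse-pdf"),
        "application/msword", List.of("parse-msword"));

    // Hypothetical table: content type -> file extensions usually associated with it.
    static final Map<String, List<String>> EXTENSIONS = Map.of(
        "text/html", List.of("html", "htm"),
        "application/pdf", List.of("pdf"),
        "application/msword", List.of("doc"));

    // Collect the extensions of every content type that at least one active plugin can parse.
    static Set<String> allowedExtensions(Set<String> activePlugins) {
        Set<String> allowed = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : PARSE_PLUGINS.entrySet()) {
            boolean parseable = entry.getValue().stream().anyMatch(activePlugins::contains);
            if (parseable) {
                allowed.addAll(EXTENSIONS.getOrDefault(entry.getKey(), List.of()));
            }
        }
        return allowed;
    }

    // Accept a URL if its path has no extension (assumption) or an allowed one.
    static boolean accept(String url, Set<String> allowed) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return false; // opaque URI such as mailto:, nothing to fetch here
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash) {
            return true; // no extension in the last path segment
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return allowed.contains(extension);
    }

    public static void main(String[] args) {
        Set<String> allowed = allowedExtensions(Set.of("parse-html", "parse-pdf"));
        System.out.println(allowed);
        System.out.println(accept("http://example.com/index.html", allowed));
        System.out.println(accept("http://example.com/page.foo", allowed));
    }
}
```

A filter built this way decides purely on the URL suffix, so any extension missing from the table, such as .foo, is rejected before the server's actual Content-Type is ever seen.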
What about a site that develops a content system whose URLs end in .foo, which we would exclude, even though they return HTML?
Doug