> I'm not necessarily sure that this is a "bug" per se: it's just the fact
> that several different content types are potentially possible when any ol'
> webserver returns an RSS file. To be honest, I performed a pretty detailed
> crawl (100s of thousands of pages) when I originally wrote the plugin way
> back in March/April of this year, and the two content types that the code
> checks for right now are the ones I found to be most commonly returned
> from webservers for RSS. However, in no way did I mean for that list to be
> exhaustive: for instance, web servers may also return "application/rss",
> or "text/rss", or even "text/plain", which I have seen for RSS. It all
> depends on how the webmaster has configured the web server. So it's kind
> of difficult to accurately and reliably discriminate on the content type
> within a parser plugin itself, because what gets returned for a particular
> type of file is inherently out of the parser's hands, and even though
> there are some "best practices" for what should be returned for different
> file types, there are by no means any "standards" that must be followed.
>
> So, I would propose the following. I believe the block of code that checks
> the content type and then throws an exception exists in other plugins in
> Nutch as well. I propose we nix that, and remove the content type checking
> and exception message from the plugins themselves, and move it up to a
> higher level, i.e., the actual plugin factory or something. Let it get
> taken care of there, and let it be configurable, out of the code of each
> plugin for instance. Because that way, I believe you can customize
> whatever plugin to do whatever your need is, *without* having to recompile
> the code just to add another accepted content type to a plugin so it
> doesn't throw an error message.
>
> What say you guys? :-)
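
The proposal quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Nutch's actual plugin API: the class name `AcceptedContentTypes`, the property layout (one key per plugin id, comma-separated content types as the value), and the use of `parse-rss` as a key are all hypothetical. The idea is that a higher-level component such as the plugin factory would consult this registry before handing content to a parser, so the plugins themselves would not need a hard-coded check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

/** Hypothetical registry of accepted content types per plugin (not a Nutch class). */
final class AcceptedContentTypes {
    private final Map<String, Set<String>> typesByPlugin = new HashMap<>();

    /** Builds the registry from entries such as "parse-rss = application/rss, text/rss". */
    static AcceptedContentTypes fromProperties(Properties props) {
        AcceptedContentTypes registry = new AcceptedContentTypes();
        for (String pluginId : props.stringPropertyNames()) {
            Set<String> types = new HashSet<>();
            for (String type : props.getProperty(pluginId).split(",")) {
                String normalized = normalize(type);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
            registry.typesByPlugin.put(pluginId, types);
        }
        return registry;
    }

    /** Returns true if the given plugin is configured to accept this content type. */
    boolean accepts(String pluginId, String contentType) {
        Set<String> types = typesByPlugin.get(pluginId);
        return types != null
                && contentType != null
                && types.contains(normalize(contentType));
    }

    /** Drops parameters such as "; charset=UTF-8", trims, and lower-cases the type. */
    private static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

With something like this in place, accepting another content type for a plugin would be a one-line configuration change rather than a recompile, and the decision of what to do on a mismatch (skip, log, or throw) would live in one place instead of in every plugin.
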

That's consistent with the other discussion on this point:
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
