Hi Marco,

The issue you are having is that the parse-html plugin is being called by default on the content you are trying to parse. This may have to do with the MIME type mappings and the new, improved MIME type detection (which J. Charron worked on) that Nutch currently uses. Basically, there needs to be an entry in the MIME types configuration file that detects that the file is RSS and sets the content type to "application/rss+xml", which will cause the parse-rss content parser to be invoked. The problem for you right now is that it is not being invoked.

The bigger issue, however, is how to get the byte sequences (the so-called "magic characters") in the MIME types configuration file to recognize that a file is in fact an RSS file. With so many different types of valid feeds (RSS 2.0, 0.9, 1.0, Atom, and its many versions), how do you reliably and accurately detect by magic character matching that a file is RSS? The first bytes of the file may be *completely* different across all these valid feed types. The only thing you could probably detect is that the file is of type text/xml. You would then need a way to recognize that it is not only an XML file but also RSS.

So, long story short, let me look into how this could be done with J. Charron's new MIME type system. In the meantime, see if you can get the MIME type system to recognize that the file is in fact XML. If you can, a quick and dirty solution to your problem would be to edit the parse-rss plugin.xml file and change it to handle content type "text/xml" instead of "application/rss+xml", which is what's currently in there. Then, when the code gets called, it will work fine, because I've coded the RSSParser to accept both "application/rss+xml" *and* "text/xml".

Does that make sense? If not, just let me know.

I got your prior email with the info about checking out your system. I have some free time tonight, so I'll give it a look and let you know if I can set that up for you.

Thanks,
Chris Mattmann

-----Original Message-----
From: Marco PV [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 20, 2005 7:24 PM
To: [email protected]
Subject: parse-rss fetch problems

Hi,

I'm using /nutch-nightly from April 18th. I've downloaded and uploaded the latest src/plugin/parse-rss (src) and /plugin/parse-rss (bin). I've also compiled it with "ant", with no errors.
I've edited nutch-default.xml and modified the "parse-(rss|text|html)" entry. But when trying to fetch, it can't parse either .xml or .rss files. I get the error: indexed, but can't parse : content type not text/html; content type is "text/xml".

Should I edit the new mime.type files? What should I do? Please help.

Thanks,
Marco

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
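
Chris's two-step idea (first recognize the content as XML, then decide whether that XML is a feed) can be illustrated with a small sketch. This is not Nutch code and does not use the Nutch MIME type system: `FeedSniffer`, `refineContentType`, and `rootElementName` are hypothetical names, the sketch relies on the JDK's `javax.xml.stream` API, and treating every `rss`, `RDF`, or `feed` root element as "application/rss+xml" is an assumption made only for illustration (an `RDF` root, for instance, does not guarantee RSS 1.0).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper (not part of Nutch): refines "text/xml" by inspecting the root element. */
final class FeedSniffer {

    // Root element local names used by common feed formats (assumed list):
    // <rss> (RSS 0.91 through 2.0), <rdf:RDF> (RSS 0.9 and 1.0), <feed> (Atom).
    private static final Set<String> FEED_ROOTS =
            new HashSet<>(Arrays.asList("rss", "RDF", "feed"));

    /** Returns "application/rss+xml" for XML whose root looks like a feed, else the detected type. */
    static String refineContentType(byte[] content, String detectedType) {
        if (!"text/xml".equals(detectedType)) {
            return detectedType; // only second-guess content already detected as XML
        }
        String root = rootElementName(content);
        return (root != null && FEED_ROOTS.contains(root)) ? "application/rss+xml" : detectedType;
    }

    /** Returns the local name of the first element, or null if the content is not parseable XML. */
    static String rootElementName(byte[] content) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not resolve external entities while sniffing untrusted content.
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(content));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName(); // prefix-free name, e.g. "RDF" for rdf:RDF
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Malformed or non-XML content: fall through and report "unknown".
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] rss = "<rss version=\"2.0\"><channel/></rss>".getBytes(StandardCharsets.UTF_8);
        System.out.println(refineContentType(rss, "text/xml"));
    }
}
```

The point of the sketch is that the decision is made on the parsed root element rather than on the leading bytes, so it does not depend on how each feed format happens to begin. Content that was not detected as "text/xml" in the first place is passed through unchanged.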
