Hi Miguel,

Actually, that ordering is deliberate, unfortunately because of the generic nature of the mime type "text/xml". It turns out that a lot of RSS comes back from the web server with the content type "text/xml", even though it's recommended that "application/rss+xml" be used as the mime type for RSS. Most web server admins don't spend the time to configure this mime type correctly. Further, if you look at the IANA list of mime types, there isn't a mime type specified for RSS (although RDF has application/rdf+xml, which is sometimes used to identify RSS as well).

So when I coded up the parse-plugins.xml file, I noted that text/xml isn't really the standard mime type for RSS; it's just the mime type for any type of XML document, i.e., something that starts out with "<?xml version=.....", which can conform to *any* XML Schema or DTD. Identifying a document as text/xml therefore doesn't really get you anywhere, unfortunately. That's why I set the parse-text plugin to be the highest priority for text/xml, as in my mind it was most suited to handle the generic nature of XML documents. I listed parse-html as 2nd in priority because XHTML is becoming a more popular and pervasive form of content. Finally, parse-rss is last, well, because I think it should be. :-) If you think about it, parse-rss is really only meant to handle RSS feeds, which may or may not come back with the mime type "text/xml".

So, to answer your question: yes, parse-rss is last in the default parse-plugins file. However, this doesn't mean it has to be that way in your file. You are free to modify this list. Remember that order matters: the order in which the plugins are listed underneath a mime type specifies their order of preference during parsing. You can find the full specification of this at:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

which was authored jointly by myself, Jerome Charron, and Sebastien LeCallonec.

One part of fixing this problem is correct mime type identification for document types. I know that Jerome is working on an update to this, and will soon have a new mime type registry committed to Nutch. The other part, however, is deeper than just correct mime type identification. It has to do with understanding the DTD or XML Schema that an XML document conforms to. Only then will we know the "right" parser to call for an XML document. This could be handled in a number of ways; off the top of my head, 2 come to mind:

1. Having a generic "text/xml" reading plugin that could parse out the DTD or XML Schema used by an XML document, and then call the right "sub XML parsing plugin" that knows how to handle that DTD or schema

2. Adding an attribute to the plugin.xml file that specifies the DTD or Schema that an XML parsing plugin supports, and then doing the resolution in a decentralized fashion whenever the mime type "text/xml" is encountered

Anyways, I have been thinking about this for a while, and will start working on a proposal and solution in the near future. For now, if you like, you could create a JIRA issue about this as a "wish" or "improvement" to be worked on in the (near) future.

FYI, here are a few interesting articles on the subject:

http://spazioinwind.libero.it/pierfederici/blog/000056.html
http://www.rassoc.com/gregr/weblog/archive.aspx?post=662

Thanks,
Chris

On 10/18/05 9:36 AM, "Miguel A Paraz" <[EMAIL PROTECTED]> wrote:

> Hi,
> I'm trying to set up Nutch to crawl blogs.
>
> For nutch-site.xml, I added parse-rss to plugin.includes:
> <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|rss)|index-more
> |query-(basic|site|url)</value>
>
>
> and set db.ignore.internal.links to false.
>
> I noticed that in parse-plugins.xml:
>
> <mimeType name="text/xml">
> <plugin id="parse-text" />
> <plugin id="parse-html" />
> <plugin id="parse-rss" />
> </mimeType>
>
> is this by order of priority, and parse-rss is last?
>
> I tried injecting a single URL, my blog feed which is text/xml:
> http://migs.paraz.com/w/feed/
>
> It apparently isn't parsed.
>
> Thanks in advance.

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
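
For reference, here is a minimal sketch of the reordering described above. It assumes the same parse-plugins.xml entry format shown in the quoted message, and only the text/xml entry is shown. Since the order of the plugin elements under a mime type sets the order of preference, moving parse-rss to the top should make it the first parser tried for text/xml content:

```xml
<!-- Sketch only: the text/xml entry from parse-plugins.xml, reordered so
     that parse-rss is preferred. Assumes the three plugins are already
     enabled through plugin.includes in nutch-site.xml. -->
<mimeType name="text/xml">
  <plugin id="parse-rss" />
  <plugin id="parse-text" />
  <plugin id="parse-html" />
</mimeType>
```

The trade-off is that every document served as text/xml, RSS or not, would then be handed to parse-rss first, so this ordering mostly makes sense for a crawl dominated by feeds.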
