Hi there,

That should work. However, the biggest problem will be making sure that "text/xml" is actually the content type of the RSS that you are parsing, which you'll have little or no control over.
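
To make the content type concern concrete, here is a minimal sketch in Python (not Nutch code) of comparing the Content-Type a server actually returns against the types a parser is registered for. The names `base_content_type`, `parser_would_claim`, `served_content_type`, and the `CLAIMED_TYPES` set are hypothetical and only for illustration; the real matching is done inside Nutch and may differ in detail.

```python
from urllib.request import urlopen

# Hypothetical set of content types a parser is registered for.
# In Nutch this would come from the contentType attributes in plugin.xml.
CLAIMED_TYPES = {"application/rss+xml", "text/xml"}


def base_content_type(header_value):
    """Return the media type of a Content-Type header value,
    lowercased and without parameters such as charset."""
    return header_value.split(";", 1)[0].strip().lower()


def parser_would_claim(header_value, claimed=CLAIMED_TYPES):
    """True if the served media type is one of the claimed types."""
    return base_content_type(header_value) in claimed


def served_content_type(url):
    """Fetch the URL and return the raw Content-Type header the server
    sent (empty string if the header is missing). Needs network access."""
    with urlopen(url) as resp:
        return resp.headers.get("Content-Type", "")
```

Running `parser_would_claim(served_content_type(feed_url))` against a handful of the feeds you plan to crawl should show quickly how many of them come back with a type the plugin is not configured for.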
Check out this previous post of mine on the list to get a better idea of what the real issue is:

http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!

Cheers,
Chris

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
Phone: 818-354-8810
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----
> From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
> Sent: Saturday, February 04, 2006 11:40 PM
> To: [email protected]
> Subject: Re: Which version of rss does parse-rss plugin support?
>
> Hi Chris
>
> How do I change the plugin.xml? For example, if I want to crawl RSS files
> that end with "xml", do I just add a new element?
>
> <implementation id="org.apache.nutch.parse.rss.RSSParser"
>                 class="org.apache.nutch.parse.rss.RSSParser"
>                 contentType="application/rss+xml"
>                 pathSuffix="rss"/>
> <implementation id="org.apache.nutch.parse.rss.RSSParser"
>                 class="org.apache.nutch.parse.rss.RSSParser"
>                 contentType="application/rss+xml"
>                 pathSuffix="xml"/>
>
> Am I right?
>
> On 06-2-3, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> >
> > Hi there,
> >
> > Sure it will, you just have to configure it to do that. Pop over to
> > $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there is
> > an attribute called "pathSuffix". Change that to handle whatever type of
> > RSS file you want to crawl. That will work locally. For web-based crawls,
> > you need to make sure that the content type being returned for your RSS
> > content matches the content type specified in the plugin.xml file that
> > parse-rss claims to support.
> >
> > Note that you might not have *a lot* of success with being able to
> > control the content type for RSS files returned by web servers. I've seen
> > a LOT of inconsistency in the way that they're configured by the
> > administrators, etc. However, just to let you know, there are some people
> > in the group that are working on a solution to address this.
> >
> > Hope that helps.
> >
> > Cheers,
> > Chris
> >
> > On 2/3/06 7:16 AM, "盖世豪侠" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Chris,
> > >
> > > The files of RSS 1.0 have a postfix of rdf. So will the parser recognize
> > > them automatically as RSS files?
> > >
> > > On 06-2-3, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > >>
> > >> Hi there,
> > >>
> > >> parse-rss is based on commons-feedparser
> > >> (http://jakarta.apache.org/commons/sandbox/feedparser). From the
> > >> feedparser website:
> > >>
> > >> "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92,
> > >> 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
> > >> extension and RSS 1.0 modules capability..."
> > >>
> > >> Hope that helps.
> > >>
> > >> Thanks,
> > >> Chris
> > >>
> > >> On 2/3/06 6:46 AM, "盖世豪侠" <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> I see the test file is of version 0.91.
> > >>> Does the plugin support higher versions like 1.0 or 2.0?
> > >>>
> > >>> --
> > >>> 《盖世豪侠》 won rave reviews and kept TVB's ratings high, yet TVB,
> > >>> pleased as it was, still did not give him major roles. Stephen Chow
> > >>> was never one to stay in a small pond. With his comedic talent
> > >>> revealed, he would not accept being sidelined, so he moved to the
> > >>> film industry and showed his brilliance on the big screen. TVB had
> > >>> gained a thousand-li horse and then lost it, and naturally regretted
> > >>> it too late.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
