Hi Chris,

I have read the code of your parse-rss plugin, where you said that "the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed". So the titles of the RSS channels are what gets delivered for indexing, right? If I want the indexer to include more information about an RSS file (such as item descriptions), can I just concatenate that onto the contentTitle?

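To make the question concrete, here is a minimal sketch of the kind of concatenation I have in mind. It is not taken from parse-rss: the Channel and Item classes and the buildIndexText method are hypothetical stand-ins, and I am only assuming that whatever string ends up in the contentTitle is what the indexer sees.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from parse-rss or Nutch.
public class RssIndexTextSketch {

    /** Minimal stand-in for one RSS item (hypothetical type). */
    static class Item {
        final String title;
        final String description;

        Item(String title, String description) {
            this.title = title;
            this.description = description;
        }
    }

    /** Minimal stand-in for one RSS channel (hypothetical type). */
    static class Channel {
        final String title;
        final List<Item> items;

        Channel(String title, List<Item> items) {
            this.title = title;
            this.items = items;
        }
    }

    /**
     * Builds one space-separated string: each channel title followed by
     * the descriptions of that channel's items. Null or blank pieces are skipped.
     */
    static String buildIndexText(List<Channel> channels) {
        StringBuilder sb = new StringBuilder();
        for (Channel channel : channels) {
            appendPiece(sb, channel.title);
            for (Item item : channel.items) {
                appendPiece(sb, item.description);
            }
        }
        return sb.toString();
    }

    /** Appends a trimmed piece, inserting a single space between pieces. */
    private static void appendPiece(StringBuilder sb, String piece) {
        if (piece == null) {
            return;
        }
        String trimmed = piece.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        if (sb.length() > 0) {
            sb.append(' ');
        }
        sb.append(trimmed);
    }

    public static void main(String[] args) {
        Channel channel = new Channel("Example Feed", Arrays.asList(
                new Item("First", "First item summary."),
                new Item("Second", "Second item summary.")));
        System.out.println(buildIndexText(Arrays.asList(channel)));
    }
}
```

If the indexer treats the title and the body text differently, it may be better to keep the descriptions out of the title and put them in the parsed text instead; I don't know which the plugin intends.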
On 06-2-6, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> That should work; however, the biggest problem will be making sure that
> "text/xml" is actually the content type of the RSS that you are parsing,
> which you'll have little or no control over.
>
> Check out this previous post of mine on the list to get a better idea of
> what the real issue is:
>
> http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
>
> G'luck!
>
> Cheers,
> Chris
>
> ______________________________________________
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory Pasadena, CA
> Office: 171-266B Mailstop: 171-246
> Phone: 818-354-8810
> _______________________________________________________
>
> Disclaimer: The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
> > -----Original Message-----
> > From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, February 04, 2006 11:40 PM
> > To: [email protected]
> > Subject: Re: Which version of rss does parse-rss plugin support?
> >
> > Hi Chris,
> >
> > How do I change the plugin.xml? For example, if I want to crawl RSS
> > files that end with "xml", do I just add a new element?
> >
> > <implementation id="org.apache.nutch.parse.rss.RSSParser"
> >                 class="org.apache.nutch.parse.rss.RSSParser"
> >                 contentType="application/rss+xml"
> >                 pathSuffix="rss"/>
> > <implementation id="org.apache.nutch.parse.rss.RSSParser"
> >                 class="org.apache.nutch.parse.rss.RSSParser"
> >                 contentType="application/rss+xml"
> >                 pathSuffix="xml"/>
> >
> > Am I right?
> >
> > On 06-2-3, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi there,
> > >
> > > Sure it will, you just have to configure it to do that. Pop over to
> > > $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there,
> > > there is an attribute called "pathSuffix". Change that to handle
> > > whatever type of RSS file you want to crawl. That will work locally.
> > > For web-based crawls, you need to make sure that the content type
> > > being returned for your RSS content matches the content type
> > > specified in the plugin.xml file that parse-rss claims to support.
> > >
> > > Note that you might not have *a lot* of success with being able to
> > > control the content type for RSS files returned by web servers. I've
> > > seen a LOT of inconsistency among the way that they're configured by
> > > the administrators, etc. However, just to let you know, there are
> > > some people in the group that are working on a solution to
> > > addressing this.
> > >
> > > Hope that helps.
> > >
> > > Cheers,
> > > Chris
> > >
> > > On 2/3/06 7:16 AM, "盖世豪侠" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Chris,
> > > >
> > > > RSS 1.0 files have the suffix "rdf". So will the parser recognize
> > > > them automatically as RSS files?
> > > >
> > > > On 06-2-3, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> Hi there,
> > > >>
> > > >> parse-rss is based on commons-feedparser
> > > >> (http://jakarta.apache.org/commons/sandbox/feedparser). From the
> > > >> feedparser website:
> > > >>
> > > >> "...commons-feedparser supports all versions of RSS (0.9, 0.91,
> > > >> 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as
> > > >> easy ad hoc extension and RSS 1.0 modules capability..."
> > > >>
> > > >> Hope that helps.
> > > >>
> > > >> Thanks,
> > > >> Chris
> > > >>
> > > >> On 2/3/06 6:46 AM, "盖世豪侠" <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >>> I see the test file is of version 0.91.
> > > >>> Does the plugin support higher versions like 1.0 or 2.0?

--
《盖世豪侠》 drew rave reviews and kept TVB's ratings high, yet for all its delight TVB still did not give him major roles. Stephen Chow was never one to stay in a small pond: once his talent for comedy had shown itself, he would not accept being sidelined, so he moved into film and shone on the big screen. TVB had gained a thousand-li horse and then lost it, and naturally regretted it when it was too late.
