[Nutch-general] Re: Which version of rss does parse-rss plugin support?

Elwin Fri, 10 Feb 2006 07:15:03 -0800

Hi Chris

  I have read the code of your parse-rss plugin, where you say that
  the contentTitle will be a concatenation of the titles of the RSS Channels
that we've parsed.
  So the titles of the RSS Channels are what is delivered for indexing, right?
  If I want the indexer to include more information about an RSS file (such
as item descriptions), can I just concatenate it to the contentTitle?
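
The concatenation asked about above could look roughly like the sketch below.
It is only an illustration under assumptions: FeedChannel, FeedItem and
RssIndexText are hypothetical stand-ins invented for this example, not classes
from parse-rss, commons-feedparser or Nutch, and the real plugin may assemble
its text differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for parsed feed data; NOT parse-rss or Nutch classes.
final class FeedItem {
    final String title;
    final String description;

    FeedItem(String title, String description) {
        this.title = title;
        this.description = description;
    }
}

final class FeedChannel {
    final String title;
    final List<FeedItem> items;

    FeedChannel(String title, List<FeedItem> items) {
        this.title = title;
        this.items = items;
    }
}

final class RssIndexText {
    // Joins each channel title, followed by the descriptions of its items,
    // with single spaces. Null or blank fields are skipped.
    static String build(List<FeedChannel> channels) {
        List<String> parts = new ArrayList<>();
        for (FeedChannel channel : channels) {
            addIfPresent(parts, channel.title);
            for (FeedItem item : channel.items) {
                addIfPresent(parts, item.description);
            }
        }
        return String.join(" ", parts);
    }

    private static void addIfPresent(List<String> parts, String value) {
        if (value != null && !value.trim().isEmpty()) {
            parts.add(value.trim());
        }
    }

    public static void main(String[] args) {
        FeedChannel channel = new FeedChannel("Example Channel", Arrays.asList(
                new FeedItem("First", "First item description"),
                new FeedItem("Second", null)));
        System.out.println(build(Arrays.asList(channel)));
    }
}
```

Run as written, main prints the channel title followed by the one non-empty
item description. Whether such text belongs in the title field or in a
separate text field is a design choice this sketch does not settle.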



On 2006-02-06, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
>   That should work: however, the biggest problem will be making sure that
> "text/xml" is actually the content type of the RSS that you are parsing,
> which you'll have little or no control over.
>
> Check out this previous post of mine on the list to get a better idea of
> what the real issue is:
>
> http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
>
> G'luck!
>
> Cheers,
> Chris
>
>
> ______________________________________________
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
>
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                        Mailstop:  171-246
> Phone:  818-354-8810
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
> > -----Original Message-----
> > From: 盖世豪侠 [mailto:[EMAIL PROTECTED]
> > Sent: Saturday, February 04, 2006 11:40 PM
> > To: [email protected]
> > Subject: Re: Which version of rss does parse-rss plugin support?
> >
> > Hi Chris
> >
> >
> > How do I change the plugin.xml? For example, if I want to crawl rss files
> > that end with "xml", do I just add a new element?
> >
> >       <implementation id="org.apache.nutch.parse.rss.RSSParser"
> >                       class="org.apache.nutch.parse.rss.RSSParser"
> >                       contentType="application/rss+xml"
> >                       pathSuffix="rss"/>
> >       <implementation id="org.apache.nutch.parse.rss.RSSParser"
> >                       class="org.apache.nutch.parse.rss.RSSParser"
> >                       contentType="application/rss+xml"
> >                       pathSuffix="xml"/>
> >
> > Am I right?
> >
> >
> >
> > On 2006-02-03, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi there,
> > > Sure it will, you just have to configure it to do that. Pop over to
> > > $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there
> > > is an attribute called "pathSuffix". Change that to handle whatever type
> > > of rss file you want to crawl. That will work locally. For web-based
> > > crawls, you need to make sure that the content type being returned for
> > > your RSS content matches the content type specified in the plugin.xml
> > > file that parse-rss claims to support.
> > >
> > > Note that you might not have * a lot * of success with being able to
> > > control the content type for rss files returned by web servers. I've
> > > seen a LOT of inconsistency among the way that they're configured by the
> > > administrators, etc. However, just to let you know, there are some
> > > people in the group that are working on a solution to addressing this.
> > >
> > > Hope that helps.
> > >
> > > Cheers,
> > > Chris
> > >
> > >
> > >
> > > On 2/3/06 7:16 AM, "盖世豪侠" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi *Chris,*
> > > >
> > > > The files of RSS 1.0 have a postfix of rdf. So will the parser
> > > > recognize it automatically as an rss file?
> > > >
> > > >
> > > > On 2006-02-03, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >> Hi there,
> > > >>
> > > >> parse-rss is based on commons-feedparser
> > > >> (http://jakarta.apache.org/commons/sandbox/feedparser). From the
> > > >> feedparser
> > > >> website:
> > > >>
> > > >> "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92,
> > > >> 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc
> > > >> extension and RSS 1.0 modules capability..."
> > > >>
> > > >> Hope that helps.
> > > >>
> > > >> Thanks,
> > > >> Chris
> > > >>
> > > >>
> > > >> On 2/3/06 6:46 AM, "盖世豪侠" <[EMAIL PROTECTED]> wrote:
> > > >>
> > > >>> I see the test file is of version 0.91.
> > > >>> Does the plugin support higher versions like 1.0 or 2.0?
> > > >>>
> > > >>> --
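> > > >>> "The Final Combat" (《盖世豪侠》) won rave reviews and kept TVB's
> > > >>> ratings high, yet TVB, pleased as it was, still gave Stephen Chow no
> > > >>> major roles. Stephen Chow was never one to stay in a small pond: once
> > > >>> his comic talent had shown itself, he would not accept being
> > > >>> sidelined, so he moved into film and made his mark on the big screen.
> > > >>> TVB found a thoroughbred and then lost it, and naturally regretted it.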
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> > >
> >
> >
>
>


--
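"The Final Combat" (《盖世豪侠》) won rave reviews and kept TVB's ratings
high, yet TVB, pleased as it was, still gave Stephen Chow no major roles.
Stephen Chow was never one to stay in a small pond: once his comic talent
had shown itself, he would not accept being sidelined, so he moved into film
and made his mark on the big screen. TVB found a thoroughbred and then lost
it, and naturally regretted it.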
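
The matching behaviour described in the quoted replies (pathSuffix for local
files, contentType for web-based crawls) can be pictured with the sketch
below. It is a simplified illustration under assumptions, not Nutch's actual
parser-selection code: ParserEntry and ParserSelection are hypothetical names,
and the rule "use the served content type if there is one, otherwise fall back
to the path suffix" is only one reading of the description in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical model of one <implementation> element in plugin.xml; not a Nutch class.
final class ParserEntry {
    final String parserClass;
    final String contentType;
    final String pathSuffix;

    ParserEntry(String parserClass, String contentType, String pathSuffix) {
        this.parserClass = parserClass;
        this.contentType = contentType;
        this.pathSuffix = pathSuffix;
    }
}

final class ParserSelection {
    // Assumed rule: if a content type was served, only contentType is compared
    // (parameters such as "; charset=..." are ignored). Without a content type,
    // e.g. for a local file, the URL's suffix is compared with pathSuffix.
    // Returns null when no entry matches.
    static ParserEntry select(List<ParserEntry> entries, String servedContentType, String url) {
        if (servedContentType != null) {
            String type = servedContentType.toLowerCase(Locale.ROOT);
            int semicolon = type.indexOf(';');
            if (semicolon >= 0) {
                type = type.substring(0, semicolon);
            }
            type = type.trim();
            for (ParserEntry entry : entries) {
                if (entry.contentType.equals(type)) {
                    return entry;
                }
            }
            return null;
        }
        String lowerUrl = url.toLowerCase(Locale.ROOT);
        for (ParserEntry entry : entries) {
            if (lowerUrl.endsWith("." + entry.pathSuffix)) {
                return entry;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<ParserEntry> entries = Arrays.asList(new ParserEntry(
                "org.apache.nutch.parse.rss.RSSParser", "application/rss+xml", "rss"));
        // A feed served as text/xml is not matched, whatever its suffix.
        System.out.println(select(entries, "text/xml", "http://example.com/feed.rss") == null);
    }
}
```

Under these assumptions, a feed that a web server labels text/xml finds no
parser even though its URL ends in .rss, which is the inconsistency the quoted
replies warn about.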
