Hi there,

  From your explanation below, it seems that parse-rss as it exists would
address what you are trying to do. parse-rss parses an RSS channel as a set
of items and indexes the overall metadata about the RSS file, including the
parse text and index data. It also adds the URL of each item in the channel
as an Outlink, so that Nutch will process those pieces of content as well.
The only thing you suggest below that parse-rss currently doesn't do is
let you associate the metadata fields category: and author: with the item
Outlink...
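
To make the missing piece concrete, here is a minimal sketch of splitting a feed into one record per item, keyed by the item's link, with category and author carried along. It is a standalone Python illustration using only the standard library, not Nutch or parse-rss code; the function name items_as_documents, the SAMPLE_RSS wrapper, and the channel title are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the <item> mirrors the one quoted in this thread;
# the surrounding <rss>/<channel> wrapper and channel title are assumed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>example channel</title>
    <item>
      <title>nutch--open source</title>
      <description>nutch nutch nutch ....nutch nutch</description>
      <link>http://lucene.apache.org/nutch</link>
      <category>news</category>
      <author>kauu</author>
    </item>
  </channel>
</rss>"""


def items_as_documents(rss_text):
    """Return one dict of fields per <item>, keyed by the item's <link> URL."""
    root = ET.fromstring(rss_text)
    docs = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # without a URL there is no unique key to store under
        docs[link] = {
            field: (item.findtext(field) or "").strip()
            for field in ("title", "description", "category", "author")
        }
    return docs


if __name__ == "__main__":
    for url, fields in items_as_documents(SAMPLE_RSS).items():
        print(url, fields)
```

Keying on the link matches the idea of using the item's URL as the document's unique key; how such per-item fields would actually be attached to an Outlink inside Nutch is left open here.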

Cheers,
  Chris



On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:

> Thanks for your reply.
> Maybe I didn't explain it clearly.
> I want to index each item as an individual page. Then when I search for
> something, for example "nutch-open source", Nutch returns a hit that
> contains:
>
>    title : nutch-open source
>    description : nutch nutch nutch ....nutch  nutch
>    url : http://lucene.apache.org/nutch
>    category : news
>    author  : kauu
>
> So, can the parse-rss plugin satisfy what I need?

> <item>
>     <title>nutch--open source</title>
>     <description>
>         nutch nutch nutch ....nutch nutch
>     </description>
>     <link>http://lucene.apache.org/nutch</link>
>     <category>news</category>
>     <author>kauu</author>
> </item>



> On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> I could most likely be of assistance if you gave me some more
> information. For instance: I'm wondering if the use case you describe
> below is already supported by the current RSS parse plugin?
>
> The current RSS parser, parse-rss, does in fact index individual items
> that are pointed to by an RSS document. The items are added as Nutch
> Outlinks, and added to the overall queue of URLs to fetch. Doesn't this
> satisfy what you mention below? Or am I missing something?
>
> Cheers,
>   Chris
>
>
>
> On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:
>
> > Hi folks:
> >
> >   What I want to do is separate an RSS file into several pages.
> >
> >   Just as has been discussed before, I want to fetch an RSS page and
> > index it as separate documents in the index, so that the searcher can
> > return each item's info as an individual hit.
> >
> >   My idea is to create a protocol that fetches the RSS page and stores
> > it as several pages, each containing just one ITEM tag. But the unique
> > key is the URL, so how can I store them with each ITEM's link tag as the
> > unique key for a document?
> >
> >   So my question is how to implement this in nutch-0.8.x.
> >
> >   I've checked the code of the protocol-http plug-in, but I can't find
> > the code that stores a page as a document. I want to split the RSS page
> > into several pages before it is stored, so that it is stored as several
> > documents rather than one.
> >
> >   Can anyone give me some hints?
> >
> > Any reply will be appreciated!
> >
> >   ITEM's structure
> >
> >  <item>
> >     <title>Late-striking European blizzard causes flight delays and traffic chaos (photos)</title>
> >     <description>Blizzard sweeps across Europe, causing many flight
> >     delays. On January 24, several airliners wait at Stuttgart Airport
> >     in Germany to have ice and snow removed from their fuselages. On
> >     January 24, workers clear snow from the runway at Munich Airport in
> >     southern Germany. According to reports, the late-arriving blizzard
> >     swept for two consecutive days across...
> >     </description>
> >     <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> >     <category>Sohu Focus Photo News</category>
> >     <author>[EMAIL PROTECTED]</author>
> >     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> >     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> >  </item>


--
www.babatu.com


