Hi there,

On 1/30/07 7:00 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Chris,
>
> I saw your name associated with the rss parser in nutch. My understanding is
> that nutch is using feedparser. I had two questions:
>
> 1. Have you looked at vtd as an rss parser?

I haven't, in fact. What are its benefits over those of commons-feedparser?

> 2. Any view on asynchronous communication as the underlying protocol? I do
> not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication affords you when parsing
RSS feeds: what type of communication are you talking about above? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's RSS parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom/Feed XML documents. When you mention asynchronous above, are
you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
  Chris

>
> Thanks
>
>
> -----Original Message-----
> From: Chris Mattmann <[EMAIL PROTECTED]>
> Date: Tue, 30 Jan 2007 18:16:44
> To:<nutch-dev@lucene.apache.org>
> Subject: Re: RSS-fecter and index individul-how can i realize this function
>
> Hi there,
>
> I could most likely be of assistance, if you gave me some more information.
> For instance: I'm wondering if the use case you describe below is already
> supported by the current RSS parse plugin?
>
> The current RSS parser, parse-rss, does in fact index individual items that
> are pointed to by an RSS document. The items are added as Nutch Outlinks,
> and added to the overall queue of URLs to fetch. Doesn't this satisfy what
> you mention below? Or am I missing something?
>
> Cheers,
> Chris
>
>
>
> On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:
>
>> Hi folks:
>>
>> What I want to do is to separate an RSS file into several pages, just as
>> has been discussed before. I want to fetch an RSS page and index
>> it as different documents in the index.
>> So the searcher can search the Item’s info as an individual hit.
>>
>> My idea is to create a protocol that fetches the RSS page and stores it as
>> several pages, each containing just one ITEM tag. But the unique key is the
>> URL, so how can I store them with the ITEM’s link tag as the unique key for
>> a document?
>>
>> So my question is how to realize this function in nutch-0.8.x.
>>
>> I’ve checked the code of the protocol-http plug-in, but I can’t find the
>> code that stores a page as a document. I want to split the RSS page into
>> several pages before storing it, so that it is stored as several documents
>> rather than one.
>>
>> Can anyone give me some hints?
>>
>> Any reply will be appreciated!
>>
>>
>> ITEM’s structure
>>
>> <item>
>>   <title>Europe's snowstorm strikes late, causing flight delays and
>>   traffic chaos (photos)</title>
>>   <description>A snowstorm sweeps across Europe, causing numerous flight
>>   delays. On January 24, several airliners wait at Stuttgart Airport in
>>   Germany to have ice and snow removed from their fuselages. On January 24,
>>   workers clear snow from the runway at Munich Airport in southern Germany.
>>   According to reports, the belated snowstorm swept for two consecutive
>>   days across...
>>   </description>
>>   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>>   <category>Sohu Focus Photo News</category>
>>   <author>[EMAIL PROTECTED]</author>
>>   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>>   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
>> </item>
>>
>
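
For anyone following the thread, here is a minimal sketch of the splitting
step kauu describes: take one RSS document and produce one record per item,
keyed by the item's link. It uses only the JDK's built-in DOM parser, not
Nutch or commons-feedparser, and the names RssItemSplitter and splitItems are
hypothetical. How such records would be handed to Nutch's parse or index
plugins is not shown and would need to be worked out against the real
plugin API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper for illustration; not part of Nutch or commons-feedparser. */
final class RssItemSplitter {

    private RssItemSplitter() {}

    /**
     * Splits an RSS document into one entry per item element.
     * Key: the text of the item's link element (the per-item unique key).
     * Value: the text of the item's title element, or "" if it has none.
     * Items without a link are skipped because they have no usable key.
     */
    static Map<String, String> splitItems(String rssXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            // LinkedHashMap keeps the items in feed order.
            Map<String, String> byLink = new LinkedHashMap<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String link = childText(item, "link");
                if (link.isEmpty()) {
                    continue;
                }
                byLink.put(link, childText(item, "title"));
            }
            return byLink;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse RSS document", e);
        }
    }

    /** Returns the trimmed text of the first descendant with the given tag, or "". */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

Calling RssItemSplitter.splitItems(feedXml) on a fetched feed gives one map
entry per item, and each key could then serve as the per-item unique key that
kauu asks about. A real implementation would likely carry more fields
(description, pubDate, category) than just the title.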