I think it's a great idea.
We can't have more than one document in the index with the same URL, because
the unique key is the URL.
The only problem is how to write a separate protocol for RSS.
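
To make the unique-key constraint concrete, here is a minimal Python sketch. It is not Nutch code: the dict only stands in for the index, and the "#item-N" suffix, the feed URL, and the function names are assumptions made up for illustration. It shows that documents stored under the same URL replace each other, while a per-item suffix keeps them apart.

```python
def index_documents(docs):
    """Toy index: maps a unique key (the URL) to the document text."""
    index = {}
    for url, text in docs:
        # A later document with the same URL replaces the earlier one.
        index[url] = text
    return index


def item_url(feed_url, n):
    """Hypothetical scheme: append a sequence number to the feed URL."""
    return f"{feed_url}#item-{n}"


feed = "http://example.com/rss.xml"  # placeholder feed URL
items = ["first story", "second story", "third story"]

# All items keyed by the feed URL: only the last one survives.
same_key = index_documents((feed, text) for text in items)

# Each item keyed by feed URL plus sequence number: all of them survive.
per_item = index_documents(
    (item_url(feed, n), text) for n, text in enumerate(items)
)

print(len(same_key), len(per_item))
```

Whether a fragment-style suffix would survive Nutch's URL normalizers and filters is a separate question that the sketch does not address.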



On 1/28/07, Alan Tanaman <[EMAIL PROTECTED]> wrote:

This is a problem that we have encountered too (although in a different
context than RSS).  The problem is that the "unique key" is the URL - you
cannot have more than one document in the index with the same URL.

The way around this might be to have a separate protocol (instead of the
usual http one) that will be used only for RSS feeds, and which will append
a sequential number to the real URL (passing, say, 10 identical copies of
each page to parse-rss).  parse-rss would then need to extract only the
nth news item from the whole page.
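
A rough sketch of that idea in Python, only to illustrate the logic. It does not use the Nutch protocol or parser plugin APIs; the "#item-N" suffix, the sample feed, and the function names are assumptions. Given a numbered URL, it recovers the real URL and the item number, then pulls just that item out of the feed.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in feed; a real feed would come from the fetched page.
FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>http://example.com/1</link></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

SUFFIX = "#item-"  # assumed convention for the appended sequence number


def split_numbered_url(url):
    """Split 'real-url#item-N' into (real_url, N)."""
    if SUFFIX not in url:
        raise ValueError(f"no sequence number in {url!r}")
    real_url, _, n = url.rpartition(SUFFIX)
    return real_url, int(n)


def nth_item(feed_xml, n):
    """Return the nth <item> (0-based) as a dict of tag -> text, or None."""
    items = ET.fromstring(feed_xml).findall("./channel/item")
    if n < 0 or n >= len(items):
        return None  # this copy of the page has no matching item
    return {child.tag: (child.text or "").strip() for child in items[n]}


real_url, n = split_numbered_url("http://example.com/rss.xml#item-1")
print(real_url, nth_item(FEED, n))
```

Returning None for an out-of-range number covers the case where a fixed count of copies (say 10) is passed along but the feed holds fewer items.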

Any comments?

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]
Sent: 27 January 2007 06:43
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: parse-rss make them items as different pages

Can anyone tell me where and how a Nutch document is built in nutch-0.8.1?

For example, one HTML page is one document, but I want to split a single
document into several.

On 1/27/07, kauu <[EMAIL PROTECTED]> wrote:
>
> That's right.
>
> I think we should do something when Nutch fetches a page successfully:
> check whether it is an RSS feed and, if so, create as many pages as there
> are items. I don't know whether that would work.
> On the other hand, we could do something in the segment, as you say.
>
>
> I don't know whether we can write a plugin to get this functionality.
>
> Can anyone give me a hint?
>
> On 1/26/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:
> >
> > Hi Kauu,
> >
> > The functionality you require doesn't exist in the current parse-rss
> > plugin. I need the same functionality, and I believe implementing it
> > is not a simple task.
> >
> > What is required, basically, is to create a page in a segment for each
> > item and add its URL to the crawldb.
> >
> > Since the data already exists in the item element, there is no reason to
> > "fetch" the page (item). After that, the only thing left is to index it.
> >
> > Any thoughts on how to achieve that goal?
> >
> > Gal.
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: kauu [mailto:[EMAIL PROTECTED]
> > Sent: Friday, January 26, 2007 4:17 AM
> > To: nutch-dev@lucene.apache.org
> > Subject: parse-rss make them items as different pages
> >
> > I want to crawl RSS feeds, parse them, and index them, so that when I
> > search the content each hit looks like an individual page.
> >
> >
> > I don't know whether I am explaining this clearly.
> >
> > <item>
> >     <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
> >     <description>暴风雪横扫欧洲,导致多次航班延误
> > 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。
> > 据报道,迟来的暴风雪连续两天横扫中...
> >     </description>
> >     <link>http://news.sohu.com/20070125/n247833568.shtml </link>
> >     <category>搜狐焦点图新闻</category>
> >     <author>[EMAIL PROTECTED]</author>
> >     <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> >     <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> > </item>
> >
> > This is one item in an RSS file.
> >
> > I want Nutch to treat each item as an individual page,
> >
> > so that when I search for something in this item, Nutch returns it as a hit.
> >
> > Can anyone tell me how to do this?
> > Any reply will be appreciated.
> >
> > --
> > www.babatu.com
> >
>
>
>
> --
> www.babatu.com




--
www.babatu.com




_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
