> Actually, this isn't entirely the case. parse-rss actually indexes the
> item text (see line 148 in RSSParser.java) as well. Additionally,
> parse-rss adds the individual item links to the Outlinks (see lines 161
> and 163 in RSSParser.java), and they get crawled as well, in addition to
> the channel text (see line 123 in RSSParser.java) and channel outlink
> (see lines 130 and 132 in RSSParser.java).

Yep, maybe I wasn't clear enough. Sorry Chris ;)
RSSParser does read the items, and it lets you index their concatenated
text.
But the items are not returned individually, so they can't be indexed
individually right away.
If you decide to fetch and parse each item "link", though, parse-rss
does return all the links.
You could then extract the item text, or do other parsing, for each
individual item page.
Sorry if I confused anyone.

Personally, I am focusing only on RSS, and I am trying to index as much
as I can directly from the RSS feed, to avoid having to extract the item
text from the full HTML page. Of course, that limits me to whatever is
in the feed.
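
To make the "index straight from the feed" idea above concrete, here is a minimal sketch of pulling each RSS <item> out as its own record, so that items can be indexed one by one instead of as one concatenated text. It uses only the JDK's built-in DOM parser; it is not the parse-rss code, and the class and method names (FeedItemSketch, parseItems, childText) are made up for illustration. It also assumes a plain RSS 2.0 layout with <title>, <link> and <description> under each <item>.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of Nutch or parse-rss.
class FeedItemSketch {

    // One record per RSS <item>, so each one can be indexed on its own.
    static final class Item {
        final String title;
        final String link;
        final String description;

        Item(String title, String link, String description) {
            this.title = title;
            this.link = link;
            this.description = description;
        }
    }

    // Parses an RSS 2.0 document and returns its items individually.
    static List<Item> parseItems(String rssXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Feeds are untrusted input, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

        List<Item> items = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("item");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element item = (Element) nodes.item(i);
            items.add(new Item(
                    childText(item, "title"),
                    childText(item, "link"),
                    childText(item, "description")));
        }
        return items;
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rss version=\"2.0\"><channel><title>Demo</title>"
                + "<item><title>First</title><link>http://example.com/1</link>"
                + "<description>Text of the first item</description></item>"
                + "</channel></rss>";
        for (Item item : parseItems(xml)) {
            System.out.println(item.link + " | " + item.title + " | " + item.description);
        }
    }
}
```

Each Item could then be turned into its own index document keyed by its link, which is the part the concatenated-text approach does not give you.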


> I haven't really noticed any formats not really handled by
> commons-feedparser. What formats have you noticed that it doesn't
> handle?

I think I had problems with Atom <content> in feeds like this one:
http://meetvinz.blogspot.com/atom.xml
and with RSS <content:encoded>, for instance in
http://feeds.feedburner.com/TechCrunch

Was it my mistake?
If it was, I'd love to go back to feedparser, as it is apparently faster
than ROME. ;)
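
For reference, the two elements mentioned above live in their own XML namespaces, which is one common reason a parser misses them. The sketch below shows one way to read them with the JDK's namespace-aware DOM parser. It says nothing about how commons-feedparser or ROME handle these elements; the class and method names (RichContentSketch, extractBodies, collect) are made up, and the namespace URIs are assumptions: the RSS content module and Atom 1.0. A feed using an older Atom namespace would not match.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not taken from any feed library.
class RichContentSketch {

    // Assumed namespaces: RSS 1.0 content module and Atom 1.0.
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Returns the text of every <content:encoded> and Atom <content> element.
    static List<String> extractBodies(String feedXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Without this, prefixed elements cannot be looked up by namespace.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(feedXml.getBytes(StandardCharsets.UTF_8)));

        List<String> bodies = new ArrayList<>();
        collect(doc.getElementsByTagNameNS(CONTENT_NS, "encoded"), bodies);
        collect(doc.getElementsByTagNameNS(ATOM_NS, "content"), bodies);
        return bodies;
    }

    // Appends the trimmed text content (CDATA included) of each node.
    static void collect(NodeList nodes, List<String> out) {
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss version=\"2.0\" xmlns:content=\"" + CONTENT_NS + "\">"
                + "<channel><item><title>Post</title>"
                + "<content:encoded><![CDATA[<p>Full post body</p>]]></content:encoded>"
                + "</item></channel></rss>";
        System.out.println(extractBodies(rss));
    }
}
```

Note that for Atom <content type="xhtml"> this only returns the text inside the nested markup, not the markup itself.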



> 
> 
> -----Original Message-----
> From: Dima Gritsenko [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 28, 2006 10:44 AM
> To: [email protected]
> Subject: RSS search by nutch
> 
> Hi,
> 
> Does nutch have a class for searching incoming RSS feeds in real time?
> Thank you. 
> Dima. 


