Re: Indexing Feeds & Blog Posts with Nutch

Rick Moynihan Mon, 15 Oct 2007 02:44:46 -0700

Pike wrote:

Hi Ricky, Chris

I've not noticed much difference, with both plugins failing on the feedburner feed:
- http://feeds.feedburner.com/Techcrunch


Strange, but that feed is indeed invalid xml if I wget it.
It starts with newlines and ends with comments. Very
picky, but that's not allowed afaik.

Yes, I did a wget shortly after posting and the feed is clearly invalid XML; however, unfortunate as it is, the web is full of invalid XML feeds that still need to be parsed somehow. To paraphrase the well-known mantra, my feeling is that these plugins need to be more liberal in what they accept and people need to be more conservative in what they produce.
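
To make "liberal in what they accept" concrete, here is a minimal sketch in Python (not Nutch plugin code). It handles only the leading-newlines case described above, by stripping a byte-order mark and whitespace before handing the bytes to a strict parser. The function name parse_feed_leniently is hypothetical, chosen for illustration.

```python
import xml.etree.ElementTree as ET


def parse_feed_leniently(raw: bytes) -> ET.Element:
    """Parse feed bytes, tolerating junk whitespace before the XML declaration.

    Hypothetical helper for illustration; it is not part of Nutch or any
    feed plugin.
    """
    # A strict parser rejects an XML declaration that is not at the very
    # start of the document, so drop a UTF-8 BOM and leading whitespace.
    cleaned = raw.lstrip(b"\xef\xbb\xbf \t\r\n")
    return ET.fromstring(cleaned)


if __name__ == "__main__":
    sample = b'\n\n<?xml version="1.0"?><rss><channel><title>T</title></channel></rss>'
    root = parse_feed_leniently(sample)
    print(root.tag)
```

A real plugin would likely need further fallbacks (for example, a tolerant tag-soup style parser) for feeds that are broken in other ways; this only shows the idea of cleaning input before strict parsing.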

Another problem I seem to have just now is that some of the search results link to their XML feeds, rather than to the destination of their items.
I have this with all results: what is indexed
seems to be 1 record per feed, containing a
parsed version of the content including all its items,
with sometimes bits of xml and html markup in it.

I was assuming this is the intended behaviour ?

It may well be the intended behaviour, but it's not the behaviour I want. The strategy I'd like to employ is the strategy you mention trying to get going on another thread; i.e. to crawl feed items (rather than the feed) with a depth of 1.
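
As a rough illustration of the "crawl the items, not the feed" idea, the sketch below (Python, outside Nutch) pulls each item's link out of an RSS 2.0 feed so the links could be written one per line as seed URLs for a depth-1 crawl. The function name item_links and the example.com URLs are made up for the example, and it assumes plain RSS 2.0 <item>/<link> elements; Atom feeds would need different handling.

```python
import xml.etree.ElementTree as ET


def item_links(feed_xml):
    """Return the <link> of every <item> in an RSS 2.0 document.

    Hypothetical helper for illustration; assumes un-namespaced RSS 2.0.
    """
    root = ET.fromstring(feed_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        # Skip items with a missing or empty link element.
        if link and link.strip():
            links.append(link.strip())
    return links


if __name__ == "__main__":
    feed = """<rss version="2.0"><channel>
      <title>Example</title>
      <link>http://example.com/</link>
      <item><title>One</title><link>http://example.com/one</link></item>
      <item><title>Two</title><link> http://example.com/two </link></item>
      <item><title>No link</title></item>
    </channel></rss>"""
    # One URL per line, usable as a seed list.
    print("\n".join(item_links(feed)))
```

Note that the channel's own link is deliberately ignored, since only the item destinations are wanted in the index.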


If you manage to do this successfully then I'd love to hear how.

R.
