Hi Jack,

  I'm not sure that this is a "bug" per se: it's just that several different 
content types are possible when any ol' webserver returns an RSS file. To be 
honest, I performed a pretty detailed crawl (100s of thousands of pages) when 
I originally wrote the plugin way back in March/April of this year, and the 
two content types that the code checks for right now are the ones I found 
webservers most commonly returned for RSS. However, in no way did I mean for 
that list to be exhaustive: for instance, web servers may also return 
"application/rss" or "text/rss", and I have even seen "text/plain" for RSS. It 
all depends on how the webmaster has configured the web server. So it's 
difficult to accurately and reliably discriminate on the content type within 
a parser plugin itself, because what gets returned for a particular type of 
file is inherently out of the parser's hands. Even though there are some 
"best practices" for what should be returned for different file types, there 
are by no means any "standards" that must be followed.

So, I would propose the following. I believe the block of code that checks the 
content type and then fails with an error message exists in other plugins in 
Nutch as well. I propose we nix that: remove the content type checking and 
error message from the plugins themselves, and move it up to a higher level, 
i.e., the actual plugin factory or something. Let it get taken care of there, 
and let it be configurable, outside the code of each plugin. That way, I 
believe you can customize any plugin to fit your needs * without * having to 
recompile the code just to add another accepted content type to a plugin so 
it doesn't report an error.
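
A rough sketch of what such a configurable check could look like, in plain 
Java. The class name ContentTypeFilter, and the idea of feeding it a 
comma-separated list read from some configuration property, are assumptions 
made for illustration only; none of this is existing Nutch code. It strips 
parameters such as "; charset=ISO-8859-1" before comparing, and, like the 
current in-plugin check, lets a missing content type through.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical helper (not part of Nutch): the accepted content types are
 * supplied from configuration instead of being hard-coded in each plugin.
 */
class ContentTypeFilter {
    private final Set<String> accepted = new LinkedHashSet<String>();

    /** @param acceptedCsv comma-separated content types, e.g. from a config file */
    ContentTypeFilter(String acceptedCsv) {
        for (String type : acceptedCsv.split(",")) {
            String normalized = normalize(type);
            if (!normalized.isEmpty()) {
                accepted.add(normalized);
            }
        }
    }

    /** Drops parameters such as "; charset=..." and lower-cases the media type. */
    static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** A null content type is accepted, mirroring the existing plugin check. */
    boolean accepts(String contentType) {
        if (contentType == null) {
            return true;
        }
        return accepted.contains(normalize(contentType));
    }

    public static void main(String[] args) {
        // The list below stands in for a value read from configuration.
        ContentTypeFilter filter =
            new ContentTypeFilter("text/xml, application/rss+xml, application/xml");
        System.out.println(filter.accepts("application/xml; charset=ISO-8859-1"));
        System.out.println(filter.accepts("text/html"));
    }
}
```

With something along these lines, accepting "application/xml" for RSS would be 
a one-line configuration change instead of a recompile of the plugin.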

What say you guys? :-)

Cheers,
  Chris


----- Original Message -----
From: Jack Tang <[EMAIL PROTECTED]>
Date: Wednesday, September 7, 2005 10:58 pm
Subject: RSS Parser Bug!?

> Hi Guys
> 
> Did someone install parse-rss and try to fetch rss feeds?
> It failed on my side. I enabled the plugin and the fetch worked, but the
> rss parser did not work.
> My feed is http://www.craigslist.org/evs/index.rss
> 
> Here is the error:
> 
> org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but
> can't parse http://beijing.craigslist.org/jjj/index.rss, reason:
> failed(2,203): Content-Type not text/html: application/xml;
> charset=ISO-8859-1
> 
> The content-type is application/xml. Mattmann's comment is this:
>        // check that contentType is one we can handle
>        String contentType = content.getContentType();
>        if (contentType != null
>                && !contentType.startsWith("text/xml")
>                && !contentType.startsWith("application/rss+xml"))
>            return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
>                    "Content-Type not text/xml or application/rss+xml: "
>                            + contentType).getEmptyParse();
> 
> So, it does not support the "application/xml" content type yet?
> 
> 
> Thanks
> /Jack
> -- 
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 


