I just took the whole contentType check out of the plugin and added the
following types to separate "implementation" elements in plugin.xml:
text/xml
application/xml
application/rdf
application/rss
application/atom
Some of these end up matching multiple actual content types (e.g.
application/rdf also matches application/rdf+xml), since
ParserFactory.findExtension treats them as prefixes.
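To illustrate the prefix matching, here is a minimal sketch. It is not the
actual ParserFactory.findExtension code; the class name, the method name and
the hard-coded list are made up for illustration, and it assumes a Java
version that has List.of.

```java
import java.util.List;

public class PrefixMatchSketch {
    // Hypothetical stand-in for the content types declared in plugin.xml.
    static final List<String> DECLARED = List.of(
        "text/xml", "application/xml", "application/rdf",
        "application/rss", "application/atom");

    // Returns the first declared type that is a prefix of the actual
    // content type, or null if none of them matches.
    static String firstMatch(String actual) {
        if (actual == null) {
            return null;
        }
        for (String declared : DECLARED) {
            if (actual.startsWith(declared)) {
                return declared;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "application/rdf" is a prefix of "application/rdf+xml".
        System.out.println(firstMatch("application/rdf+xml"));
        // A charset suffix does not get in the way of a prefix match.
        System.out.println(firstMatch("application/xml; charset=ISO-8859-1"));
        // Nothing declared above is a prefix of "text/html".
        System.out.println(firstMatch("text/html"));
    }
}
```

With a prefix match, a header such as "application/xml; charset=ISO-8859-1"
is accepted without listing every charset variant separately.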
Jack Tang wrote:
Hi Chris
Thanks for your explanation.
I want the parse-rss plugin to accept the "application/xml" content type,
so I added the following statement:
if (contentType != null
    && (!contentType.startsWith("text/xml")
        && !contentType.startsWith("application/rss+xml")
        && !contentType.startsWith("application/xml")))
  return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
      "Content-Type not text/xml, application/xml or application/rss+xml: "
          + contentType).getEmptyParse();
But unfortunately, it failed again. Here is the error message:
-------------------------------------------------------------------------------------------------------------------------
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.proxy.host = null
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.proxy.port = 8080
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.timeout = 10000
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.content.limit = 65536
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.agent = NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.auth.ntlm.username =
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - fetcher.server.delay = 1000
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.max.delays = 100
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - Configured Client
050908 231023 org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
    at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
    at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
    at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:262)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
    at net.recruit.fetch.JobCrawlTool.main(JobCrawlTool.java:150)
Could this be a conflict between plugins?
My question is: can parse-rss support "application/xml" or other
content types?
Thanks
/Jack
On 9/8/05, CHRIS A MATTMANN <[EMAIL PROTECTED]> wrote:
Hi Jack,
I'm not sure that this is a "bug" per se: it's just that several different
content types are possible when any ol' webserver returns an RSS file. To be
honest, I performed a pretty detailed crawl (100s of thousands of pages) when
I originally wrote the plugin back in March/April of this year, and the two
content types that the code currently checks for are the ones I found to be
most commonly returned by webservers for RSS. However, in no way did I mean
for that list to be exhaustive: for instance, web servers may also return
"application/rss" or "text/rss", and I have even seen "text/plain" for RSS.
It all depends on how the webmaster has configured the web server. So it's
difficult to accurately and reliably discriminate on the content type within
a parser plugin itself, because what gets returned for a particular type of
file is inherently out of the parser's hands, and even though there are some
"best practices" for what should be returned for different file types, there
are by no means any "standards" that must be followed.
So, I would propose the following. I believe the block of code that checks
the content type and then throws an exception exists in other plugins in
Nutch as well. I propose we nix that: remove the content type checking and
exception message from the plugins themselves, and move it up to a higher
level, i.e., the actual plugin factory or something. Let it get taken care of
there, and let it be configurable, outside the code of each plugin. That way,
I believe you can customize any plugin to do whatever you need, * without *
having to recompile the code just to add another accepted content type so
that the plugin doesn't throw an error message.
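Roughly, the idea could look like the sketch below. This is only an
illustration of a configurable check that lives outside the plugins, not
existing Nutch code: the class name ContentTypeGate and the property name
"parser.accepted.content.types" are made up, and plain java.util.Properties
stands in for whatever configuration mechanism would really be used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class ContentTypeGate {
    private final List<String> acceptedPrefixes = new ArrayList<>();

    // Reads a comma-separated list of accepted content-type prefixes.
    // The property name is hypothetical, chosen only for this sketch.
    public ContentTypeGate(Properties conf) {
        String raw = conf.getProperty("parser.accepted.content.types", "");
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                acceptedPrefixes.add(trimmed);
            }
        }
    }

    // A null content type is let through, mirroring the null check in the
    // plugin code quoted in this thread; otherwise a prefix must match.
    public boolean accepts(String contentType) {
        if (contentType == null) {
            return true;
        }
        for (String prefix : acceptedPrefixes) {
            if (contentType.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("parser.accepted.content.types",
            "text/xml, application/xml, application/rss+xml");
        ContentTypeGate gate = new ContentTypeGate(conf);
        System.out.println(gate.accepts("application/xml; charset=ISO-8859-1"));
        System.out.println(gate.accepts("text/html"));
    }
}
```

Adding another accepted type would then be a configuration change rather
than a recompile of the plugin.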
What say you guys? :-)
Cheers,
Chris
----- Original Message -----
From: Jack Tang <[EMAIL PROTECTED]>
Date: Wednesday, September 7, 2005 10:58 pm
Subject: RSS Parser Bug!?
Hi Guys
Has anyone installed parse-rss and tried to fetch RSS feeds?
It failed on my side. I enabled the plugin and the fetch succeeded, but
the RSS parser did not work.
My feed is http://www.craigslist.org/evs/index.rss
Here is the error:
org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but can't parse http://beijing.craigslist.org/jjj/index.rss, reason: failed(2,203): Content-Type not text/html: application/xml; charset=ISO-8859-1
The content-type is application/xml. Mattmann's comment is this:
// check that contentType is one we can handle
String contentType = content.getContentType();
if (contentType != null
    && (!contentType.startsWith("text/xml")
        && !contentType.startsWith("application/rss+xml")))
  return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
      "Content-Type not text/xml or application/rss+xml: "
          + contentType).getEmptyParse();
So, it does not support the "application/xml" content type yet?
Thanks
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars