Hi Chris,
Thanks for your explanation.
I want the parse-rss plugin to accept the "application/xml" content type, so
I added it to the check:
  if (contentType != null
      && (!contentType.startsWith("text/xml")
          && !contentType.startsWith("application/rss+xml")
          && !contentType.startsWith("application/xml")))
    return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
        "Content-Type not text/xml, application/xml or application/rss+xml: "
            + contentType).getEmptyParse();
But unfortunately, it failed again. Here is the error message:
-------------------------------------------------------------------------------------------------------------------------
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.proxy.host = null
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.proxy.port = 8080
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.timeout = 10000
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.content.limit = 65536
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.agent = NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.auth.ntlm.username =
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - fetcher.server.delay = 1000
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.max.delays = 100
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - Configured Client
050908 231023 org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
    at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
    at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
    at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
    at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
    at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:262)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged. Exiting fetcher.
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
    at net.recruit.fetch.JobCrawlTool.main(JobCrawlTool.java:150)
Could this be a conflict between plugins?
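Another guess, just from reading the trace: the NullPointerException is thrown
in UTF8.writeString, called from Outlink.write, so perhaps one of the outlinks
produced for the feed carries a null string (URL or anchor text). I have not
confirmed this. Below is only a sketch of the kind of guard I have in mind. Link
and LinkSanitizer are made-up stand-ins for illustration, not the real Nutch
classes:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an outlink; NOT the real org.apache.nutch.parse.Outlink.
final class Link {
    final String url;
    final String anchor;

    Link(String url, String anchor) {
        this.url = url;
        this.anchor = anchor;
    }
}

// Hypothetical helper: makes a list of links safe to serialize.
final class LinkSanitizer {
    // Drops links that have no URL and replaces a null anchor with "".
    static List<Link> sanitize(List<Link> links) {
        List<Link> out = new ArrayList<Link>();
        for (Link l : links) {
            if (l == null || l.url == null) {
                continue; // nothing useful to write for this link
            }
            out.add(new Link(l.url, l.anchor == null ? "" : l.anchor));
        }
        return out;
    }
}
```

If the guess is right, cleaning the links like this before they are handed to
the writer would avoid the null string, but it may well be something else.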
My question is: can parse-rss support "application/xml", or other content
types as well?
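
Along the lines of what you suggested in your message quoted below, the accepted
types could come from configuration instead of being hard-coded in the plugin.
Here is a rough sketch of what I mean. ContentTypeFilter is a made-up class, not
anything in Nutch, and where the comma-separated list would come from (some
config property) is an assumption too:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: decides whether a Content-Type
// value is acceptable, based on a configurable list of prefixes.
final class ContentTypeFilter {
    private final List<String> acceptedPrefixes = new ArrayList<String>();

    // commaSeparated is assumed to be read from configuration, e.g.
    // "text/xml, application/rss+xml, application/xml".
    ContentTypeFilter(String commaSeparated) {
        for (String part : commaSeparated.split(",")) {
            String prefix = part.trim().toLowerCase(Locale.ROOT);
            if (prefix.length() > 0) {
                acceptedPrefixes.add(prefix);
            }
        }
    }

    // A null content type is let through, as in the existing check.
    boolean accepts(String contentType) {
        if (contentType == null) {
            return true;
        }
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        for (String prefix : acceptedPrefixes) {
            if (normalized.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With the list set to "text/xml, application/rss+xml, application/xml", a header
like "application/xml; charset=ISO-8859-1" would be accepted because the match
is by prefix, and adding another type would not need a recompile.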
Thanks
/Jack
On 9/8/05, CHRIS A MATTMANN <[EMAIL PROTECTED]> wrote:
> Hi Jack,
>
> I'm not necessarily sure that this is a "bug" per se: it's just the fact
> that several different content types are potentially possible when any ol'
> webserver returns an RSS file. To be honest, I performed a pretty detailed
> crawl (100s of thousands of pages) when I originally wrote the plugin way
> back in March/April of this year, and the two content types that you see in
> the code right now that it checks for are what I found to be the most
> pervasive content type returned from webservers for RSS. However, in no way
> did I mean for that list to be exhaustive: for instance, web servers may also
> return "application/rss", or "text/rss", or even "text/plain" I have seen for
> RSS. It all depends on how the webmaster has configured the web server. So
> it's kind of difficult to accurately and reliably discriminate against the
> content type within a parser plugin itself, because it is inherently out of
> the parser's hands what gets returned for a particular type of file, and even
> though there are some "best practices" for what should be returned for
> different file types, there are by no means any "standards" that must be
> followed.
>
> So, I would propose the following. I believe the checking for the content
> type and then throwing an exception block of code exists in other plugins in
> Nutch as well. I propose we nix that, and remove the content type checking
> and exception message from the plugins themselves, and move it up to a higher
> level, i.e., the actual plugin factory or something. Let it get taken care
> of there, and let it be configurable, out of the code of each plugin for
> instance. Because that way, I believe you can customize whatever plugin to do
> whatever your need is, * without * having to recompile the code just to add
> another accepted content type to a plugin so it doesn't throw an error
> message.
>
> What say you guys? :-)
>
> Cheers,
> Chris
>
>
> ----- Original Message -----
> From: Jack Tang <[EMAIL PROTECTED]>
> Date: Wednesday, September 7, 2005 10:58 pm
> Subject: RSS Parser Bug!?
>
> > Hi Guys
> >
> > Did someone install parse-rss and try to fetch rss feeds?
> > It failed on my side. I enabled the plugin and the fetch succeeded, but
> > the RSS parser did not work.
> > My feed is http://www.craigslist.org/evs/index.rss
> >
> > Here is the error:
> >
> > org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but
> > can't parse http://beijing.craigslist.org/jjj/index.rss, reason:
> > failed(2,203): Content-Type not text/html: application/xml;
> > charset=ISO-8859-1
> >
> > The content-type is application/xml. The check in Mattmann's code is this:
> > // check that contentType is one we can handle
> > String contentType = content.getContentType();
> > if (contentType != null
> > && (!contentType.startsWith("text/xml") &&
> > !contentType.startsWith("application/rss+xml")))
> > return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
> > "Content-Type not text/xml or application/rss+xml: "
> > + contentType).getEmptyParse();
> >
> > So, it does not support the "application/xml" content type yet?
> >
> >
> > Thanks
> > /Jack
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
>
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers