Hi Chris

Thanks for your explanation.
I want the parse-rss plugin to accept the "application/xml" content type, so
I extended the check as follows:

        if (contentType != null
                && !contentType.startsWith("text/xml")
                && !contentType.startsWith("application/rss+xml")
                && !contentType.startsWith("application/xml"))
            return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
                    "Content-Type not text/xml, application/xml or application/rss+xml: "
                            + contentType).getEmptyParse();
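
To avoid recompiling each time another content type shows up, the same prefix test could be driven by a configurable list. The following is only a sketch under assumptions: ContentTypeFilter and fromConfig are hypothetical names, not part of Nutch, and the comparison is lower-cased, which the original check does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not a Nutch class): accepts content types by prefix. */
final class ContentTypeFilter {
    private final List<String> acceptedPrefixes = new ArrayList<>();

    /** Builds a filter from a comma-separated list such as "text/xml, application/xml". */
    public static ContentTypeFilter fromConfig(String commaSeparated) {
        ContentTypeFilter filter = new ContentTypeFilter();
        for (String entry : commaSeparated.split(",")) {
            String trimmed = entry.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                filter.acceptedPrefixes.add(trimmed);
            }
        }
        return filter;
    }

    /** A null content type is accepted, mirroring the "contentType != null" guard above. */
    public boolean accepts(String contentType) {
        if (contentType == null) {
            return true;
        }
        // Assumption: ignore case, so "Application/XML" also matches.
        String normalized = contentType.trim().toLowerCase(Locale.ROOT);
        for (String prefix : acceptedPrefixes) {
            if (normalized.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ContentTypeFilter filter =
                fromConfig("text/xml, application/rss+xml, application/xml");
        System.out.println(filter.accepts("application/xml; charset=ISO-8859-1")); // true
        System.out.println(filter.accepts("text/html"));                           // false
    }
}
```

Because the test is a prefix match, a header such as "application/xml; charset=ISO-8859-1" is accepted without stripping the charset parameter first.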


Unfortunately, it failed again. Here is the error message:
-------------------------------------------------------------------------------------------------------------------------
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.proxy.host = null
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.proxy.port = 8080
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.timeout = 10000
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.content.limit = 65536
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.agent = NutchCVS/0.06-dev (Nutch; http://www.nutch.org/docs/en/bot.html; [EMAIL PROTECTED])
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.auth.ntlm.username =
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - fetcher.server.delay = 1000
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - http.max.delays = 100
050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - Configured Client
050908 231023 org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - SEVERE error writing output:java.lang.NullPointerException
java.lang.NullPointerException
        at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
        at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
        at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:262)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
        at net.recruit.fetch.JobCrawlTool.main(JobCrawlTool.java:150)
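
Reading the trace: the NullPointerException is raised in UTF8.writeString, called from Outlink.write, which suggests that one of the strings an outlink serializes was null when the parse data was written. The log does not show which field it was, so the following is only a sketch of the general guard, under assumptions: SimpleOutlink is a hypothetical stand-in, not the Nutch Outlink class, and java.io.DataOutputStream.writeUTF stands in for the Nutch writer (it also throws on a null string).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for an outlink record; NOT the Nutch Outlink class. */
final class SimpleOutlink {
    private final String toUrl;
    private final String anchor;

    public SimpleOutlink(String toUrl, String anchor) {
        // Fail early, at construction time, instead of later during serialization.
        if (toUrl == null) {
            throw new IllegalArgumentException("toUrl must not be null");
        }
        this.toUrl = toUrl;
        // Assumption: a missing anchor text can be stored as an empty string.
        this.anchor = (anchor == null) ? "" : anchor;
    }

    public String getAnchor() {
        return anchor;
    }

    /** Writes both strings; neither can be null here, so writeUTF cannot throw an NPE. */
    public void write(DataOutputStream out) throws IOException {
        out.writeUTF(toUrl);
        out.writeUTF(anchor);
    }

    /** Convenience wrapper that serializes into an in-memory buffer. */
    public byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        SimpleOutlink link = new SimpleOutlink("http://www.craigslist.org/evs/index.rss", null);
        System.out.println("serialized bytes: " + link.toBytes().length);
    }
}
```

If the parser builds outlinks from feed items, a guard of this kind at the point where they are created would turn a missing title or link into an empty string or an early, descriptive error rather than a failure inside the fetcher's output step.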

Could this be a conflict between plugins?
My question is: can parse-rss support "application/xml" or additional
content types?

Thanks
/Jack

On 9/8/05, CHRIS A MATTMANN <[EMAIL PROTECTED]> wrote:
> Hi Jack,
> 
>   I'm not necessarily sure that this is a "bug" per se: it's just the fact 
> that several different content types are potentially possible when any ol' 
> webserver returns an RSS file. To be honest, I performed a pretty detailed 
> crawl (100s of thousands of pages) when I originally wrote the plugin way 
> back in March/April of this year, and the two content types that you see in 
> the code right now that it checks for are what I found to be the most 
> pervasive content type returned from webservers for RSS. However, in no way 
> did I mean for that list to be exhaustive: for instance, web servers may also 
> return "application/rss", or "text/rss", or even "text/plain" I have seen for 
> RSS. It all depends on how the webmaster has configured the web server. So 
> it's kind of difficult to accurately and reliably discriminate against the 
> content type within a parser plugin itself, because it is inherently out of 
> the parser's hands what gets returned for a particular type of file, and even 
> though there are some "best practices" for what should be returned for 
> different file types, there are by no means any "standards" that must be 
> followed.
> 
> So, I would propose the following. I believe the checking for the content 
> type and then throwing an exception block of code exists in other plugins in 
> Nutch as well. I propose we nix that, and remove the content type checking 
> and exception message from the plugins themselves, and move it up to a higher 
> level, i.e., the actual plugin factory or something. Let it get taken care 
> of there, and let it be configurable, out of the code of each plugin for 
> instance. Because that way, I believe you can customize whatever plugin to do 
> whatever your need is, * without * having to recompile the code just to add 
> another accepted content type to a plugin so it doesn't throw an error 
> message.
> 
> What say you guys? :-)
> 
> Cheers,
>   Chris
> 
> 
> ----- Original Message -----
> From: Jack Tang <[EMAIL PROTECTED]>
> Date: Wednesday, September 7, 2005 10:58 pm
> Subject: RSS Parser Bug!?
> 
> > Hi Guys
> >
> > Did someone install parse-rss and try to fetch rss feeds?
> > It failed on my side. I enabled the plugin and the feed was fetched, but
> > the rss parser did not work.
> > My feed is http://www.craigslist.org/evs/index.rss
> >
> > Here is the error:
> >
> > org.apache.nutch.fetcher.Fetcher$FetcherThread [11] - fetch okay, but
> > can't parse http://beijing.craigslist.org/jjj/index.rss, reason:
> > failed(2,203): Content-Type not text/html: application/xml;
> > charset=ISO-8859-1
> >
> > The content-type is application/xml. Mattmann's check in the plugin is this:
> >        // check that contentType is one we can handle
> >        String contentType = content.getContentType();
> >        if (contentType != null
> >                && !contentType.startsWith("text/xml")
> >                && !contentType.startsWith("application/rss+xml"))
> >            return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
> >                    "Content-Type not text/xml or application/rss+xml: "
> >                            + contentType).getEmptyParse();
> >
> > So, it does not support the "application/xml" content type yet?
> >
> >
> > Thanks
> > /Jack
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers