Here is another example that keeps saying it can't parse it...
SegmentReader: get ' http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'

Content::
Version: 2
url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
base: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
contentType:
metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5
Content:

These are the headers:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:38:15 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Window-Target: _top
X-Highwire-SessionId: nh2ukcdpv1.JS1
Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, that's it... any ideas?

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
So, here is one: http://hea.sagepub.com/cgi/alerts

Segment Reader reports:

Content::
Version: 2
url: http://hea.sagepub.com/cgi/alerts
base: http://hea.sagepub.com/cgi/alerts
contentType:
metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.041666668
Content:

I notice that when I try to crawl that URL specifically, the job fails with an
ArrayIndexOutOfBoundsException (-1). But if I fetch it with curl, like:

curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt

I get content, and the headers are:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:03:28 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, I'm lost.

On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> > So, I have been having huge problems with parsing. It seems that many
> > URLs are being ignored because the parser plugins throw an exception
> > saying there is no parser found for what is, reportedly, an
> > unresolved contentType. So, if you look at the exception:
> >
> > org.apache.nutch.parse.ParseException: parser not found for
> > contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
> >
> > you can see that it says the contentType is "". But if you look at
> > the headers for this request, you can see that the Content-Type header
> > is set to "text/html":
> >
> > HTTP/1.1 200 OK
> > Date: Fri, 01 Jun 2007 13:54:19 GMT
> > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > Cache-Control: no-store
> > X-Highwire-SessionId: y1851mbb91.JS1
> > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> > Transfer-Encoding: chunked
> > Content-Type: text/html
> >
> > Is there something that I have set up wrong? This happens on a LOT of
> > pages/sites.
> > My current plugins are set to:
> >
> > protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > Here is another URL:
> >
> > http://www.bionews.org.uk/
> >
> > Same issue with parsing (parser not found for contentType=
> > url=http://www.bionews.org.uk/), but the header says:
> >
> > HTTP/1.0 200 OK
> > Server: Lasso/3.6.5 ID/ACGI
> > MIME-Version: 1.0
> > Content-type: text/html
> > Content-length: 69417
> >
> > Any clues? Does nutch look at the headers or not?
>
> Can you do a
>
> bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate
>
> and send the result? This should show us what nutch fetched as content.
>
> > --
> > "Conscious decisions by conscious minds are what make reality real"
>
> --
> Doğacan Güney

--
"Conscious decisions by conscious minds are what make reality real"
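[Editor's note: the thread hinges on a Content-Type that arrives in the HTTP headers but ends up empty by the time the parser looks it up. A defensive pattern in this situation is to prefer the header value but fall back to content sniffing when it is blank. The sketch below is NOT Nutch's actual code; it is a minimal illustration using the standard JDK sniffer `java.net.URLConnection.guessContentTypeFromStream`, with the helper name `resolveContentType` invented for this example.]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContentTypeFallback {

    // Hypothetical helper: trust the Content-Type header if present,
    // otherwise sniff the first bytes of the fetched body.
    static String resolveContentType(String headerValue, byte[] body)
            throws IOException {
        if (headerValue != null && !headerValue.trim().isEmpty()) {
            return headerValue;
        }
        // ByteArrayInputStream supports mark/reset, which the sniffer needs.
        try (InputStream in = new ByteArrayInputStream(body)) {
            String guessed = java.net.URLConnection.guessContentTypeFromStream(in);
            return (guessed != null) ? guessed : "application/octet-stream";
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hi</body></html>".getBytes("UTF-8");
        // Empty header, HTML body: the JDK sniffer recognizes the <html> tag.
        System.out.println(resolveContentType("", html));
        // Header present: it wins, no sniffing needed.
        System.out.println(resolveContentType("text/html", new byte[0]));
        // Nothing to go on: fall back to a generic binary type.
        System.out.println(resolveContentType(null, new byte[0]));
    }
}
```

If the empty contentType in the segment really reflects an empty fetched body (as the blank `Content:` in the SegmentReader dumps above suggests), sniffing would not help either; the fix would have to be on the fetch side, e.g. in how the protocol plugin handles the chunked responses these servers send.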
-- "Conscious decisions by conscious minds are what make reality real"
