Looking into the first URL. Don't look at the second; I screwed up on that. It's a Disallow, so it's a bad example. I'm working on finding the segment for the first. Thanks for your quick response, I'll get right back to you.
<http://www.bionews.org.uk/>

On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
Hi,

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> So, I have been having huge problems with parsing. It seems that many
> URLs are being ignored because the parser plugins throw an exception
> saying there is no parser found for what is reportedly an
> unresolved contentType. So, if you look at the exception:
>
> org.apache.nutch.parse.ParseException: parser not found for contentType= url= http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
>
> You can see that it says the contentType is "". But if you look at
> the headers for this request, you can see that the Content-Type header
> is set to "text/html":
>
> HTTP/1.1 200 OK
> Date: Fri, 01 Jun 2007 13:54:19 GMT
> Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> Cache-Control: no-store
> X-Highwire-SessionId: y1851mbb91.JS1
> Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> Transfer-Encoding: chunked
> Content-Type: text/html
>
> Is there something that I have set up wrong? This happens on a LOT of
> pages/sites. My current plugins are set to:
>
> "protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> Here is another URL:
>
> http://www.bionews.org.uk/
>
> Same issue with parsing (parser not found for contentType=
> url=http://www.bionews.org.uk/), but the header says:
>
> HTTP/1.0 200 OK
> Server: Lasso/3.6.5 ID/ACGI
> MIME-Version: 1.0
> Content-type: text/html
> Content-length: 69417
>
> Any clues? Does Nutch look at the headers or not?

Can you do a

bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate

and send the result? This should show us what Nutch fetched as content.

> --
> "Conscious decisions by conscious minds are what make reality real"

--
Doğacan Güney
-- "Conscious decisions by conscious minds are what make reality real"
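
For anyone comparing the two header dumps quoted above by hand: the second server sends "Content-type" with a lowercase "t", while the first sends "Content-Type". HTTP header names are case-insensitive, so both should resolve to the same media type. The following is only a minimal Python sketch of that check, not how Nutch itself resolves content types; the function name content_type_of is hypothetical, and the input is assumed to be a raw response head pasted in as text.

```python
from email.parser import HeaderParser


def content_type_of(raw_response_head: str) -> str:
    """Return the media type from a raw HTTP response head, or "" if absent."""
    # Drop the status line ("HTTP/1.1 200 OK"); the rest is RFC 822-style headers.
    _, _, header_block = raw_response_head.partition("\n")
    headers = HeaderParser().parsestr(header_block)
    # Header lookup is case-insensitive, so "Content-type" also matches.
    value = headers.get("Content-Type", "")
    # Strip any parameters such as "; charset=utf-8".
    return value.split(";", 1)[0].strip().lower()


# Header dumps abbreviated from the message above.
first = (
    "HTTP/1.1 200 OK\n"
    "Date: Fri, 01 Jun 2007 13:54:19 GMT\n"
    "Transfer-Encoding: chunked\n"
    "Content-Type: text/html\n"
)
second = (
    "HTTP/1.0 200 OK\n"
    "Server: Lasso/3.6.5 ID/ACGI\n"
    "MIME-Version: 1.0\n"
    "Content-type: text/html\n"
    "Content-length: 69417\n"
)

print(content_type_of(first))   # text/html
print(content_type_of(second))  # text/html
```

Both dumps yield "text/html" here, which only confirms that the servers' headers look fine; the output of the readseg command requested above is still what would show what was actually stored in the segment.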
