Here is another example that keeps saying it can't parse it...
SegmentReader: get ' http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'

Content::
Version: 2
url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
base: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
contentType:
metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5
Content:

These are the headers:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:38:15 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Window-Target: _top
X-Highwire-SessionId: nh2ukcdpv1.JS1
Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, that's it... any ideas?

On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
So, here is one: http://hea.sagepub.com/cgi/alerts

Segment Reader reports:

Content::
Version: 2
url: http://hea.sagepub.com/cgi/alerts
base: http://hea.sagepub.com/cgi/alerts
contentType:
metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.041666668
Content:

I notice that when I try to crawl that URL specifically, the job fails with an
ArrayIndexOutOfBoundsException (-1). But if I fetch it with curl, like:

curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt

I get content, and the headers are:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:03:28 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, I'm lost.

On 6/1/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
> > So, I have been having huge problems with parsing. It seems that many
> > URLs are being ignored because the parser plugins throw an exception
> > saying there is no parser found for what is, reportedly, an
> > unresolved contentType. So, if you look at the exception:
> >
> > org.apache.nutch.parse.ParseException: parser not found for
> > contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
> >
> > you can see that it says the contentType is "". But if you look at
> > the headers for this request, you can see that the Content-Type header
> > is set to "text/html":
> >
> > HTTP/1.1 200 OK
> > Date: Fri, 01 Jun 2007 13:54:19 GMT
> > Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
> > Cache-Control: no-store
> > X-Highwire-SessionId: y1851mbb91.JS1
> > Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
> > Transfer-Encoding: chunked
> > Content-Type: text/html
> >
> > Is there something that I have set up wrong? This happens on a LOT of
> > pages/sites.
> > My current plugins are set to:
> >
> > protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
> >
> > Here is another URL:
> >
> > http://www.bionews.org.uk/
> >
> > Same issue with parsing (parser not found for contentType=
> > url=http://www.bionews.org.uk/), but the header says:
> >
> > HTTP/1.0 200 OK
> > Server: Lasso/3.6.5 ID/ACGI
> > MIME-Version: 1.0
> > Content-type: text/html
> > Content-length: 69417
> >
> > Any clues? Does nutch look at the headers or not?
>
> Can you do a
>
> bin/nutch readseg -get <segment> <url> -noparse -noparsetext -noparsedata -nofetch -nogenerate
>
> and send the result? This should show us what nutch fetched as content.
>
> > --
> > "Conscious decisions by conscious minds are what make reality real"
>
> --
> Doğacan Güney

--
"Conscious decisions by conscious minds are what make reality real"
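[Editor's note: the thread hinges on a Content-Type that arrives in the HTTP headers but ends up empty by the time the parser looks it up. A defensive pattern in this situation is to prefer the header value but fall back to content sniffing when it is blank. The sketch below is NOT Nutch's actual code; it is a minimal illustration using the standard JDK sniffer `java.net.URLConnection.guessContentTypeFromStream`, with the helper name `resolveContentType` invented for this example.]

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContentTypeFallback {

    // Hypothetical helper: trust the Content-Type header if present,
    // otherwise sniff the first bytes of the fetched body.
    static String resolveContentType(String headerValue, byte[] body)
            throws IOException {
        if (headerValue != null && !headerValue.trim().isEmpty()) {
            return headerValue;
        }
        // ByteArrayInputStream supports mark/reset, which the sniffer needs.
        try (InputStream in = new ByteArrayInputStream(body)) {
            String guessed = java.net.URLConnection.guessContentTypeFromStream(in);
            return (guessed != null) ? guessed : "application/octet-stream";
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hi</body></html>".getBytes("UTF-8");
        // Empty header, HTML body: the JDK sniffer recognizes the <html> tag.
        System.out.println(resolveContentType("", html));
        // Header present: it wins, no sniffing needed.
        System.out.println(resolveContentType("text/html", new byte[0]));
        // Nothing to go on: fall back to a generic binary type.
        System.out.println(resolveContentType(null, new byte[0]));
    }
}
```

If the empty contentType in the segment really reflects an empty fetched body (as the blank `Content:` in the SegmentReader dumps above suggests), sniffing would not help either; the fix would have to be on the fetch side, e.g. in how the protocol plugin handles the chunked responses these servers send.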
-- "Conscious decisions by conscious minds are what make reality real"
