Hi,
On 6/1/07, Briggs <[EMAIL PROTECTED]> wrote:
So, I have been having huge problems with parsing. It seems that many
URLs are being ignored because the parser plugins throw an exception
saying no parser was found for what is, reportedly, an
unresolved contentType. So, if you look at the exception:
org.apache.nutch.parse.ParseException: parser not found for
contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
You can see that it says the contentType is "". But, if you look at
the headers for this response you can see that the Content-Type header
is set to "text/html":
HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html
Is there something that I have set up wrong? This happens on a LOT of
pages/sites. My current plugins are set to:
protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
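(For context, the `plugin.includes` list above is normally set as a property in
conf/nutch-site.xml; the sketch below just shows that list in its usual XML
form, nothing beyond what is quoted above:)

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```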
Here is another URL:
http://www.bionews.org.uk/
Same issue with parsing (parser not found for contentType=
url=http://www.bionews.org.uk/), but the header says:
HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417
Any clues? Does nutch look at the headers or not?
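(To illustrate what "looking at the headers" would mean here: this is not how
Nutch itself resolves content types, just a minimal Python sketch of the idea
of preferring the server header and falling back to guessing from the URL
extension. The `resolve_content_type` helper is hypothetical:)

```python
import mimetypes

def resolve_content_type(header_value, url):
    # Prefer the Content-Type header the server sent, dropping any
    # parameters such as "; charset=utf-8".
    if header_value:
        return header_value.split(";")[0].strip()
    # Header missing or empty: fall back to guessing from the URL
    # extension; return "" when nothing can be guessed.
    guessed, _ = mimetypes.guess_type(url)
    return guessed or ""

# A URL with no recognizable extension and an empty header would still
# yield "", which is the situation described in the exception above.
```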
Can you run
bin/nutch readseg -get <segment> <url> -noparse -noparsetext
-noparsedata -nofetch -nogenerate
and send the result? This should show us what Nutch fetched as content.
--
"Conscious decisions by conscious minds are what make reality real"
--
Doğacan Güney