Hmm.
I tried both:
setting http.content.limit to 5 MB, and another time to -1.
The result is always the same.
It still says:
 fetch okay, but can't parse 
 http://www.uni-koeln.de/uni/map.html, reason: failed(2,203): 
 Content-Type not application/pdf:

Any other idea what's going wrong?

greetings
peter



RE: try to parse pdf

Richard Braman
Mon, 13 Mar 2006 09:22:02 -0800

That error is actually not from the http content limit, but I would
recommend setting the content limit to -1.  For some reason this error
seems to happen sometimes even after you add the PDF parsing plugin like
you did.  I think Nutch must cache the plugin properties in
nutch-default.  It will start to parse PDFs at some point.
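
For reference, a minimal sketch of what that override might look like,
assuming it goes in nutch-site.xml next to the plugin.includes property
quoted further down in this thread. The description text here is
paraphrased rather than copied from nutch-default.xml, and as noted above
this setting by itself may not make the error go away.

```xml
<!-- Sketch only: assumed to sit inside the root element of nutch-site.xml -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value is understood to disable truncation of fetched content -->
  <value>-1</value>
  <description>Paraphrased: maximum length of downloaded content, in bytes;
  a negative value means no truncation.</description>
</property>
```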

-----Original Message-----
From: Jeff Pettenski [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 13, 2006 10:30 AM
To: [email protected]
Subject: Re: try to parse pdf


Peter,

Check http.content.limit. If the PDF's size exceeds that limit, you will
get the error you are describing.

On 3/13/06, Peter Swoboda <[EMAIL PROTECTED]> wrote:
>
> Hi
> I tried to crawl with the PDF plugin included.
> It doesn't seem to work.
> Does anyone know what could be the problem?
>
> nutch-site.xml is
> ..
> <property>
>   <name>plugin.includes</name>
>   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|language-identifier</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
> ..
>
> seems to be included:
> 060313 134732 parsing: /home/../plugins/parse-pdf/plugin.xml
> 060313 134732 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.pdf.PdfParser
>
> but:
> 060313 134822 fetch okay, but can't parse 
> http://www.uni-koeln.de/uni/map.html, reason: failed(2,203): 
> Content-Type not application/pdf:
>
>



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
