Hmm. I tried both: setting http.content.limit to 5 MB, and another time to -1. The result is always the same. It still says:

fetch okay, but can't parse http://www.uni-koeln.de/uni/map.html, reason: failed(2,203): Content-Type not application/pdf:
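For reference, a minimal sketch of how the -1 setting might look in nutch-site.xml. The property name comes from this thread; the placement inside the file's existing root element and the comments are assumptions, and nothing here is confirmed to resolve the parse error.

```xml
<!-- Sketch only: goes inside the existing root element of nutch-site.xml,
     next to the plugin.includes property. -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value such as -1 turns truncation off;
       a non-negative value is a limit in bytes. -->
  <value>-1</value>
</property>
```

For the 5 MB variant the value would be 5242880, assuming 1 MB is counted as 1024 * 1024 bytes.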
Any other idea what's going wrong?

Greetings,
Peter

RE: try to parse pdf
Richard Braman
Mon, 13 Mar 2006 09:22:02 -0800

That error is actually not caused by the http content limit, but I would still recommend setting the content limit to -1. For some reason this error seems to happen sometimes even after you add the PDF parsing plugin like you did. I think Nutch must cache the plugin properties in nutch-default. It will start to parse PDFs at some point.

-----Original Message-----
From: Jeff Pettenski [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2006 10:30 AM
To: [email protected]
Subject: Re: try to parse pdf

Peter,

Check the http.content.limit; if the PDF size exceeds that limit, you will get the error you are describing.

On 3/13/06, Peter Swoboda <[EMAIL PROTECTED]> wrote:
>
> Hi
> I tried to crawl including the pdf plugin.
> doesn't seem to work.
> Does anyone know what could be the problem?
>
> nutch-site.xml is
> ..
> <property>
>   <name>plugin.includes</name>
>   <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|language-identifier</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints
>   plugin. By default Nutch includes crawling just HTML and plain
>   text via HTTP, and basic indexing and search plugins.
>   </description>
> </property>
> ..
>
> seems to be included:
> 060313 134732 parsing: /home/../plugins/parse-pdf/plugin.xml
> 060313 134732 impl: point=org.apache.nutch.parse.Parser
> class=org.apache.nutch.parse.pdf.PdfParser
>
> but:
> 060313 134822 fetch okay, but can't parse
> http://www.uni-koeln.de/uni/map.html, reason: failed(2,203):
> Content-Type not application/pdf:

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
