I am not sure, but I think the maximum PDF size is controlled by this property:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
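For reference, a minimal nutch-site.xml override might look like the sketch below. This is an assumption based on the property names in nutch-default.xml, not something tested against your setup; note that file.content.limit applies to file:// URLs, while http.content.limit (the 65536 default mentioned below) applies to http:// fetches, so for a crawl of http://localhost:8080/ the second one is probably the relevant limit:

```xml
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation of downloaded http content -->
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <!-- same idea for the file:// protocol -->
    <value>-1</value>
  </property>
</configuration>
```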
2008/1/17, Krishnamohan Meduri <[EMAIL PROTECTED]>:
> Hi Martin,
>
> Thanks for the response.
> My pdf file size is much less than the default 65536:
>   <name>http.content.limit</name>
>   <value>65536</value>
>
> Can you suggest anything else?
>
> thanks,
> Krishna.
>
> Martin Kuen wrote:
> > Hi,
> >
> > What comes to my mind is that there is a setting for the maximum size
> > of a downloaded file. Have a look at "nutch-default.xml" and override
> > it in "nutch-site.xml". PDF files tend to be quite big (compared to
> > HTML), so this is probably the source of your problem. PDF files are
> > downloaded and may get truncated; however, the PDF parser cannot
> > handle truncated PDF files (truncated HTML files are okay). If that's
> > the case, you should see a warning in the log file.
> >
> > So, you should try to increase/modify the logging level/settings in
> > order to see what is happening. Have a look at "log/hadoop.log"; these
> > logging statements are valuable information regarding your problem.
> > Logging is controlled via "conf/log4j.properties" if you're not
> > running Nutch in a servlet container (you may still control logging
> > from the same place, but I think that's rarely done). In the mentioned
> > hadoop.log file you'll also see which plugins are loaded.
> >
> > By the way, you don't need to mess around with compilation to get this
> > running (just looking at the link...).
> >
> > Hope it helps,
> >
> > Martin
> >
> > PS: This kind of question should be asked on the nutch-user list, not
> > dev. Reposted this on user.
> > PPS: I think you should subscribe to the mailing list... it's useful,
> > really ;)
> >
> > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I want the crawler to fetch pdf files also. I set the url to
> > http://localhost:8080/ and I have several html and pdf files in my
> > document root.
> >
> > The crawler is able to fetch html files but not pdf files.
> > I saw
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html
> >
> > In <nutch_home>/nutch-site.xml, I added the following:
> > ---------
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>description</description>
> > </property>
> > ---------
> >
> > I installed Nutch 0.9 and I see all plugins, including parse-pdf, in
> > the plugins directory, so I thought I didn't have to do anything else.
> >
> > It doesn't work. Can you please help?
> >
> > PS: I am not on any mailing list. Can you please CC me on your replies.
> >
> > thanks,
> > Krishna.
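One more idea, not from the thread: a quick way to check Martin's truncation theory is to look at the end of a fetched PDF, since a well-formed PDF ends with an %%EOF trailer and a file cut off by http.content.limit usually will not. A minimal POSIX-shell sketch (the two printf lines just create stand-in files for illustration; in practice you would run the tail/grep check on a PDF dumped from the crawl):

```shell
# Stand-ins for a complete and a truncated PDF download
printf '%%PDF-1.4\n...content...\n%%%%EOF\n' > complete.pdf
printf '%%PDF-1.4\n...content...\n' > truncated.pdf

# A well-formed PDF ends with an %%EOF marker near the end of the file;
# a download truncated by a content limit usually does not.
check_pdf() {
  if tail -c 32 "$1" | grep -q '%%EOF'; then
    echo "$1: looks complete"
  else
    echo "$1: possibly truncated"
  fi
}

check_pdf complete.pdf    # looks complete
check_pdf truncated.pdf   # possibly truncated
```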
