Re: Help: parsing pdf files

Krishnamohan Meduri Thu, 17 Jan 2008 02:28:24 -0800

Hi Martin,

Thanks for the response.
My pdf file size is much less than the default limit of 65536 bytes:
  <name>http.content.limit</name>
  <value>65536</value>


Can you suggest anything else?

thanks,
Krishna.

Martin Kuen wrote:

Hi,
what comes to my mind is that there is a setting for the maximum size of
a downloaded file.
Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
pdf-files tend to be quite big (compared to html), so probably this is
the source of your problem. pdf files are downloaded and may get
truncated - however the pdf parser cannot handle these truncated pdf
files. (truncated html files are okay) If that's the case you should see
a warning in the log file.
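
As a rough sketch of the override described above (not taken from the
original mails): the property name http.content.limit is the one quoted
elsewhere in this thread, while the value below is only an illustrative
assumption. The property would sit inside the <configuration> element of
nutch-site.xml. As far as I know, the limit is in bytes and a negative
value turns truncation off.

```xml
<!-- nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- illustrative value only (1 MiB); a negative value such as -1
       should disable truncation altogether -->
  <value>1048576</value>
</property>
```

If the pdf files are already smaller than the limit in effect, this
change will not help and the log file is the next place to look.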
So, you should try to increase/modify the logging level/settings in
order to see what is happening. Have a look at "log/hadoop.log". These
logging statements are valuable information regarding your problem.
Logging is controlled via "conf/log4j.properties" - if you're not
running nutch in a servlet container. (ok - you still may control
logging from the same place, but I think that's hardly done (?) ). In
the mentioned hadoop.log file you'll also see which plugins are loaded.
btw. you don't need to "mess" around with compilation in order to get
this running. (Just looking at the link . . .)
Hope it helps,

Martin
PS: This kind of question should be asked on the nutch-user list, not
dev. Reposted this on user.
PPS: I think you should subscribe to the mailing list . . . it's useful,
really ;)

On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
    Hello,

    I want crawler to fetch pdf files also. I set the url to be
    http://localhost:8080/ and I have several html and pdf files in my
    document root.

    crawler is able to fetch html files but not pdf files.
    I saw
    http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html

    In <nutch_home>/nutch-site.xml, I added the following:
    ---------
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>description</description>
    </property>
    ---------

    I installed nutch 0.9 and I see all plugins including parse-pdf in
    plugins directory. So thought I don't have to do anything else.

    It doesn't work. Can you pls help.

    PS: I am not on any mailing list. Can you pls CC me on your replies.

    thanks,
    Krishna.
