Hi Martin,
Thanks for the response.
My PDF files are much smaller than the default limit of 65536 bytes:
<name>http.content.limit</name>
<value>65536</value>
Can you suggest anything else?
thanks,
Krishna.
Martin Kuen wrote:
Hi,
What comes to mind is that there is a setting for the maximum size of
a downloaded file.
Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
PDF files tend to be quite big (compared to HTML), so this is probably
the source of your problem.
PDF files are downloaded and may get truncated; however, the PDF parser
cannot handle truncated PDF files (truncated HTML files are okay).
If that's the case, you should see a warning in the log file.
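As a rough sketch of such an override: assuming the setting in question is http.content.limit (the property quoted at the top of this thread) and that a negative value turns truncation off, as its description in nutch-default.xml indicates, something like the following could go in "nutch-site.xml". The value shown is only an illustration.

```xml
<!-- Illustrative override; the chosen value is an assumption, not a recommendation. -->
<property>
  <name>http.content.limit</name>
  <!-- Limit in bytes for downloaded content; -1 is assumed to mean "do not truncate". -->
  <value>-1</value>
</property>
```

The property belongs inside the <configuration> element of the file. After changing it, re-run the crawl and check the log for truncation warnings.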
So, you should try increasing the logging level in order to see what is
happening. Have a look at "logs/hadoop.log". These logging statements
are valuable information regarding your problem.
Logging is controlled via "conf/log4j.properties" if you're not
running nutch in a servlet container. (OK, you may still control
logging from the same place, but I think that's hardly ever done.)
The hadoop.log file also shows which plugins are loaded.
btw. you don't need to "mess" around with compilation in order to get
this running. (Just looking at the link . . .)
Hope it helps,
Martin
PS: This kind of question should be asked on the nutch-user list, not
dev. I have reposted this on user.
PPS: I think you should subscribe to the mailing list . . . it's useful,
really ;)
On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
Hello,
I want the crawler to fetch PDF files as well. I set the URL to
http://localhost:8080/ and I have several HTML and PDF files in my
document root.
The crawler is able to fetch the HTML files but not the PDF files.
I saw
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html
In <nutch_home>/nutch-site.xml, I added the following:
---------
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>description</description>
</property>
---------
I installed nutch 0.9 and I see all plugins, including parse-pdf, in the
plugins directory, so I thought I didn't have to do anything else.
It doesn't work. Can you please help?
PS: I am not on any mailing list. Can you please CC me on your replies?
thanks,
Krishna.