Hi,

what comes to mind is that there is a setting for the maximum size of a downloaded file. Have a look at "nutch-default.xml" and override it in "nutch-site.xml". PDF files tend to be quite big (compared to HTML), so this is probably the source of your problem. PDF files are downloaded and may get truncated; however, the PDF parser cannot handle truncated PDF files (truncated HTML files are okay). If that's the case, you should see a warning in the log file.
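Such an override could look roughly like the fragment below. This is only a sketch: it assumes the setting in question is "http.content.limit" (the name I believe nutch-default.xml uses in the 0.9 release), so please check your own nutch-default.xml for the exact name and default. The value is in bytes; the 10 MB figure is an arbitrary example, and a negative value such as -1 should switch truncation off entirely.

```xml
<!-- Goes inside the <configuration> element of nutch-site.xml. -->
<!-- Assumption: the size limit is controlled by http.content.limit;
     verify the name against your nutch-default.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- 10485760 bytes = 10 MB (example value); -1 would mean no limit. -->
  <value>10485760</value>
  <description>Maximum size of content downloaded over HTTP, in bytes.</description>
</property>
```

After changing the value, the affected PDF files have to be fetched again, since the copies that were already downloaded stay truncated.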
So, you should try to increase the logging level in order to see what is happening. Have a look at "logs/hadoop.log"; the logging statements in there are valuable information regarding your problem. Logging is controlled via "conf/log4j.properties" if you're not running Nutch in a servlet container. (OK, you may still control logging from the same place, but I think that's rarely done (?).) In the mentioned hadoop.log file you'll also see which plugins are loaded.

By the way, you don't need to "mess" around with compilation in order to get this running. (Just looking at the link ...)

Hope it helps,
Martin

PS: This kind of question should be asked on the nutch-user list, not dev. I have reposted this on user.

PPS: I think you should subscribe to the mailing list ... it's useful, really ;)

On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I want crawler to fetch pdf files also. I set the url to be
> http://localhost:8080/ and I have several html and pdf files in my
> document root.
>
> crawler is able to fetch html files but not pdf files.
> I saw
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html
>
> In <nutch_home>/nutch-site.xml, I added the following:
> ---------
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>description</description>
> </property>
> ---------
>
> I installed nutch 0.9 and I see all plugins including parse-pdf in
> plugins directory. So thought I don't have to do anything else.
>
> It doesn't work. Can you pls help.
>
> PS: I am not on any mailing list. Can you pls CC me on your replies.
>
> thanks,
> Krishna.
