I am not sure, but I think the maximum PDF size is controlled by this property:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
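For reference, a minimal nutch-site.xml override might look like the sketch below. This is an assumption based on the property names in nutch-default.xml, not something tested against your setup; note that file.content.limit applies to file:// URLs, while http.content.limit (the 65536 default mentioned below) applies to http:// fetches, so for a crawl of http://localhost:8080/ the second one is probably the relevant limit:

```xml
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- -1 disables truncation of downloaded http content -->
    <value>-1</value>
  </property>
  <property>
    <name>file.content.limit</name>
    <!-- same idea for the file:// protocol -->
    <value>-1</value>
  </property>
</configuration>
```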
2008/1/17, Krishnamohan Meduri <[EMAIL PROTECTED]>:
> Hi Martin,
>
> Thanks for the response.
> My pdf file size is much less than the default 65536:
>   <name>http.content.limit</name>
>   <value>65536</value>
>
> Can you suggest anything else?
>
> thanks,
> Krishna.
>
> Martin Kuen wrote:
> > Hi,
> >
> > What comes to my mind is that there is a setting for the maximum size
> > of a downloaded file. Have a look at "nutch-default.xml" and override
> > it in "nutch-site.xml". PDF files tend to be quite big (compared to
> > HTML), so this is probably the source of your problem. PDF files are
> > downloaded and may get truncated; however, the PDF parser cannot
> > handle truncated PDF files (truncated HTML files are okay). If that's
> > the case, you should see a warning in the log file.
> >
> > So, you should try to increase/modify the logging level/settings in
> > order to see what is happening. Have a look at "log/hadoop.log"; these
> > logging statements are valuable information regarding your problem.
> > Logging is controlled via "conf/log4j.properties" if you're not
> > running Nutch in a servlet container (you may still control logging
> > from the same place, but I think that's rarely done). In the mentioned
> > hadoop.log file you'll also see which plugins are loaded.
> >
> > By the way, you don't need to mess around with compilation to get this
> > running (just looking at the link...).
> >
> > Hope it helps,
> >
> > Martin
> >
> > PS: This kind of question should be asked on the nutch-user list, not
> > dev. Reposted this on user.
> > PPS: I think you should subscribe to the mailing list... it's useful,
> > really ;)
> >
> > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I want the crawler to fetch pdf files also. I set the url to
> > http://localhost:8080/ and I have several html and pdf files in my
> > document root.
> >
> > The crawler is able to fetch html files but not pdf files.
> > I saw
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html
> >
> > In <nutch_home>/nutch-site.xml, I added the following:
> > ---------
> > <property>
> >   <name>plugin.includes</name>
> >   <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >   <description>description</description>
> > </property>
> > ---------
> >
> > I installed Nutch 0.9 and I see all plugins, including parse-pdf, in
> > the plugins directory, so I thought I didn't have to do anything else.
> >
> > It doesn't work. Can you please help?
> >
> > PS: I am not on any mailing list. Can you please CC me on your replies.
> >
> > thanks,
> > Krishna.
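One more idea, not from the thread: a quick way to check Martin's truncation theory is to look at the end of a fetched PDF, since a well-formed PDF ends with an %%EOF trailer and a file cut off by http.content.limit usually will not. A minimal POSIX-shell sketch (the two printf lines just create stand-in files for illustration; in practice you would run the tail/grep check on a PDF dumped from the crawl):

```shell
# Stand-ins for a complete and a truncated PDF download
printf '%%PDF-1.4\n...content...\n%%%%EOF\n' > complete.pdf
printf '%%PDF-1.4\n...content...\n' > truncated.pdf

# A well-formed PDF ends with an %%EOF marker near the end of the file;
# a download truncated by a content limit usually does not.
check_pdf() {
  if tail -c 32 "$1" | grep -q '%%EOF'; then
    echo "$1: looks complete"
  else
    echo "$1: possibly truncated"
  fi
}

check_pdf complete.pdf    # looks complete
check_pdf truncated.pdf   # possibly truncated
```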
