Hi Martin,
Thanks so much for your tips.
After I enabled directory listing and created a URL list consisting of
just one URL for a PDF file, it worked.
Now that I got a clue, I may bug you more :-)
Thanks
Krishna.
Martin Kuen wrote:
Hi,
The settings "file.content.limit" and "http.content.limit" are used for
different protocols. If you are crawling an url like
"http://localhost/whatever <http://localhost/whatever>" the http plugin
is used for fetching (as you've already guessed). If you have a url
starting with "file" another plugin is used.
"file.content.limit" is used for crawling a local disk (or if you have a
network drive mounted).
"http.content.limit" is used for content that is fetched via http.
These two settings are not related to the mime-type of downloaded content.
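Since you are fetching from http://localhost:8080/, the one that matters for
your PDFs is "http.content.limit". If a PDF is larger than the default of
65536 bytes, you can override it in "nutch-site.xml" - a rough sketch (per
the description in nutch-default.xml, a negative value means no truncation):
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>No length limit for http content (-1 = no truncation).</description>
</property>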
>2008-01-16 18:38:44,717 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
This is okay since you are on Solaris and the "native" stuff is only
available for Linux.
>2008-01-16 18:38:42,419 WARN regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
The urlnormalizer is controlled via "regex-normalize.xml". This warning
tells you that it will use the patterns found in this file regardless of
the current "scope". You can ignore it, or disable the urlnormalizer
plugins (see the sketch below).
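If you do want to disable them, the usual way is to drop the urlnormalizer
entries from "plugin.includes" in "nutch-site.xml". Taking the value from
your original mail, that would look roughly like this (only the
urlnormalizer-(pass|regex|basic) part removed; keep whatever else you need):
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>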
Idea:
>I want the crawler to fetch pdf files as well. I set the URL to be
>http://localhost:8080/ and I have several html and pdf files in my
>document root.
1.) Do all your pdf files have an in-link, or are they "just there"?
Try to make a seed URL list which consists of just one URL for such a
pdf and see if that pdf is fetched (there is a sketch after point 2 below).
If the pdfs don't have an in-link there is no way to discover them via http
(assuming that directory listings are turned off on your server, which
should be the default).
2.) Is your crawl-depth set deep enough?
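For example (file names, directory, depth and topN here are just
placeholders - adjust them to your setup): put a single line such as
http://localhost:8080/some-document.pdf
into a seed file like urls/seed.txt, then run the crawl with an explicit depth, e.g.
bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50
and check whether that pdf shows up as fetched and parsed in the log.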
One last thing I can recommend to you is to increase the log-level.
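For instance, in "conf/log4j.properties" you could raise the relevant
categories to DEBUG - something along these lines (standard log4j syntax;
the exact logger names in your copy may differ):
log4j.logger.org.apache.nutch.fetcher=DEBUG
log4j.logger.org.apache.nutch.parse=DEBUG
Any truncation warnings from the pdf parser should then be visible in
"log/hadoop.log".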
Best regards,
Martin
On Jan 17, 2008 12:15 PM, Ismael <[EMAIL PROTECTED]> wrote:
I am not sure, but I think the maximum PDF size is controlled by this
property:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
2008/1/17, Krishnamohan Meduri <[EMAIL PROTECTED]>:
> Hi Martin,
>
> Thanks for the response.
> My PDF file size is much less than the default of 65536:
> <name>http.content.limit</name>
> <value>65536</value>
>
> Can you suggest anything else?
>
> thanks,
> Krishna.
>
> Martin Kuen wrote:
> > Hi,
> >
> > what comes to my mind is that there is a setting for the maximum size
> > of a downloaded file.
> > Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
> > pdf files tend to be quite big (compared to html), so this is probably
> > the source of your problem.
> > pdf files are downloaded and may get truncated - however, the pdf parser
> > cannot handle these truncated pdf files. (Truncated html files are okay.)
> > If that's the case you should see a warning in the log file.
> >
> > So, you should try to increase/modify the logging level/settings in
> > order to see what is happening. Have a look at "log/hadoop.log". These
> > logging statements are valuable information regarding your problem.
> > Logging is controlled via "conf/log4j.properties" - if you're not
> > running nutch in a servlet container. (You may still control logging
> > from the same place, but I think that's rarely done.) In the mentioned
> > hadoop.log file you'll also see which plugins are loaded.
> >
> > btw. you don't need to "mess" around with compilation in order to get
> > this running. (Just looking at the link . . .)
> >
> >
> > Hope it helps,
> >
> > Martin
> >
> > PS: This kind of question should be asked on the nutch-user list, not
> > dev. I reposted it on user.
> > PPS: I think you should subscribe to the mailing list . . . it's useful,
> > really ;)
> >
> > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I want the crawler to fetch pdf files as well. I set the URL to be
> > http://localhost:8080/ and I have several html and pdf files in my
> > document root.
> >
> > The crawler is able to fetch html files but not pdf files.
> > I saw
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html
> >
> > In <nutch_home>/nutch-site.xml, I added the following:
> > ---------
> > <property>
> > <name>plugin.includes</name>
> > <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > <description>description</description>
> > </property>
> > ---------
> >
> > I installed nutch 0.9 and I see all plugins, including parse-pdf, in the
> > plugins directory, so I thought I didn't have to do anything else.
> >
> > It doesn't work. Can you please help?
> >
> > PS: I am not on any mailing list. Can you please CC me on your replies?
> >
> > thanks,
> > Krishna.
> >
> >
>