Hi Martin,
Thanks so much for your tips.
After I enabled directory listing and created a URL list consisting of
just one URL for a PDF file, it worked.
Now that I got a clue, I may bug you more :-)
Thanks
Krishna.
Martin Kuen wrote:
Hi,
The settings "file.content.limit" and "http.content.limit" are used for
different protocols. If you are crawling an url like
"http://localhost/whatever <http://localhost/whatever>" the http plugin
is used for fetching (as you've already guessed). If you have a url
starting with "file" another plugin is used.
"file.content.limit" is used for crawling a local disk (or if you have a
network drive mounted).
"http.content.limit" is used for content that is fetched via http.
These two settings are not related to the mime-type of downloaded content.
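Since you are fetching from http://localhost:8080/, the one that matters for
your PDFs is "http.content.limit". If a PDF is larger than the default of
65536 bytes, you can override it in "nutch-site.xml" - a rough sketch (per
the description in nutch-default.xml, a negative value means no truncation):
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>No length limit for http content (-1 = no truncation).</description>
</property>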
>2008-01-16 18:38:44,717 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
This is okay since you are on Solaris and the "native" stuff is only
available for Linux.
>2008-01-16 18:38:42,419 WARN regex.RegexURLNormalizer - can't find
rules for scope 'inject', using default
The urlnormalizer is controlled via "regex-normalize.xml". This warning
tells you that it will use the patterns found in this file regardless of
the current "scope". You can ignore it, or disable the urlnormalizer
plugins (see the sketch below).
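If you do want to disable them, the usual way is to drop the urlnormalizer
entries from "plugin.includes" in "nutch-site.xml". Taking the value from
your original mail, that would look roughly like this (only the
urlnormalizer-(pass|regex|basic) part removed; keep whatever else you need):
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>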
Idea:
>I want the crawler to fetch pdf files as well. I set the URL to be
>http://localhost:8080/ and I have several html and pdf files in my
>document root.
1.) Do all your pdf files have an in-link, or are they "just there"?
Try to make a seed URL list which consists of just one URL for such a
pdf and see if that pdf is fetched (there is a sketch after point 2 below).
If the pdfs don't have an in-link there is no way to discover them via http
(assuming that directory listings are turned off on your server, which
should be the default).
2.) Is your crawl-depth set deep enough?
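For example (file names, directory, depth and topN here are just
placeholders - adjust them to your setup): put a single line such as
http://localhost:8080/some-document.pdf
into a seed file like urls/seed.txt, then run the crawl with an explicit depth, e.g.
bin/nutch crawl urls -dir crawl-test -depth 3 -topN 50
and check whether that pdf shows up as fetched and parsed in the log.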
One last thing I can recommend to you is to increase the log-level.
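For instance, in "conf/log4j.properties" you could raise the relevant
categories to DEBUG - something along these lines (standard log4j syntax;
the exact logger names in your copy may differ):
log4j.logger.org.apache.nutch.fetcher=DEBUG
log4j.logger.org.apache.nutch.parse=DEBUG
Any truncation warnings from the pdf parser should then be visible in
"log/hadoop.log".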
Best regards,
Martin
On Jan 17, 2008 12:15 PM, Ismael <[EMAIL PROTECTED]> wrote:
I am not sure, but I think the maximum PDF size is controlled by this
property:
<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  </description>
</property>
2008/1/17, Krishnamohan Meduri <[EMAIL PROTECTED]>:
> Hi Martin,
>
> Thanks for the response.
> My PDF file size is much less than the default of 65536:
> <name>http.content.limit</name>
> <value>65536</value>
>
> Can you suggest anything else?
>
> thanks,
> Krishna.
>
> Martin Kuen wrote:
> > Hi,
> >
> > what comes to my mind is that there is a setting for the maximum size
> > of a downloaded file.
> > Have a look at "nutch-default.xml" and override it in "nutch-site.xml".
> > pdf files tend to be quite big (compared to html), so this is probably
> > the source of your problem.
> > pdf files are downloaded and may get truncated - however, the pdf parser
> > cannot handle these truncated pdf files. (Truncated html files are okay.)
> > If that's the case you should see a warning in the log file.
> >
> > So, you should try to increase/modify the logging level/settings in
> > order to see what is happening. Have a look at "log/hadoop.log". These
> > logging statements are valuable information regarding your problem.
> > Logging is controlled via "conf/log4j.properties" - if you're not
> > running nutch in a servlet container. (You may still control logging
> > from the same place, but I think that's rarely done.) In the mentioned
> > hadoop.log file you'll also see which plugins are loaded.
> >
> > btw. you don't need to "mess" around with compilation in order to get
> > this running. (Just looking at the link . . .)
> >
> >
> > Hope it helps,
> >
> > Martin
> >
> > PS: This kind of question should be asked on the nutch-user list, not
> > dev. I reposted it on user.
> > PPS: I think you should subscribe to the mailing list . . . it's useful,
> > really ;)
> >
> > On Jan 16, 2008 9:31 PM, Krishnamohan Meduri <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> >
> > I want the crawler to fetch pdf files as well. I set the URL to be
> > http://localhost:8080/ and I have several html and pdf files in my
> > document root.
> >
> > The crawler is able to fetch html files but not pdf files.
> > I saw
> > http://www.mail-archive.com/[EMAIL PROTECTED]/msg00344.html
> >
> > In <nutch_home>/nutch-site.xml, I added the following:
> > ---------
> > <property>
> > <name>plugin.includes</name>
> > <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > <description>description</description>
> > </property>
> > ---------
> >
> > I installed nutch 0.9 and I see all plugins, including parse-pdf, in the
> > plugins directory, so I thought I didn't have to do anything else.
> >
> > It doesn't work. Can you please help?
> >
> > PS: I am not on any mailing list. Can you please CC me on your replies?
> >
> > thanks,
> > Krishna.
> >
> >
>