If you have not set the agent properties in conf/nutch-site.xml, you must
set them:

  http.agent.name
  http.agent.description
  http.agent.url
  http.agent.email

The significance of each property is explained within its <description>
tags. For the time being you can set some dummy values and get started.
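For example, the entries could look something like the snippet below. The
values are only placeholders that I have made up (I kept the agent name
'testing' because that is what your FATAL message says Nutch is
advertising), so replace them with details that identify your crawler:

<property>
  <name>http.agent.name</name>
  <value>testing</value>
  <!-- placeholder: the name your crawler advertises to web servers -->
</property>

<property>
  <name>http.agent.description</name>
  <value>Test crawl of http://localhost:8080/mytest/filestore</value>
  <!-- placeholder: a short description of the crawler -->
</property>

<property>
  <name>http.agent.url</name>
  <value>http://localhost:8080/</value>
  <!-- placeholder: a URL describing the crawler -->
</property>

<property>
  <name>http.agent.email</name>
  <value>you@example.com</value>
  <!-- placeholder: a contact address for webmasters -->
</property>

<property>
  <name>http.robots.agents</name>
  <value>testing,*</value>
  <!-- my guess at a fix for the RobotRulesParser FATAL: the advertised
       agent name must be listed first, before '*' -->
</property>

The last property is only a guess based on the error you quoted: the
RobotRulesParser message complains that the agent you advertise is not
listed first in 'http.robots.agents', so listing your agent name first,
followed by '*', should stop that FATAL.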

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> I do indeed see a fatal error stating :-
>
> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
> first in 'http.robots.agents' property!
>
> Obviously this seems critical - the tutorial
> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in
> much detail - are the values of significance ?
>
> Thanks !
>
> Susam Pal wrote:
> > Have you set the agent properties in 'conf/nutch-site.xml'? Please
> > check 'logs/hadoop.log' and search for the following words without the
> > single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> >
> > Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> > attempted to fetch any URLs you were expecting.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> >> Hope someone can help. I'd like to index and search only a single
> >> directory of my website. Doesn't work so far (both building the index
> >> and consequent searches). Here's my config :-
> >>
> >> Url of files to index : http://localhost:8080/mytest/filestore
> >>
> >> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> >> urls/mytest that contains just this entry :-
> >>
> >> http://localhost:8080/mytest/filestore
> >>
> >> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> >> to be parsed) :-
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>-1</value>
> >>   <description>The length limit for downloaded content, in bytes.
> >>   If this value is nonnegative (>=0), content longer than it will be
> >>   truncated; otherwise, no truncation at all.
> >>   </description>
> >> </property>
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>   <description>Regular expression naming plugin directory names to
> >>   include. Any plugin not matching this expression is excluded.
> >>   In any case you need at least include the nutch-extensionpoints
> >>   plugin. By default Nutch includes crawling just HTML and plain text
> >>   via HTTP, and basic indexing and search plugins. In order to use
> >>   HTTPS please enable protocol-httpclient, but be aware of possible
> >>   intermittent problems with the underlying commons-httpclient library.
> >>   </description>
> >> </property>
> >>
> >> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> >> added this line for my domain :-
> >>
> >> +^http://([a-z0-9]*\.)*localhost:8080/
> >>
> >> The filestore directory contains lots of pdfs but executing :-
> >>
> >> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> >> the 0.8 tutorial) does not index the files.
> >>
> >> Any help much appreciated !
>
> --
> Gareth Gale
> Hewlett-Packard Laboratories, Bristol
> United Kingdom
> e: [EMAIL PROTECTED]
> t: +44 (117) 3129606
>
> Hewlett-Packard Limited registered Office: Cain Road, Bracknell, Berks
> RG12 1HN
> Registered No: 690597 England
