I do indeed see a fatal error stating :-

FATAL api.RobotRulesParser - Agent we advertise (testing) not listed first in 'http.robots.agents' property!

Obviously this seems critical - the tutorial (http://lucene.apache.org/nutch/tutorial8.html) mentions these properties but not in much detail. Are the values significant?
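
I do have http.agent.name set to 'testing' in conf/nutch-site.xml. Am I right in guessing that I also need to list that same name first in http.robots.agents, along the lines of the following? (This is only my guess from the error message; the property name comes from nutch-default.xml.)

<property>
   <name>http.robots.agents</name>
   <value>testing,*</value>
   <description>The agent strings we'll look for in robots.txt files,
   comma-separated, in decreasing order of precedence. The agent we
   advertise ('testing') has to come first.
   </description>
</property>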

Thanks!

Susam Pal wrote:
Have you set the agent properties in 'conf/nutch-site.xml'? Please
check 'logs/hadoop.log' and search for the following words (without the
single quotes): 'fetch', 'ERROR', 'FATAL'. Do they give you any clues?

Also search for 'fetching' in 'logs/hadoop.log' to see whether it
attempted to fetch any URLs you were expecting.
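
For example, from the nutch directory, something like this should do
(adjust the path if your logs live elsewhere):

grep FATAL logs/hadoop.log
grep ERROR logs/hadoop.log
grep fetching logs/hadoop.log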

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
Hope someone can help. I'd like to index and search only a single
directory of my website. It doesn't work so far (neither building the
index nor the subsequent searches). Here's my config :-

URL of the files to index: http://localhost:8080/mytest/filestore

a) Under the nutch root directory (i.e. ~/nutch), I created a file
urls/mytest that contains just this entry :-

http://localhost:8080/mytest/filestore

b) Edited conf/nutch-site.xml to add these extra entries (with pdf
included so it gets parsed) :-

<property>
   <name>http.content.limit</name>
   <value>-1</value>
   <description>The length limit for downloaded content, in bytes.
   If this value is nonnegative (>=0), content longer than it will be
   truncated; otherwise, no truncation at all.
   </description>
</property>

<property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plugin directory names to
   include. Any plugin not matching this expression is excluded.
   In any case you need to include at least the nutch-extensionpoints
   plugin. By default Nutch includes crawling just HTML and plain text
   via HTTP, and basic indexing and search plugins. In order to use
   HTTPS please enable protocol-httpclient, but be aware of possible
   intermittent problems with the underlying commons-httpclient
   library.
   </description>
</property>

c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
added this line for my domain :-

+^http://([a-z0-9]*\.)*localhost:8080/
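
For reference, the relevant part of my conf/crawl-urlfilter.txt now
looks roughly like this (the skip-suffixes line is the default one,
with pdf taken out of the list; '+' lines accept URLs, '-' lines
reject them):

# skip image and other suffixes we can't parse (pdf removed)
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|exe|bmp|BMP)$

# accept hosts in my domain
+^http://([a-z0-9]*\.)*localhost:8080/

# skip everything else
-.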

The filestore directory contains lots of pdfs, but executing :-

~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50

(the command is taken from the 0.8 tutorial) does not index the files.
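
In case it helps with diagnosis, I believe the crawldb statistics can
be dumped with something like the following, which should show how
many URLs were actually fetched:

~/nutch/bin/nutch readdb crawl/crawldb -stats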

Any help much appreciated!




--
Gareth Gale
Hewlett-Packard Laboratories, Bristol
United Kingdom
e: [EMAIL PROTECTED]
t: +44 (117) 3129606
