Sorry, I should have been clearer. Those properties are set, albeit
with placeholder values. Here's my nutch-site.xml file in full :-
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.
    </description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need to at least include the nutch-extensionpoints
    plugin. By default Nutch includes crawling just HTML and plain text
    via HTTP, and basic indexing and search plugins. In order to use
    HTTPS please enable protocol-httpclient, but be aware of possible
    intermittent problems with the underlying commons-httpclient library.
    </description>
  </property>
</configuration>
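Incidentally, re the RobotRulesParser FATAL quoted further down - the
error text itself suggests the advertised agent must be listed first in
'http.robots.agents'. Something along these lines might do it (an
untested sketch; the value assumes http.agent.name stays 'testing') :-
  <property>
    <name>http.robots.agents</name>
    <value>testing,*</value>
    <description>Agent strings checked against robots.txt,
    comma-separated, our advertised agent first and '*' last.
    (Sketch only - adjust if http.agent.name changes.)</description>
  </property>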
Susam Pal wrote:
If you have not set the agent properties, you must set them.
http.agent.name
http.agent.description
http.agent.url
http.agent.email
The significance of the properties is explained within the
<description> tags. For the time being you can set some dummy values
and get started.
Regards,
Susam Pal
http://susam.in/
On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
I do indeed see a fatal error stating :-
FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
first in 'http.robots.agents' property!
Obviously this seems critical - the tutorial
(http://lucene.apache.org/nutch/tutorial8.html) mentions this but not
in much detail - do the values themselves matter ?
Thanks !
Susam Pal wrote:
Have you set the agent properties in 'conf/nutch-site.xml'? Please
check 'logs/hadoop.log' and search for the following words (without
the single quotes): 'fetch', 'ERROR', 'FATAL'. Do they give you any clue?
Also search for 'fetching' in 'logs/hadoop.log' to see whether it
attempted to fetch any URLs you were expecting.
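For example, something like this, run from the Nutch root directory
(where the logs directory lives):
  grep -E 'ERROR|FATAL' logs/hadoop.log
  grep fetching logs/hadoop.log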
Regards,
Susam Pal
http://susam.in/
On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
Hope someone can help. I'd like to index and search only a single
directory of my website. It doesn't work so far (neither building the
index nor the subsequent searches). Here's my config :-
URL of files to index : http://localhost:8080/mytest/filestore
a) Under the nutch root directory (i.e. ~/nutch), I created a file
urls/mytest that contains just this entry :-
http://localhost:8080/mytest/filestore
b) Edited conf/nutch-site.xml to add these extra entries (pdf included
among the parse plugins) :-
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need to at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain text
  via HTTP, and basic indexing and search plugins. In order to use
  HTTPS please enable protocol-httpclient, but be aware of possible
  intermittent problems with the underlying commons-httpclient library.
  </description>
</property>
c) Made sure conf/crawl-urlfilter.txt doesn't skip pdf files and
added this line for my domain :-
+^http://([a-z0-9]*\.)*localhost:8080/
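For reference, the relevant part of that file now looks roughly like
this (the suffix list is abbreviated here; the point is that pdf is
not in it, and rules are matched top-down) :-
  # skip image and other suffixes we can't parse (no pdf here)
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|exe)$
  # accept hosts in my domain
  +^http://([a-z0-9]*\.)*localhost:8080/
  # skip everything else
  -.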
The filestore directory contains lots of PDFs, but executing :-
~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
(taken from the 0.8 tutorial) does not index the files.
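If it helps diagnose, the crawldb stats can be dumped with the readdb
tool to see whether anything was fetched at all (path assumes the -dir
value above) :-
~/nutch/bin/nutch readdb crawl/crawldb -stats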
Any help much appreciated !