Sorry, I should have been clearer. Those properties are set, albeit
with placeholder values. Here's my nutch-site.xml file in full :-
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>testing</value>
    <description>testing</description>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content, in bytes.
    If this value is nonnegative (>=0), content longer than it will be
    truncated; otherwise, no truncation at all.
    </description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need to at least include the nutch-extensionpoints
    plugin. By default Nutch includes crawling just HTML and plain text
    via HTTP, and basic indexing and search plugins. In order to use
    HTTPS please enable protocol-httpclient, but be aware of possible
    intermittent problems with the underlying commons-httpclient library.
    </description>
  </property>
</configuration>
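Incidentally, re the RobotRulesParser FATAL quoted further down - the
error text itself suggests the advertised agent must be listed first in
'http.robots.agents'. Something along these lines might do it (an
untested sketch; the value assumes http.agent.name stays 'testing') :-
  <property>
    <name>http.robots.agents</name>
    <value>testing,*</value>
    <description>Agent strings checked against robots.txt,
    comma-separated, our advertised agent first and '*' last.
    (Sketch only - adjust if http.agent.name changes.)</description>
  </property>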
Susam Pal wrote:
If you have not set the agent properties, you must set them.
http.agent.name
http.agent.description
http.agent.url
http.agent.email
The significance of the properties is explained within the
<description> tags. For the time being you can set some dummy values
and get started.
Regards,
Susam Pal
http://susam.in/
On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
I do indeed see a fatal error stating :-
FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
first in 'http.robots.agents' property!
Obviously this seems critical - the tutorial
(http://lucene.apache.org/nutch/tutorial8.html) mentions this but not
in much detail - do the values themselves matter ?
Thanks !
Susam Pal wrote:
Have you set the agent properties in 'conf/nutch-site.xml'? Please
check 'logs/hadoop.log' and search for the following words (without
the single quotes): 'fetch', 'ERROR', 'FATAL'. Do they give you any clue?
Also search for 'fetching' in 'logs/hadoop.log' to see whether it
attempted to fetch any URLs you were expecting.
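For example, something like this, run from the Nutch root directory
(where the logs directory lives):
  grep -E 'ERROR|FATAL' logs/hadoop.log
  grep fetching logs/hadoop.log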
Regards,
Susam Pal
http://susam.in/
On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
Hope someone can help. I'd like to index and search only a single
directory of my website. It doesn't work so far (neither building the
index nor the subsequent searches). Here's my config :-
URL of files to index : http://localhost:8080/mytest/filestore
a) Under the nutch root directory (i.e. ~/nutch), I created a file
urls/mytest that contains just this entry :-
http://localhost:8080/mytest/filestore
b) Edited conf/nutch-site.xml to add these extra entries (pdf included
among the parse plugins) :-
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.
  </description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need to at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain text
  via HTTP, and basic indexing and search plugins. In order to use
  HTTPS please enable protocol-httpclient, but be aware of possible
  intermittent problems with the underlying commons-httpclient library.
  </description>
</property>
c) Made sure conf/crawl-urlfilter.txt doesn't skip pdf files and
added this line for my domain :-
+^http://([a-z0-9]*\.)*localhost:8080/
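For reference, the relevant part of that file now looks roughly like
this (the suffix list is abbreviated here; the point is that pdf is
not in it, and rules are matched top-down) :-
  # skip image and other suffixes we can't parse (no pdf here)
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|zip|gz|exe)$
  # accept hosts in my domain
  +^http://([a-z0-9]*\.)*localhost:8080/
  # skip everything else
  -.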
The filestore directory contains lots of PDFs, but executing :-
~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50
(taken from the 0.8 tutorial) does not index the files.
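If it helps diagnose, the crawldb stats can be dumped with the readdb
tool to see whether anything was fetched at all (path assumes the -dir
value above) :-
~/nutch/bin/nutch readdb crawl/crawldb -stats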
Any help much appreciated !