If you have not set the agent properties in conf/nutch-site.xml, you must
set them:

  http.agent.name
  http.agent.description
  http.agent.url
  http.agent.email

The significance of each property is explained within its <description>
tags. For the time being you can set some dummy values and get started.
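For example, the entries could look something like the snippet below. The
values are only placeholders that I have made up (I kept the agent name
'testing' because that is what your FATAL message says Nutch is
advertising), so replace them with details that identify your crawler:

<property>
  <name>http.agent.name</name>
  <value>testing</value>
  <!-- placeholder: the name your crawler advertises to web servers -->
</property>

<property>
  <name>http.agent.description</name>
  <value>Test crawl of http://localhost:8080/mytest/filestore</value>
  <!-- placeholder: a short description of the crawler -->
</property>

<property>
  <name>http.agent.url</name>
  <value>http://localhost:8080/</value>
  <!-- placeholder: a URL describing the crawler -->
</property>

<property>
  <name>http.agent.email</name>
  <value>you@example.com</value>
  <!-- placeholder: a contact address for webmasters -->
</property>

<property>
  <name>http.robots.agents</name>
  <value>testing,*</value>
  <!-- my guess at a fix for the RobotRulesParser FATAL: the advertised
       agent name must be listed first, before '*' -->
</property>

The last property is only a guess based on the error you quoted: the
RobotRulesParser message complains that the agent you advertise is not
listed first in 'http.robots.agents', so listing your agent name first,
followed by '*', should stop that FATAL.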

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> I do indeed see a fatal error stating :-
>
> FATAL api.RobotRulesParser - Agent we advertise (testing) not listed
> first in 'http.robots.agents' property!
>
> Obviously this seems critical - the tutorial
> (http://lucene.apache.org/nutch/tutorial8.html) mentions this but not in
> much detail - are the values of significance ?
>
> Thanks !
>
> Susam Pal wrote:
> > Have you set the agent properties in 'conf/nutch-site.xml'? Please
> > check 'logs/hadoop.log' and search for the following words without the
> > single quotes, 'fetch', 'ERROR', 'FATAL'. Do you get any clue?
> >
> > Also search for 'fetching' in 'logs/hadoop.log' to see whether it
> > attempted to fetch any URLs you were expecting.
> >
> > Regards,
> > Susam Pal
> > http://susam.in/
> >
> > On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> >> Hope someone can help. I'd like to index and search only a single
> >> directory of my website. Doesn't work so far (both building the index
> >> and consequent searches). Here's my config :-
> >>
> >> Url of files to index : http://localhost:8080/mytest/filestore
> >>
> >> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> >> urls/mytest that contains just this entry :-
> >>
> >> http://localhost:8080/mytest/filestore
> >>
> >> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> >> to be parsed) :-
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>-1</value>
> >>   <description>The length limit for downloaded content, in bytes.
> >>   If this value is nonnegative (>=0), content longer than it will be
> >>   truncated; otherwise, no truncation at all.
> >>   </description>
> >> </property>
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>   <description>Regular expression naming plugin directory names to
> >>   include. Any plugin not matching this expression is excluded.
> >>   In any case you need at least include the nutch-extensionpoints
> >>   plugin. By default Nutch includes crawling just HTML and plain text
> >>   via HTTP, and basic indexing and search plugins. In order to use
> >>   HTTPS please enable protocol-httpclient, but be aware of possible
> >>   intermittent problems with the underlying commons-httpclient library.
> >>   </description>
> >> </property>
> >>
> >> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> >> added this line for my domain :-
> >>
> >> +^http://([a-z0-9]*\.)*localhost:8080/
> >>
> >> The filestore directory contains lots of pdfs but executing :-
> >>
> >> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> >> the 0.8 tutorial) does not index the files.
> >>
> >> Any help much appreciated !
>
> --
> Gareth Gale
> Hewlett-Packard Laboratories, Bristol
> United Kingdom
> e: [EMAIL PROTECTED]
> t: +44 (117) 3129606
>
> Hewlett-Packard Limited registered Office: Cain Road, Bracknell, Berks
> RG12 1HN
> Registered No: 690597 England
