Have you set the agent properties in 'conf/nutch-site.xml'? Please check 'logs/hadoop.log' and search for the following words (without the single quotes): 'fetch', 'ERROR', 'FATAL'. Does anything there give you a clue?
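A minimal sketch of that log search, assuming a POSIX shell with grep and that you run it from the Nutch root directory. The function name scan_nutch_log is only an illustrative helper, not part of Nutch:

```shell
# Hypothetical helper (not part of Nutch): list log lines containing
# any of the words 'fetch', 'ERROR' or 'FATAL'.
# The log path defaults to logs/hadoop.log, relative to the Nutch root.
scan_nutch_log() {
  log="${1:-logs/hadoop.log}"
  # -n prefixes each match with its line number;
  # -E enables the alternation between the three words.
  # The match is case-sensitive, so 'fetch' also matches 'fetching'.
  grep -n -E 'fetch|ERROR|FATAL' "$log"
}

# Usage, from the Nutch root directory:
#   scan_nutch_log
#   scan_nutch_log /path/to/other/hadoop.log
```

If nothing is printed, none of the three words occur in the log, which would suggest the fetcher never started.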
Also search for 'fetching' in 'logs/hadoop.log' to see whether it attempted to fetch any URLs you were expecting.

Regards,
Susam Pal
http://susam.in/

On 9/28/07, Gareth Gale <[EMAIL PROTECTED]> wrote:
> Hope someone can help. I'd like to index and search only a single
> directory of my website. Doesn't work so far (both building the index
> and consequent searches). Here's my config :-
>
> Url of files to index : http://localhost:8080/mytest/filestore
>
> a) Under the nutch root directory (i.e. ~/nutch), I created a file
> urls/mytest that contains just this entry :-
>
> http://localhost:8080/mytest/filestore
>
> b) Edited conf/nutch-site.xml to have these extra entries (included pdf
> to be parsed) :-
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content, in bytes.
>   If this value is nonnegative (>=0), content longer than it will be
>   truncated;
>   otherwise, no truncation at all.
>   </description>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|htm|js|pdf|msword)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints
>   plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please
>   enable
>   protocol-httpclient, but be aware of possible intermittent problems
>   with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> c) Made sure the conf/crawl-urlfilter.txt didn't skip pdf files and
> added this line for my domain :-
>
> +^http://([a-z0-9]*\.)*localhost:8080/
>
> The filestore directory contains lots of pdfs but executing :-
>
> ~/nutch/bin/nutch crawl urls -dir crawl -depth 3 -topN 50 (taken from
> the 0.8 tutorial) does not index the files.
>
> Any help much appreciated !
