My subject should've been clearer, e.g. it should've read "Nutch 1.1 nightly build crawl issue".
Also, I did verify that Nutch 1.0 successfully crawls the same javadoc html file. The resulting index can be inspected with luke-1.0.1 and searched from the command line using bin/nutch org.apache.nutch.searcher.NutchBean java

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
>
> I am trying to bin/nutch crawl a single html file generated by javadoc
> and no links are followed. I verified this with bin/nutch readdb and
> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> seed doc specified is processed.
>
> I searched and reviewed the nutch-user archive and tried several
> different settings but none of the settings appear to have any effect.
>
> I then downloaded maven-2.2.1 so that I could mvn install tika and
> produce tika-app-0.7.jar to command line extract information about the
> html javadoc file. I am not familiar w/ tika but the command line
> version doesn't return any metadata, e.g. no 'src=' links from the html
> 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> nutch uses tika and maybe it's not related . . .
>
> Has anyone crawled javadoc files or have any suggestions? Thanks.
>
> -m.
>
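
For an independent check of which 'src=' links the 'frame' tags actually carry, here is a minimal sketch using only the Python standard library. It is not how Nutch or Tika extract links. The names FrameSrcCollector and frame_sources are made up for illustration, and SAMPLE is a hypothetical stand-in for a javadoc index page, not the actual file from the crawl.

```python
from html.parser import HTMLParser


class FrameSrcCollector(HTMLParser):
    """Collects the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag in ("frame", "iframe"):
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def frame_sources(html_text):
    """Return the frame src values found in html_text, in document order."""
    collector = FrameSrcCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Hypothetical stand-in for a javadoc-style frameset page.
SAMPLE = """<html><frameset cols="20%,80%">
<frameset rows="30%,70%">
<frame src="overview-frame.html" name="packageListFrame">
<frame src="allclasses-frame.html" name="packageFrame">
</frameset>
<frame src="overview-summary.html" name="classFrame">
</frameset></html>"""

if __name__ == "__main__":
    for src in frame_sources(SAMPLE):
        print(src)
```

If the real file yields frame links this way but readlinkdb shows none, the links exist in the page and the difference would lie in how the crawl parses or filters them.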