Hi Matthew,

There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify parse-(html|pdf) in plugin.includes in order to parse both HTML and PDF files.
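
For illustration, the plugin.includes override in nutch-site.xml could look roughly like the fragment below. This is only a sketch: the non-parse plugin ids are assumed from a typical default configuration and should be copied from your own nutch-default.xml, and it assumes a parse-pdf plugin is actually present in the build you are running.

```xml
<!-- nutch-site.xml: illustrative override, not a verified configuration -->
<property>
  <name>plugin.includes</name>
  <!-- The value is a regular expression matched against plugin ids.
       Only the parse-(html|pdf) part comes from this thread; the other
       ids are assumptions, keep whatever your nutch-default.xml lists. -->
  <value>protocol-http|urlfilter-regex|parse-(html|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

Since the value is a regular expression over plugin ids, the order of the alternatives inside the parentheses should not by itself decide which parser handles a given content type.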

Could you please open an issue in JIRA (https://issues.apache.org/jira/browse/NUTCH) and attach the file you are trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius <mgris...@comcast.net> wrote:

> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I need to also parse pdfs so I need both features, I tried
> parse-(html|tika) to see if html would be selected before tika and
> that did not work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> > using Nutch nightly build nutch-2010-04-27_04-00-28:
> >
> > I am trying to bin/nutch crawl a single html file generated by javadoc
> > and no links are followed. I verified this with bin/nutch readdb and
> > bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> > seed doc specified is processed.
> >
> > I searched and reviewed the nutch-user archive and tried several
> > different settings but none of the settings appear to have any effect.
> >
> > I then downloaded maven-2.2.1 so that I could mvn install tika and
> > produce tika-app-0.7.jar to command line extract information about the
> > html javadoc file. I am not familiar w/ tika but the command line
> > version doesn't return any metadata, e.g. no 'src=' links from the html
> > 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> > nutch uses tika and maybe it's not related . . .
> >
> > Has anyone crawled javadoc files or have any suggestions? Thanks.
> >
> > -m.