If you are using the nightly build, did you make the same change in parse-plugins.xml? Uncomment this mapping:

<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>

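For context, here is a rough sketch of where that mapping would sit once uncommented. I am assuming the file is conf/parse-plugins.xml, that its root element is <parse-plugins>, and that the aliases section already has an entry for parse-html; the layout in your nightly may differ, so treat this as illustrative rather than a copy of the shipped file.

```xml
<!-- Sketch only: surrounding structure is assumed, not copied from the nightly build -->
<parse-plugins>

  <!-- Uncommented so that text/html is routed to the parse-html plugin -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Other mimeType mappings and the aliases section stay as shipped -->

</parse-plugins>
```

Both parsers would still need to be enabled through plugin.includes in nutch-site.xml, for example with the parse-(html|tika) pattern you already tried.
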
Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius <mgris...@comcast.net> wrote:
> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I need to also parse pdfs so I need both features, I tried
> parse-(html|tika) to see if html would be selected before tika and
> that did not work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
>> using Nutch nightly build nutch-2010-04-27_04-00-28:
>>
>> I am trying to bin/nutch crawl a single html file generated by javadoc
>> and no links are followed. I verified this with bin/nutch readdb and
>> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
>> seed doc specified is processed.
>>
>> I searched and reviewed the nutch-user archive and tried several
>> different settings but none of the settings appear to have any effect.
>>
>> I then downloaded maven-2.2.1 so that I could mvn install tika and
>> produce tika-app-0.7.jar to command line extract information about the
>> html javadoc file. I am not familiar w/ tika but the command line
>> version doesn't return any metadata, e.g. no 'src=' links from the html
>> 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
>> nutch uses tika and maybe it's not related . . .
>>
>> Has anyone crawled javadoc files or have any suggestions? Thanks.
>>
>> -m.
>>
>

--
Regards,
Arpit Khurdiya