Re: nutch crawl issue

matthew a. grisius Thu, 29 Apr 2010 09:04:12 -0700

In nutch-site.xml I modified plugin.includes:

parse-(html) works
parse-(tika) does not


I also need to parse PDFs, so I need both plugins. I tried
parse-(html|tika) to see whether html would be selected before tika,
but that did not work.
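
For reference, a minimal sketch of the kind of plugin.includes override
described above. Only the parse-(html|tika) part comes from this message;
the other plugin names in the value are illustrative assumptions and
should be copied from the plugin.includes entry in your own
nutch-default.xml instead.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-(html|tika) is the combination tried in this message.
         Every other entry below is an assumed placeholder, not taken
         from the original post. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin ids, so
parse-(html|tika) enables both parse-html and parse-tika. Listing html
first in the alternation probably does not give it priority for HTML
documents, which would be consistent with the result reported above.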

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
> 
> I am trying to bin/nutch crawl a single html file generated by javadoc
> and no links are followed. I verified this with bin/nutch readdb and
> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> seed doc specified is processed.
> 
> I searched and reviewed the nutch-user archive and tried several
> different settings but none of the settings appear to have any effect.
> 
> I then downloaded maven-2.2.1 so that I could mvn install tika and
> produce tika-app-0.7.jar to command line extract information about the
> html javadoc file. I am not familiar w/ tika but the command line
> version doesn't return any metadata, e.g. no 'src=' links from the html
> 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> nutch uses tika and maybe it's not related . . .
> 
> Has anyone crawled javadoc files or have any suggestions? Thanks.
> 
> -m.
>
