Re: nutch crawl issue

Julien Nioche Thu, 29 Apr 2010 10:36:58 -0700

Hi Matthew,

There is an open issue with Tika (see
https://issues.apache.org/jira/browse/TIKA-379) that could explain the
differences between parse-html and parse-tika. Note that you can specify
parse-(html|pdf) in plugin.includes in order to get both HTML and PDF files.
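
As a rough sketch, the property in nutch-site.xml could look something like
the fragment below. Only parse-(html|pdf) comes from the suggestion above;
the other plugin names are placeholders assumed for illustration, so keep
whatever your current plugin.includes value already lists.

```xml
<!-- nutch-site.xml: illustrative sketch, not a complete configuration -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- parse-(html|pdf) selects the HTML and PDF parsers; the other
         entries are assumed placeholders, not a recommended set -->
    <value>protocol-http|urlfilter-regex|parse-(html|pdf)|index-basic</value>
  </property>
</configuration>
```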


Could you please open an issue in JIRA
(https://issues.apache.org/jira/browse/NUTCH) and attach the file you are
trying to process? I'll have a look and see if it is related to TIKA-379.

Thanks

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

On 29 April 2010 17:02, matthew a. grisius <mgris...@comcast.net> wrote:

> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I also need to parse PDFs, so I need both features. I tried
> parse-(html|tika) to see if html would be selected before tika, and that
> did not work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> > using Nutch nightly build nutch-2010-04-27_04-00-28:
> >
> > I am trying to bin/nutch crawl a single html file generated by javadoc
> > and no links are followed. I verified this with bin/nutch readdb and
> > bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> > seed doc specified is processed.
> >
> > I searched and reviewed the nutch-user archive and tried several
> > different settings but none of the settings appear to have any effect.
> >
> > I then downloaded maven-2.2.1 so that I could mvn install tika and
> > produce tika-app-0.7.jar to command line extract information about the
> > html javadoc file. I am not familiar w/ tika but the command line
> > version doesn't return any metadata, e.g. no 'src=' links from the html
> > 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> > nutch uses tika and maybe it's not related . . .
> >
> > Has anyone crawled javadoc files or have any suggestions? Thanks.
> >
> > -m.
> >
>
>
