Re: nutch crawl issue

arpit khurdiya Thu, 29 Apr 2010 09:28:10 -0700

If you are using the nightly build, did you make the same change in
conf/parse-plugins.xml? Uncomment this:
 <mimeType name="text/html">
        <plugin id="parse-html" />
 </mimeType>
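
For context, here is a rough sketch of how the two settings might fit together. It is not a verified config. The plugin.includes value below is only illustrative (keep whatever other plugins your own nutch-site.xml already lists), and the application/pdf mapping to parse-tika is my assumption about what you want for PDFs.

```xml
<!-- nutch-site.xml: enable both parsers (the other plugin ids are placeholders) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic</value>
</property>
```

```xml
<!-- conf/parse-plugins.xml: route HTML to parse-html, PDFs to parse-tika (assumed) -->
<mimeType name="text/html">
  <plugin id="parse-html" />
</mimeType>
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>
```

The idea is that plugin.includes only decides which parser plugins are loaded, while parse-plugins.xml decides which loaded parser handles a given MIME type, so both files probably need to agree.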


Hopefully this helps.

On Thu, Apr 29, 2010 at 9:32 PM, matthew a. grisius
<mgris...@comcast.net> wrote:
> in nutch-site.xml I modified plugin.includes
>
> parse-(html) works
> parse-(tika) does not
>
> I also need to parse pdfs, so I need both features. I tried
> parse-(html|tika) to see if html would be selected before tika, and
> that did not work.
>
> On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
>> using Nutch nightly build nutch-2010-04-27_04-00-28:
>>
>> I am trying to bin/nutch crawl a single html file generated by javadoc
>> and no links are followed. I verified this with bin/nutch readdb and
>> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
>> seed doc specified is processed.
>>
>> I searched and reviewed the nutch-user archive and tried several
>> different settings but none of the settings appear to have any effect.
>>
>> I then downloaded maven-2.2.1 so that I could mvn install tika and
>> produce tika-app-0.7.jar to extract information about the html javadoc
>> file from the command line. I am not familiar with tika, but the
>> command line version doesn't return any metadata, e.g. no 'src=' links
>> from the html 'frame' tags. Perhaps I'm using it incorrectly, and I am
>> not sure how nutch uses tika; maybe it's not related . . .
>>
>> Has anyone crawled javadoc files or have any suggestions? Thanks.
>>
>> -m.
>>
>
>



-- 
Regards,
Arpit Khurdiya
