Hi,

I think the Nutch page is not up to date. Nutch does have plugins for
parsing non-HTML content such as Word, RTF, and PDF. A few people have
reported the parsing stage hanging while PDF files are being parsed. I
have run into this myself, and it occurs randomly. If you don't find
anything else, you can try to investigate this issue and bless
everyone with a solution :-)
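
As a starting point for that investigation, here is a minimal sketch of
one common way to contain a hanging parse: run it on a worker thread and
give up after a deadline. This is not Nutch code. The class name
ParseWithTimeout, the method runWithTimeout, and the fallback value are
all hypothetical, and the sketch assumes a modern JDK (lambdas) and that
the parser reacts to thread interruption, which a real PDF parser may
not do.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch: bounds how long a parse may run.
public class ParseWithTimeout {

    /**
     * Runs the task on a worker thread and returns its result, or the
     * fallback if the task fails or does not finish within timeoutMillis.
     */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the parser honours interrupts.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parse itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A parse that finishes quickly.
        String ok = runWithTimeout(() -> "parsed text", 1000, "SKIPPED");
        // A simulated hang: sleeps far longer than the 200 ms deadline.
        String hung = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 200, "SKIPPED");
        System.out.println(ok);
        System.out.println(hung);
    }
}
```

Running main prints "parsed text" and then "SKIPPED". A timeout like
this only hides the symptom so the crawl can continue; finding out why
the PDF parser hangs would still be the real fix.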

Good luck,
Praveen.


On Wed, 23 Feb 2005 17:28:33 -0300, Leonardo Barbosa
<[EMAIL PROTECTED]> wrote:
> Hi,
> 
> My name is Leonardo Barbosa, I'm from Brazil, and I'm really
> interested in helping the Nutch project.
> I have already read the Lucene in Action book (ok, not the whole
> book, but I'll get there :) because I'm working on a project with it,
> and I started to read Nutch's code and docs yesterday.
> Like Yi Chen, I can start with translation to Portuguese, but I really
> want to code.
> After checking the "How to contribute" section of the developers'
> home page, I tried to find out why Nutch only supports HTML content
> accessed by HTTP. After including 'parse-pdf' in the "plugin.includes"
> property of nutch-default.xml, I used 'bin/nutch crawl' to crawl my
> intranet, and could search PDF contents!
> So, I think I misunderstood something. What should be done about this
> issue? Is the page out of date, or is my brain not working after so
> many coffees at 6 PM? :-)
> 
> Thanks,
> Leonardo Barbosa
> 
> --
> ------------------------------------------------------------------------------------------
> Encumbered forever by desire and ambition
> There's a hunger still unsatisfied
> Our weary eyes still stray to the horizon
> Though down this road we've been so many times
> 
> Pink Floyd (David Gilmour/Polly Samson) - High Hopes
> ------------------------------------------------------------------------------------------
> 


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
