Hi, I think the Nutch page is not up to date. Nutch does have plugins for parsing non-HTML content such as Word, RTF, and PDF. A few people have reported the parsing stage hanging while PDF files are being parsed. I have run into this myself, and it occurs randomly. If you don't find anything else to work on, you could investigate this issue and bless everyone with a solution :-)
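For anyone who picks this up, here is a minimal sketch of one way to keep a hung parse from stalling the whole run: execute the parse call on a worker thread and give up after a time limit. It assumes Java 8 or later and uses only java.util.concurrent. `ParseWithTimeout` and `runWithTimeout` are hypothetical names, not part of Nutch, and the lambdas in `main` merely stand in for a real parser call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Nutch: runs a parse task with a time limit. */
public class ParseWithTimeout {

    /** Returns the task's result, or null if it does not finish within timeoutMillis. */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a parser that ignores interrupts keeps running.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a well-behaved parse call.
        String fast = runWithTimeout(() -> "parsed text", 1000);
        // Stand-in for a parser that hangs on a bad PDF.
        String stuck = runWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println("fast=" + fast + ", stuck=" + stuck);
    }
}
```

This only stops the caller from waiting. It does not explain why the PDF parser hangs, so logging which document timed out would be the natural next step.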
Good luck,
Praveen.

On Wed, 23 Feb 2005 17:28:33 -0300, Leonardo Barbosa <[EMAIL PROTECTED]> wrote:
> Hi,
>
> My name is Leonardo Barbosa, I'm from Brasil, and I'm really
> interested in helping Nutch project.
> I already read the Lucene in Action book (ok, not the whole book, but
> I'll get there :) because I'm working in a project with it, and I
> started to read nutch's code and docs yesterday.
> Like Yi Chen, I can start with translation to portuguese, but I really
> want to code.
> After checking at "How to contribute" developers home page, I tried to
> find why nutch only supports HTML content accessed by HTTP. After
> including 'parse-pdf' in the nutch-default.xml's "plugin.includes"
> property, I used ' bin/nutch crawl ' to crawl my intranet, and could
> search for pdf contents!
> So, I think I misunderstood something. What should be done in this
> issue? Is this out of date or my brain isn't working after so many
> coffees as 6 PM ? :-)
>
> Thanks,
> Leonardo Barbosa
>
> --
> ------------------------------------------------------------------------------------------
> Encumbered forever by desire and ambition
> There's a hunger still unsatisfied
> Our weary eyes still stray to the horizon
> Though down this road we've been so many times
>
> Pink Floyd (David Gilmour/Polly Samson) - High Hopes
> ------------------------------------------------------------------------------------------
>
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nutch-developers mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-developers
