Hi,

My name is Leonardo Barbosa, I'm from Brazil, and I'm really interested in helping the Nutch project. I have already read the Lucene in Action book (ok, not the whole book, but I'll get there :) because I'm working on a project that uses it, and I started reading Nutch's code and docs yesterday. Like Yi Chen, I can start with translation to Portuguese, but I really want to code.

After checking the "How to contribute" section of the developers' home page, I tried to find out why Nutch only supports HTML content accessed over HTTP. After including 'parse-pdf' in the "plugin.includes" property in nutch-default.xml, I used 'bin/nutch crawl' to crawl my intranet, and I could search PDF contents!

So I think I misunderstood something. What should be done for this issue? Is it out of date, or is my brain just not working after so many coffees at 6 PM? :-)
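For reference, the 'parse-pdf' change described above would look roughly like the sketch below. Only the property name "plugin.includes" and the 'parse-pdf' plugin come from this message; the other plugin names in the value are assumptions for illustration and may not match the actual default in nutch-default.xml.

```xml
<!-- Sketch only: apart from parse-pdf, the plugin names in <value> are assumed -->
<property>
  <name>plugin.includes</name>
  <!-- "pdf" added to the parse-(...) group so that the parse-pdf plugin is loaded -->
  <value>protocol-http|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
</property>
```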
Thanks,
Leonardo Barbosa

-- 
Encumbered forever by desire and ambition
There's a hunger still unsatisfied
Our weary eyes still stray to the horizon
Though down this road we've been so many times
Pink Floyd (David Gilmour/Polly Samson) - High Hopes
