Hi,

My name is Leonardo Barbosa, I'm from Brazil, and I'm really
interested in helping the Nutch project.
I have already read the Lucene in Action book (OK, not the whole book,
but I'll get there :) because I'm working on a project with it, and I
started to read Nutch's code and docs yesterday.
Like Yi Chen, I can start with translation to Portuguese, but I really
want to code.
After checking "How to contribute" on the developers home page, I tried
to find out why Nutch only supports HTML content accessed by HTTP. After
including 'parse-pdf' in the "plugin.includes" property of
nutch-default.xml, I used 'bin/nutch crawl' to crawl my intranet, and
could search PDF contents!
So, I think I misunderstood something. What should be done on this
issue? Is it out of date, or is my brain not working after so many
coffees at 6 PM? :-)

Thanks,
Leonardo Barbosa
 
-- 
------------------------------------------------------------------------------------------
Encumbered forever by desire and ambition
There's a hunger still unsatisfied
Our weary eyes still stray to the horizon
Though down this road we've been so many times

Pink Floyd (David Gilmour/Polly Samson) - High Hopes
------------------------------------------------------------------------------------------


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers