In nutch-default.xml, include the plugins for Word and PDF as below:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>

However, the recommendation is to put this property in nutch-site.xml, which overrides nutch-default.xml, rather than editing the defaults file directly.
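If you want to sanity-check which plugin directories a plugin.includes value would pick up before crawling, a rough sketch like the one below may help. It is not part of Nutch; it assumes the expression has to match the whole plugin directory name, and the helper name is_included and the sample directory names are made up for illustration.

```python
import re

# plugin.includes value with the Word and PDF parsers added.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)"
    "|index-basic|query-(basic|site|url|jobs)"
)


def is_included(plugin_dir: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole directory name matches the pattern.

    Assumption: the expression is matched against the full directory
    name, not just a prefix or substring of it.
    """
    return re.fullmatch(pattern, plugin_dir) is not None


if __name__ == "__main__":
    # Hypothetical directory names, used only to illustrate the check.
    for name in ["parse-pdf", "parse-msword", "parse-html",
                 "parse-rtf", "protocol-ftp"]:
        status = "included" if is_included(name) else "excluded"
        print(f"{name}: {status}")
```

Under that assumption, parse-pdf and parse-msword come out as included, while a name that is not listed in the expression (parse-rtf in this sketch) is excluded.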
Hope this helps.

Michael Ji <[EMAIL PROTECTED]> wrote:
hi there,
Is there any specific setting that needs to be added to the configuration file in order to crawl and index PDF and Word files?
thanks,
Michael

Sudhi Seshachala
http://sudhilogs.blogspot.com/
