Hi Sudhendra,

I used the same configuration you suggested, placed in nutch-site.xml.
I ran a test and, after looking at the fetch log, found the following error message:

"fetch okay, but can't parse http://www.ucis.pitt.edu/cwes/papers/work_papers/wp6_2005.pdf, reason: failed(2,203): Content-Type not text/html: application/pdf"

Does that mean the PDF was downloaded but not parsed successfully? And does it follow that we can't search for words inside the PDF file directly?

Thanks,
Michael

By the way, I am using Nutch 0.7 for testing.

--- sudhendra seshachala <[EMAIL PROTECTED]> wrote:

> In nutch-default.xml, include the plugins for Word and PDF as below.
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins.
>   </description>
> </property>
>
> But the recommendation is to include the property in nutch-site.xml.
>
> Hope this helps.
>
> Michael Ji <[EMAIL PROTECTED]> wrote:
> > Hi there,
> >
> > Are there any specific settings that need to be added to the
> > configuration file in order to crawl and index PDF and Word files?
> >
> > Thanks,
> > Michael
>
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
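
For reference, the override discussed in this thread might look roughly like the sketch below when placed in nutch-site.xml. This is only an illustration under stated assumptions: the <nutch-conf> root element is assumed for Nutch 0.7 (copy whatever root element your own nutch-default.xml uses), the plugin list is taken from the quoted message (including "jobs", which may be specific to that installation), and the parse group is written with single "|" separators, i.e. parse-(text|html|msword|pdf).

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: site-specific overrides of nutch-default.xml. -->
<!-- Assumption: nutch-conf is the root element in Nutch 0.7;
     match the root element of your own nutch-default.xml. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <!-- Plugin list as quoted in the thread; msword and pdf are added
         to the parse group so Word and PDF content gets a parser. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  </property>
</nutch-conf>
```

Pages fetched before the change would presumably need to be fetched and parsed again before their PDF text could show up in the index.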
