Hi there,

I am having trouble configuring Nutch 0.7 so that it crawls and searches MS Word and PDF files correctly.
1. I added the following lines to nutch-site.xml:

    <!-- plugin properties -->
    <property>
      <name>plugin.includes</name>
      <value>
        nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword)|
        index-(basic|pdf|msword)|
        query-(basic|site|url|pdf|msword)
      </value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins.
      </description>
    </property>

2. I checked regex-urlfilter.txt and confirmed that it does not exclude PDF or MS Word files.

3. I checked mime-types.xml; the entries for PDF and MS Word are all there.

4. I checked the Nutch fetch log; the PDF and MS Word parse plugins are picked up correctly, as follows:

    060327 204736 parsing: C:\cygwin\jifeng\versionControl\new_dev\nutch_V07_CNI_Alfa\nutch\build\plugins\parse-msword\plugin.xml
    060327 204736 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.msword.MSWordParser
    060327 204736 parsing: C:\cygwin\jifeng\versionControl\new_dev\nutch_V07_CNI_Alfa\nutch\build\plugins\parse-pdf\plugin.xml
    060327 204736 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.pdf.PdfParser

I wonder whether I am still missing something in the configuration.

Thanks,
Michael
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
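
One thing that may be worth ruling out is the whitespace inside the plugin.includes value from step 1. The property description says the value is a regular expression matched against plugin directory names, so the sketch below tests a few directory names against the value with the spaces between alternatives kept, and against the same value with all whitespace removed. This is only a standalone illustration: the class PluginIncludesCheck and the method isIncluded are hypothetical names, and it assumes a whole-name regex match with no trimming, which has not been checked against what Nutch 0.7 actually does when it reads the property.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // plugin.includes as posted in step 1, keeping the spaces between alternatives.
    static final String AS_POSTED =
        "nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword)|"
        + " index-(basic|pdf|msword)| query-(basic|site|url|pdf|msword)";

    // The same expression with every whitespace character removed.
    static final String COMPACT = AS_POSTED.replaceAll("\\s+", "");

    // Assumption (not verified against Nutch 0.7): a plugin directory is included
    // only when its entire name matches the expression.
    static boolean isIncluded(String includes, String pluginDir) {
        return Pattern.compile(includes).matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] dirs = {"parse-pdf", "parse-msword", "index-basic", "query-basic"};
        for (String dir : dirs) {
            System.out.println(dir
                + "  as posted: " + isIncluded(AS_POSTED, dir)
                + "  compact: " + isIncluded(COMPACT, dir));
        }
    }
}
```

Under that assumption the parse-* names match either way, which would be consistent with the log in step 4, while the index-* and query-* names match only the compact form. If Nutch normalizes the value before compiling it, the whitespace is harmless and the cause lies elsewhere.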
