Hi,

I have been experimenting with the 3.0-alpha1 version of Jakarta POI (used by parse-msword and parse-mspowerpoint). Since this version includes the hwpf package, it can parse MS Word documents too (the version currently in the lib-jakarta-poi plugin does not contain this package).

The benefit is that we can remove the poi-2.1 jars bundled with parse-msword and simply add a dependency on the lib-jakarta-poi plugin (as parse-mspowerpoint does), so that only one version of the POI libraries is bundled in Nutch.

I ran tests on a large number of zipped doc files from the 3GPP site (a handy way to test two plugins at the same time), and everything works fine. I have not done many tests on PowerPoint files, but the unit tests pass.

If there are no objections, I will commit the changes by the end of the week.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
