On Mon, 8 Jan 2007, Joerg Hohwiller wrote:
For msword I tried HWPF but the result was really bad.

Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't filter out all the "text" entries that aren't really text, but any patches to fix that would be appreciated :)

For spidering it's usually fine to use, since it rarely matters if a few "bonus" words from some of the special fields slip through.

I have modified the sources so that the constructor can also take a POIFilesystem and not only a File. There are still some bugs. I would fix them, but would I be allowed to create a new release of this stuff and publish it with my project? Or is there a way to submit a patch to textmining.org?

textmining.org belongs to Ryan Ackley, who used to contribute to POI, until he went to work for a company that licenses the file format documentation from Microsoft. You'll need to contact him yourself with any patches.

For powerpoint I tried HSLF, which could not parse most of the documents.

That's odd. I have almost no trouble using org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of powerpoint documents. What problems did you hit?

(Normally you want to catch CorruptPowerPointFileException and EncryptedPowerPointFileException and just skip those files. You should also catch ArrayIndexOutOfBoundsException, and report those as bugs.)
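
Roughly, that skip-and-report handling could look like the sketch below. Note the assumptions: CorruptFileException, EncryptedFileException, TextSource and SpiderExtraction are hypothetical stand-ins declared locally so the sketch compiles without POI on the classpath. In real code you would catch the two HSLF exceptions named above around your PowerPointExtractor call instead.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-ins (assumptions) for the HSLF CorruptPowerPointFileException
// and EncryptedPowerPointFileException; they are NOT the POI classes.
class CorruptFileException extends RuntimeException {
    CorruptFileException(String message) { super(message); }
}

class EncryptedFileException extends RuntimeException {
    EncryptedFileException(String message) { super(message); }
}

// Hypothetical wrapper around whichever extractor the spider uses.
interface TextSource {
    String extract(String path);
}

class SpiderExtraction {
    // Files that were unreadable for an expected reason and simply skipped.
    final List<String> skipped = new ArrayList<>();
    // Files that triggered an unexpected failure worth a bug report.
    final List<String> suspectedBugs = new ArrayList<>();

    // Returns the extracted text, or an empty string if the file was skipped.
    String extractOrSkip(TextSource source, String path) {
        try {
            return source.extract(path);
        } catch (CorruptFileException | EncryptedFileException e) {
            // Expected for damaged or password-protected files: skip quietly.
            skipped.add(path);
            return "";
        } catch (ArrayIndexOutOfBoundsException e) {
            // Probably a parser bug: keep the details so it can be reported.
            suspectedBugs.add(path + ": " + e);
            return "";
        }
    }
}
```

The spider keeps going in both cases; the only difference is that the second list is the one you would turn into bug reports.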

For excel I tried HSSF, which throws an exception for every document I read.

You shouldn't really have any problems with HSSF. There are lots of examples for HSSF; did you follow them?

Nick
