Re: help with POI & Co.

Nick Burch Tue, 09 Jan 2007 02:50:41 -0800

On Mon, 8 Jan 2007, Joerg Hohwiller wrote:

For msword I tried HWPF but the result was really bad.

Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't filter out all the "text" entries that aren't really text, but any patches to fix that would be appreciated :)

For spidering, it's normally fine to use, since it doesn't normally matter if you get a few "bonus" words through for some of the special fields.

I have modified the sources so that the constructor can also take a POIFilesystem and not only a File. There are still some bugs. I would fix them but would I be allowed to create a new release of this stuff and publish it with my project? Or is there a way how to submit a patch to textmining.org?

textmining.org belongs to Ryan Ackley, who used to contribute to POI, until he went to work for a company that licenses the file format documentation from Microsoft. You'll need to contact him yourself with any patches.

For powerpoint I tried HSLF what could not parse most of the documents.

That's odd. I have almost no trouble using org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of powerpoint documents. What problems did you hit?

(Normally you want to catch CorruptPowerPointFileException and EncryptedPowerPointFileException, and skip over them, and catch ArrayIndexOutOfBoundsException, and report bugs for those)
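
As a rough illustration of that skip-or-report policy, here is a minimal sketch for a spider loop. It deliberately uses no POI classes so that it runs on its own: ExtractionTriage, Outcome, Result and triage are hypothetical names, the extractor is passed in as a plain function, and the set of "skippable" exception types is a parameter. In real use the function would presumably wrap PowerPointExtractor and the set would hold the two POI exception classes, assuming they are unchecked exceptions in your POI version.

```java
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper: decides per document whether to index, skip, or file a bug. */
class ExtractionTriage {

    enum Outcome { OK, SKIPPED, BUG }

    /** Outcome of one attempt, with the extracted text (OK) or the failure description. */
    static final class Result {
        final Outcome outcome;
        final String detail;

        Result(Outcome outcome, String detail) {
            this.outcome = outcome;
            this.detail = detail;
        }
    }

    /**
     * Runs the extractor on one document path.
     * - exception types listed in skippable: the file is unreadable, skip it
     * - ArrayIndexOutOfBoundsException: likely a parser bug, flag it for a report
     * - anything else: rethrown, since this sketch has no policy for it
     */
    static Result triage(String path,
                         Function<String, String> extractor,
                         Set<? extends Class<? extends RuntimeException>> skippable) {
        try {
            return new Result(Outcome.OK, extractor.apply(path));
        } catch (ArrayIndexOutOfBoundsException e) {
            return new Result(Outcome.BUG, path + ": " + e);
        } catch (RuntimeException e) {
            for (Class<? extends RuntimeException> type : skippable) {
                if (type.isInstance(e)) {
                    return new Result(Outcome.SKIPPED, path + ": " + e);
                }
            }
            throw e;
        }
    }
}
```

The spider would index the text for OK, move on quietly for SKIPPED, and keep the file plus the stack trace for BUG so that it can be attached to a bug report.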

For excel I tried HSSF what throws an exception for every document I read.

You shouldn't really have any problems with HSSF. There are lots of examples for hssf, did you follow them?


Nick
