Hi there,

I am a newbie to this list.

For my open-source project I wrote a search solution using Lucene
that can extract text content from binary files.
For MS-Office files I use POI.

The POI basics seem to work fine and stable, but the problem lies in
the parts built on top of them that are used to extract the text.

For MS Word I tried HWPF, but the result was really bad. I discovered
tm-extractors from textmining.org, which is not perfect but quite useful.
Somehow it seems to be related to POI, but I cannot find much information
because the site www.textmining.org was hacked a long time ago and
the project seems to be quite dead.
From the sources I found in the Maven repository, it was written by
Ryan Ackley. I have modified the sources so that the constructor can also
take a POIFSFileSystem and not only a File.
There are still some bugs. I would fix them, but would I be allowed to
create a new release of this code and publish it with my project?
Or is there a way to submit a patch to textmining.org?

For PowerPoint I tried HSLF, which could not parse most of the documents.
So I wrote my own solution, which seems to accept all
documents but causes strange duplications of text passages. Maybe someone
out there has some knowledge to help me with that.

For Excel I tried HSSF, which throws an exception for every document I read.
Maybe I am using the API in the wrong way. Since writing the PowerPoint parser
myself was a real pain (these formats are so ugly), I do not want to go through
hell again for Excel.

Please help me if you have any hints...

You can find my work at:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/

Best regards
  Jörg

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
