Hi there, I am a newbie to this list.
For my open-source project I wrote a search solution using Lucene that can extract text content from binary files. For MS Office files I use POI. The POI basics seem to work fine and are stable, but the problem lies in the parts built on top of them that extract the text.

For MS Word I tried HWPF, but the result was really bad. I discovered tm-extractors from textmining.org, which is not perfect but quite useful. It seems to be related to POI somehow, but I cannot find much information, since the site www.textmining.org was hacked a long time ago and the project seems to be quite dead. According to the sources I found in the Maven repository, it was written by Ryan Ackley. I have modified the sources so that the constructor can also take a POIFSFileSystem and not only a File. There are still some bugs. I would fix them, but would I be allowed to create a new release of this code and publish it with my project? Or is there a way to submit a patch to textmining.org?

For PowerPoint I tried HSLF, which could not parse most of the documents. So I wrote my own solution, which seems to accept all documents but causes strange duplications of text passages. Maybe someone out there has some knowledge to help me with that.

For Excel I tried HSSF, which throws an exception for every document I read. Maybe I am using the API in the wrong way. Since writing the PowerPoint parser myself was a real pain (these formats are so ugly), I do not want to go through hell again for Excel.

Please help me if you have any hints...

You can find my work at:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/

Best regards
Jörg
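
P.S. To make the PowerPoint duplication problem more concrete: until the real cause in the parser is found, a stopgap could be to drop repeated passages after extraction. The following is only a minimal sketch of that idea, not the code from my repository. The class name PassageDeduplicator and the method dropRepeats are hypothetical, and it assumes the parser hands back the extracted text as a list of passages. It would also remove passages that are legitimately repeated in a presentation, so it hides the bug rather than fixing it.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of POI or tm-extractors.
class PassageDeduplicator {

    /**
     * Returns the passages in their original order, keeping only the
     * first occurrence of each one. Empty passages are dropped.
     */
    static List<String> dropRepeats(List<String> passages) {
        // LinkedHashSet ignores repeated entries and preserves insertion order.
        Set<String> seen = new LinkedHashSet<>();
        for (String passage : passages) {
            // Collapse whitespace so copies that differ only in spacing compare equal.
            String normalized = passage.trim().replaceAll("\\s+", " ");
            if (!normalized.isEmpty()) {
                seen.add(normalized);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

For search indexing the loss of exact repetition should not matter much, since Lucene only needs the terms, but the term frequencies would still be more accurate with a parser that does not duplicate text in the first place.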
