Nick Burch wrote:
> On Mon, 8 Jan 2007, Joerg Hohwiller wrote:
>> For msword I tried HWPF but the result was really bad.
>
> Were you using org.apache.poi.hwpf.extractor.WordExtractor ? It doesn't
> filter out all the "text" entries that aren't really text, but any
> patches to fix that would be appreciated :)

That is what I tried. It threw exceptions for most of the documents.
My problem is that I have a huge repository with very old to very new
documents. In practice this means that every sin of office history can be
found in the documents I need to read...
I read that textmining also supports older versions of Word that are not
supported by HWPF. Besides, I used the official POI release, which is very
old. I did NOT try the HEAD from svn.

> For spidering, it's normally fine to use, since it doesn't normally
> matter if you get a few "bonus" words through for some of the special
> fields.
>
>> I have modified the sources so that the constructor can also take a
>> POIFilesystem and not only a File. There are still some bugs. I would
>> fix them but would I be allowed to create a new release of this stuff
>> and publish it with my project? Or is there a way how to submit a
>> patch to textmining.org?
>
> textmining.org belongs to Ryan Ackley, who used to contribute to POI,
> until he went to work for a company that licenses the file format
> documentation from Microsoft. You'll need to contact him yourself with
> any patches.

I will see what I can do...

>> For powerpoint I tried HSLF what could not parse most of the documents.
>
> That's odd. I have almost no trouble using
> org.apache.poi.hslf.extractor.PowerPointExtractor on a wide range of
> powerpoint documents. What problems did you hit?

I could NOT even open most of the documents. The constructor threw an
exception, something like an illegal file format or magic number.
>
> (Normally you want to catch CorruptPowerPointFileException and
> EncryptedPowerPointFileException, and skip over them, and catch
> ArrayIndexOutOfBoundsException, and report bugs for those)

If an ArrayIndexOutOfBoundsException is thrown by a method to which the
user did not supply an index as a parameter, the implementation looks like
a hack to me. The same applies to NullPointerExceptions. I got all of
these...

POIFilesystem and the code for extracting the metadata seem very stable to
me. But I have not had good experiences with the rest of POI.

Anyhow, I have now written a PPT extractor from scratch that is based only
on POIFilesystem and NOT on HSLF. The advantage is a low memory footprint:
my class can be configured not to exceed a specific buffer size for
allocation, so users do NOT get an OutOfMemoryError from an evil file that
is too big. Even those evil files are parsed, but only as much data is
extracted as the configured buffer size allows.
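To illustrate the buffer limit described above, here is a minimal sketch. It is not the code from ContentParserPpt; the class name BoundedTextBuffer and its methods are hypothetical, and it assumes the limit is counted in characters and that the buffer is allocated once at its configured size.

```java
/**
 * Hypothetical helper: collects extracted text but never holds more than
 * the configured number of characters, so an oversized file cannot
 * trigger an OutOfMemoryError in the collecting step.
 */
final class BoundedTextBuffer {

    private final char[] buffer;   // allocated once, never grown
    private int length;            // number of characters stored so far
    private boolean truncated;     // true once any text had to be dropped

    BoundedTextBuffer(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must not be negative");
        }
        this.buffer = new char[maxChars];
    }

    /**
     * Appends as much of the given text as still fits.
     *
     * @return true if the whole text was stored, false if it was cut off.
     */
    boolean append(CharSequence text) {
        int free = buffer.length - length;
        int count = Math.min(free, text.length());
        for (int i = 0; i < count; i++) {
            buffer[length++] = text.charAt(i);
        }
        if (count < text.length()) {
            truncated = true;
            return false;
        }
        return true;
    }

    boolean isTruncated() {
        return truncated;
    }

    @Override
    public String toString() {
        return new String(buffer, 0, length);
    }
}
```

A parser loop could stop reading records as soon as append returns false, so the rest of an oversized file is never touched.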
My problem is that I extract many parts of the text twice from the file. It
seems to me that they really are in there twice, even though this is not
visible to the PowerPoint user. If someone can help me with that, I would
be very grateful for any hint:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-ppt/src/main/java/net/sf/mmm/search/parser/impl/ContentParserPpt.java

>> For excel I tried HSSF what throws an exception for every document I
>> read.
>
> You shouldn't really have any problems with HSSF. There are lots of
> examples for hssf, did you follow them?

I suppose NOT. I will look at them. This is my code:
http://m-m-m.googlecode.com/svn/trunk/mmm-search/mmm-search-parser/mmm-search-parser-xls/src/main/java/net/sf/mmm/search/parser/impl/ContentParserXls.java
Once I have checked for my own mistakes, I will send you the stack traces
of the remaining problems.

> Nick

Thanks
Jörg
