Nick Burch wrote:
> On Tue, 9 Jan 2007, Joerg Hohwiller wrote:
>> Besides I used the official POI release which is very old. I did NOT
>> try the HEAD from svn.
>
> You should probably try with the svn head, you will generally have more
> luck with HWPF and HSLF from there.

Okay, thanks for the tip.

>> I did NOT even open most of the documents. The constructor caused an
>> exception. Something like illegal file format or magic-number or
>> something.
>
> I use hslf for a web spider that tries lots of random documents, and
> it's ok on almost all of them, so it's odd that you're having such problems
>
>>> (Normally you want to catch CorruptPowerPointFileException and
>>> EncryptedPowerPointFileException, and skip over them, and catch
>>> ArrayIndexOutOfBoundsException, and report bugs for those)
>>
>> If an ArrayIndexOutOfBoundsException is thrown by a method where the
>> user did not supply an index as parameter the implementation looks
>> like a hack to me. Same applies to NullPointerExceptions.
>
> These two are caused by powerpoint files containing things that we
> didn't know they might, and which our test documents don't. If you
> report bugs for them, and include the problem document, we can try and
> figure out which of our assumptions on the file format are wrong, and
> work to fix them.

I already debugged into it. It occurred when an UnknownRecord was created.
It is generally not a good idea to assume anything about something you
don't even know. In such situations you should always check indices and
lengths before accessing or copying arrays. Besides, I have seen
printStackTrace() calls, which is generally bad practice for a library.
Please use nested exceptions for situations like this. I hope this has
already been fixed in the 2.5 years since the release...

>> My problem is that I extract many parts of text twice from the file.
>> It seems to me that they are really in there twice even though not
>> visible to the powerpoint application user.
>
> Yup, that's to be expected on quicksaved files.
> QuickButCruddyTextExtractor will do something similar.

Okay.

> Your only option if you want to avoid that is to implement all the
> PersistPtr stuff, then parse SlideListWithTexts, and DoTheRightThing(tm)
> with it all. At which point, you've re-implemented most of hslf....

That gives me some hints to go on. I will have a look at it and also
compare this option with using the latest trunk. Thanks!

> Nick

Regards
Jörg
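
P.S. To make the point about checking indices and lengths concrete, here is a minimal sketch in plain Java. It is not POI code. The class SafeRecordReader, the exception BadRecordException, and the assumed layout (an 8-byte header whose last four bytes hold the little-endian payload length) are hypothetical and chosen only for illustration. It shows two things: validating offsets and lengths before copying from an array, and nesting the original exception as a cause instead of calling printStackTrace().

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these names are POI classes. */
final class SafeRecordReader {

    /** Checked exception that can carry the underlying cause (a nested exception). */
    static class BadRecordException extends Exception {
        BadRecordException(String message) {
            super(message);
        }

        BadRecordException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Assumption for this sketch only: 8-byte header, with the payload
    // length stored little-endian in header bytes 4..7.
    static final int HEADER_SIZE = 8;

    /** Returns a copy of the payload of the record that starts at 'start'. */
    static byte[] readPayload(byte[] source, int start) throws BadRecordException {
        if (source == null) {
            throw new BadRecordException("source is null");
        }
        // Check that a complete header fits before reading any of it.
        if (start < 0 || start > source.length - HEADER_SIZE) {
            throw new BadRecordException("No room for a record header at offset " + start);
        }
        long length = readUnsignedIntLE(source, start + 4);
        long available = (long) source.length - start - HEADER_SIZE;
        // Never trust a length field read from the file.
        if (length > available) {
            throw new BadRecordException("Record at offset " + start + " claims "
                    + length + " bytes but only " + available + " remain");
        }
        int from = start + HEADER_SIZE;
        return Arrays.copyOfRange(source, from, from + (int) length);
    }

    /** Counts consecutive records, wrapping any unexpected runtime failure. */
    static int countRecords(byte[] source) throws BadRecordException {
        int count = 0;
        int pos = 0;
        try {
            while (pos < source.length) {
                pos += HEADER_SIZE + readPayload(source, pos).length;
                count++;
            }
        } catch (RuntimeException e) {
            // Nest the cause so the caller decides what to log or report.
            throw new BadRecordException("Unexpected failure at offset " + pos, e);
        }
        return count;
    }

    private static long readUnsignedIntLE(byte[] b, int off) {
        return (b[off] & 0xFFL)
                | ((b[off + 1] & 0xFFL) << 8)
                | ((b[off + 2] & 0xFFL) << 16)
                | ((b[off + 3] & 0xFFL) << 24);
    }
}
```

With this shape, a caller that hits a malformed record gets one exception type to catch, and getCause() still exposes the original failure for a bug report.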
