Greetings everyone, it appears that I have some useful results regarding the extraction of content language information from PPT documents using HSLF. Note that everything below is derived from reverse engineering, and its correctness remains open to question until more experiments are done.
The data set I used consisted of many PPT documents, from different versions of MS PowerPoint, in Greek, French, English and Italian. I have noticed the following patterns:

4008 atoms are followed by a 4010 atom that stores language information. This is also true for 4000 atoms that have Unicode text *with* non-Unicode (my guess: the system's default language is stored as Unicode, all others as ASCII; more to find out on that).

Now, the 4010 atoms that correspond to 4008 atoms have a fairly consistent appearance, that is:

  - first 4 bytes: as known (record type and code)
  - next 4 bytes: length (also known)

What follows are records that hold language ID and spelling information, in the following format:

  - First, the number of characters this record applies to (4 bytes).
  - The next bytes are a bit more complicated. So far I have encountered 2 types of data: either the value 0x00000006 or 0x000k00000007 (the lengths are correct, that is, they *are* different). These certainly carry spelling information, which is apparent from the transition from the second to the first when the "ignore spelling" option is selected for that text. The k above is a value that varies and that I have so far been unable to attribute to any property, presumably because of the simplicity of the text I have used.
  - After that comes the language information (2 bytes, as known for MS formats).
  - Last, a trailer value of 2 bytes that is constantly 0x0000.

The atom ends with bytes whose appearance I have so far been unable to control. They are short, say 8 or 10 bytes, and mostly zero. They could be information that applies to the slide as a whole, but I have nothing on that yet.

I hope this information is of some value towards more general results. Time constraints do not allow me to work on this much, however, so feedback is, as always, appreciated. I will post any further findings I come up with.
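For anyone who wants to experiment, the record layout described above can be sketched roughly as follows. This is plain Python and does not use the HSLF API. The function name `parse_4010_atom`, the dictionary keys, the little-endian byte order, and the reading of 0x000k00000007 as a 4-byte value 7 followed by a 2-byte k are all assumptions made for illustration. The sample bytes are synthetic, not taken from a real file.

```python
import struct


def parse_4010_atom(data: bytes):
    """Split a 4010 atom into (runs, trailing_bytes).

    Follows the layout guessed at in this post; every field
    interpretation here is an assumption, not a confirmed format.
    """
    # 8-byte header: 2 bytes options, 2 bytes record type, 4 bytes body length.
    _options, rec_type, length = struct.unpack_from("<HHI", data, 0)
    if rec_type != 4010:
        raise ValueError(f"expected record type 4010, got {rec_type}")
    body = data[8:8 + length]

    runs = []
    pos = 0
    while pos + 8 <= len(body):
        char_count, flags = struct.unpack_from("<II", body, pos)
        if flags == 6:
            size = 12          # count + flags + language + trailer
        elif flags == 7:
            size = 14          # as above, plus the 2-byte k value
        else:
            break              # not a pattern observed so far; stop here
        if pos + size > len(body):
            break

        cursor = pos + 8
        k = None
        if flags == 7:
            (k,) = struct.unpack_from("<H", body, cursor)
            cursor += 2
        lang_id, trailer = struct.unpack_from("<HH", body, cursor)
        runs.append({"chars": char_count, "flags": flags, "k": k,
                     "lang_id": lang_id, "trailer": trailer})
        pos += size

    # Whatever is left is the unexplained tail (mostly zero bytes).
    return runs, body[pos:]


# Synthetic example: one 0x6 run (0x0409, US English), one 0x7 run
# (0x0408, Greek) with k = 1, then 8 zero bytes as the unexplained tail.
sample_body = (struct.pack("<IIHH", 5, 6, 0x0409, 0)
               + struct.pack("<IIHHH", 3, 7, 1, 0x0408, 0)
               + bytes(8))
sample = struct.pack("<HHI", 0, 4010, len(sample_body)) + sample_body
runs, tail = parse_4010_atom(sample)
```

The loop stops at the first value that is neither 6 nor 7, so the unexplained tail is returned untouched for inspection. If a tail ever happened to start with bytes that look like a valid run, this sketch would misread it.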
cheers,
Chris Gioran
