On 2/13/07, Yegor Kozlov <[EMAIL PROTECTED]> wrote:
Probably the language info is really stored in TextSpecInfoAtom. Do you have any idea about internal structure of this record? If you do, please share it. Yegor
I do not have any knowledge about the internal structure of 4010 atoms (or any other, for that matter) beyond what comes from the SDK specs or from reverse engineering. However, I will tell you what I have noticed. First, let me state that for the moment all of my efforts are focused on Slide text extraction; I will do notes later. Also, though it does not harm generality, I work with English and Greek text, so all of my examples use MS language codes 1032 (Greek) and 1033 (English, US). My PowerPoint version is 2003 SP2 on WinXP Pro SP2.
From my early tests, I notice that text is stored in two types of records. Atom 4000 holds Unicode text and atom 4008 holds any text that can be represented as pure ASCII. I infer this from the fact that any attempt to insert Greek characters in the text results in the string being stored as Unicode, whereas all pure English text is stored as atom 4008.
What is interesting is that all 4008 atoms are followed by a 4010 atom that has the value 1033 (0x0409, stored little-endian as the bytes 09 04) at offset 0x12 or 0x16 (pointed to by the atom header; more to find out here). 4000 atoms are not necessarily followed by such an atom, and when they are, it holds no value that can be mapped to a language code. One explanation is that this 4010 atom holds some information about the text that has nothing to do with language (as per the spec, 4010 atoms also hold non-language info).
I believe that my presentation is consistent, since for Unicode text the notion of language ID does not apply and any ASCII text has language info attached to it. However, besides the fact that I could be way off the mark here, I am troubled by the absence of codepages in non-English text. I have yet to get PowerPoint to store Greek text as non-Unicode.
I will continue my work with more ppt's and get back with more info. I would appreciate some feedback though, even if it is simply ideas for test cases.
cheers,
Chris
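For anyone who wants to repeat the observation above on their own files, here is a minimal sketch in Python of that kind of scan. It is not the code used for the tests described in this message. It assumes the "PowerPoint Document" stream has already been extracted from the OLE2 file into a bytes object, and that each record starts with the usual 8-byte little-endian header (version/instance, type, length). The names walk_records, find_lcid_offsets and report are made up for illustration. The search for 1032/1033 is only a heuristic over the raw 4010 payload and does not decode the atom's real structure. Reported offsets are relative to the start of the payload, which may not be the same base as the 0x12/0x16 mentioned above.

```python
import struct

TEXT_CHARS_ATOM = 4000      # text stored as UTF-16LE
TEXT_BYTES_ATOM = 4008      # text stored one byte per character
TEXT_SPEC_INFO_ATOM = 4010  # the atom that seems to carry the language ID

# Assumed record header: 16-bit version/instance, 16-bit type, 32-bit length.
HEADER = struct.Struct("<HHI")


def walk_records(buf, start=0, end=None):
    """Yield (record_type, payload) for each atom, descending into containers."""
    end = len(buf) if end is None else end
    pos = start
    while pos + HEADER.size <= end:
        ver_inst, rec_type, rec_len = HEADER.unpack_from(buf, pos)
        body = pos + HEADER.size
        body_end = body + rec_len
        if body_end > end:
            break  # truncated data or not on a record boundary
        if ver_inst & 0x000F == 0x000F:
            # Low nibble 0xF marks a container: its payload is more records.
            yield from walk_records(buf, body, body_end)
        else:
            yield rec_type, buf[body:body_end]
        pos = body_end


def find_lcid_offsets(payload, lcids=(1032, 1033)):
    """Return [(offset, lcid)] for every 16-bit little-endian match."""
    hits = []
    for off in range(len(payload) - 1):
        (value,) = struct.unpack_from("<H", payload, off)
        if value in lcids:
            hits.append((off, value))
    return hits


def report(stream):
    """Summarise the text atoms and candidate language IDs in stream order."""
    out = []
    for rec_type, payload in walk_records(stream):
        if rec_type == TEXT_CHARS_ATOM:
            out.append(("unicode", payload.decode("utf-16-le")))
        elif rec_type == TEXT_BYTES_ATOM:
            out.append(("bytes", payload.decode("latin-1")))
        elif rec_type == TEXT_SPEC_INFO_ATOM:
            out.append(("specinfo", find_lcid_offsets(payload)))
    return out
```

Running report() over a stream and checking which entry follows each "bytes" or "unicode" entry is one way to see whether a 4010 atom with a recognisable language ID is present. A byte-wise search can produce false positives, so any hit is only a candidate.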
