On 2/13/07, Yegor Kozlov <[EMAIL PROTECTED]> wrote:

The language info is probably stored in TextSpecInfoAtom.
Do you have any idea about the internal structure of this record? If
you do, please share it.

Yegor


I do not have any knowledge about the internal structure of 4010 atoms
(or any other, for that matter) that does not come from either the SDK
specs or from reverse engineering. However, I will tell you what I
have noticed.

First, let me state that for the moment all of my efforts are focused
on Slide text extraction. I will do notes later. Also, without loss of
generality, I work with English and Greek text, so all of my
examples use MS language codes 1032 (Greek) and 1033 (English). My
PowerPoint version is 2003 SP2 on WinXP Pro SP2.

From my early tests, I notice that text is stored in two types of
records. Atom 4000 holds Unicode text and atom 4008 holds any text
that can be represented as pure ASCII. I infer this from the fact that
any attempt to insert Greek characters in the text results in the
string being stored as Unicode, whereas all pure English text is
stored as atom 4008. What is interesting is that every 4008 atom is
followed by a 4010 atom that has the value 1033 (0x0409, stored
little-endian as the bytes 09 04) at offset 0x12 or 0x16 (which of the
two seems to be determined by the atom header; I still need to find
out more). Atoms 4000 are not necessarily followed by such an atom,
and when they are, it has no value that can be mapped to a language
code. One possible explanation is that such a 4010 atom holds some
information about the text that has nothing to do with language (as
per the spec, 4010 atoms also hold non-language info).
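
To make the check above reproducible, here is a minimal Python sketch
of how one might walk the records and probe the candidate offsets. It
assumes the usual 8-byte record header (16 bits of version/instance,
16-bit type, 32-bit length, all little-endian) and that a version
nibble of 0xF marks a container. It also assumes the offsets 0x12 and
0x16 are counted from the start of the 4010 record, header included;
that is a guess on my part. The function names are made up for this
example and are not part of POI.

```python
import struct

TEXT_CHARS_ATOM = 4000      # Unicode text
TEXT_BYTES_ATOM = 4008      # single-byte text
TEXT_SPEC_INFO_ATOM = 4010  # the atom suspected of carrying the language ID
HEADER_SIZE = 8             # ver/instance (2) + type (2) + length (4)


def iter_records(buf, start=0, end=None):
    """Yield (type, offset, payload) for every atom, descending into containers."""
    end = len(buf) if end is None else end
    pos = start
    while pos + HEADER_SIZE <= end:
        ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", buf, pos)
        body_start = pos + HEADER_SIZE
        body_end = body_start + rec_len
        if body_end > end:
            break  # truncated data, or not a record stream at all
        if ver_inst & 0x000F == 0x000F:
            # Container: its body is itself a sequence of records.
            yield from iter_records(buf, body_start, body_end)
        else:
            yield rec_type, pos, buf[body_start:body_end]
        pos = body_end


def probe_language_ids(buf, offsets=(0x12, 0x16)):
    """For each text atom, read a uint16 at each candidate offset of an
    immediately following 4010 atom (assumption: offsets include the header)."""
    records = list(iter_records(buf))
    results = []
    for (rec_type, _, _), nxt in zip(records, records[1:] + [None]):
        if rec_type not in (TEXT_CHARS_ATOM, TEXT_BYTES_ATOM):
            continue
        candidates = {}
        if nxt is not None and nxt[0] == TEXT_SPEC_INFO_ATOM:
            whole = buf[nxt[1]:nxt[1] + HEADER_SIZE + len(nxt[2])]
            for off in offsets:
                if off + 2 <= len(whole):
                    candidates[off] = struct.unpack_from("<H", whole, off)[0]
        results.append((rec_type, candidates))
    return results
```

Running probe_language_ids over the bytes of the document stream and
looking for 1032 or 1033 among the reported values should show quickly
whether the pattern holds across more files.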

I believe that this interpretation is consistent, since for Unicode
text the notion of language ID does not apply, and any ASCII text has
language info attached to it. However, besides the fact that I could
be way off the mark here, I am troubled by the absence of codepages
for non-English text. I have yet to get PowerPoint to store Greek text
as non-Unicode.
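
To illustrate why this is puzzling, here is a small Python sketch. It
encodes the rule as I have observed it so far (pure ASCII goes to atom
4008, everything else to atom 4000) and shows that Greek text does have
a single-byte form under Windows code page 1253. The rule itself is
only my working assumption from the tests above, and the function
names are invented for the example.

```python
def fits_encoding(text, encoding="ascii"):
    """Return True if every character of text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def expected_atom(text):
    """Guess the atom type from the observed rule: ASCII -> 4008, otherwise 4000."""
    return 4008 if fits_encoding(text, "ascii") else 4000


english = "Hello"
greek = "Γειά σου"

print(expected_atom(english))          # pure ASCII text
print(expected_atom(greek))            # needs Unicode under the observed rule
print(fits_encoding(greek, "cp1253"))  # yet a Greek codepage can represent it
```

So a codepage-based, single-byte form of the Greek string exists; I
just have not seen PowerPoint 2003 choose it.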

I will continue my work with more PPT files and get back with more
info. I would appreciate some feedback though, even if it is simply
ideas for test cases.

cheers,
Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
