Re: PPT Unicode

Tales Paiva Nogueira Thu, 07 Dec 2006 05:08:53 -0800

Hi,

I'm not changing the text, I just read it. My problem occurs when there is any TextCharsAtom, because the platform I am using doesn't support Unicode, just ISO-8859-1. So I had to change the code, replacing UTF-16LE with ISO-8859-1.

So I think I have no way out but to show the text without styles.


Thanks a lot,
--
Tales Paiva


Nick Burch wrote:

On Tue, 5 Dec 2006, Tales Paiva Nogueira wrote:
> When PowerPoint stores text in Unicode an unknown char (byte value = 0) is placed between every "normal" char, making the text 2 times longer than it really is.
TextCharsAtoms, and other Unicode-containing fields in PowerPoint files, are stored as UTF-16. That means two bytes are used to store every character. US-ASCII will be stored with the second byte zero, but other characters will need to make some use of the second byte.
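
To make the byte layout concrete, here is a minimal sketch that uses only the JDK and no HSLF classes. The class name Utf16Layout and the sample strings are made up for illustration. It encodes a string as UTF-16LE (the charset named elsewhere in this thread) and prints the bytes, so the zero second byte for ASCII and the non-zero second byte for other characters are visible.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Layout {
    // Encode text as UTF-16LE: two bytes per char, low byte first.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE);
    }

    // Render bytes as space-separated, two-digit upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII only: every second byte is zero.
        System.out.println(hex(encode("Hi")));     // 48 00 69 00
        // Cyrillic letter U+0416: the second byte is in use.
        System.out.println(hex(encode("\u0416"))); // 16 04
    }
}
```

The "unknown char" with byte value 0 is therefore the high byte of each ASCII character, not extra data.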
If you call getText() on a TextCharsAtom, it'll convert it to a string for you. You should really be using that, not getting the bytes directly.
> Is there any way to keep the style information and get the text as a TextByteAtom, instead of TextCharsAtom?
Why? PowerPoint decided to make it a TextCharsAtom, rather than a TextByteAtom, since your string contained at least one character that couldn't be represented in a TextByteAtom.
HSLF supports upgrading a TextByteAtom to a TextCharsAtom if you try to set text that can't be held in a TextByteAtom. It doesn't do the other way around.
If you really want just the low order bytes, call getText() on the TextCharsAtom, and mangle the string yourself. Not sure why you'd want to though....
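
One way the "mangle the string yourself" step could look is sketched below, using only the JDK. The class name Latin1Downgrade, the method toLatin1Safe and the sample string are hypothetical; the string stands in for whatever getText() returns. The sketch assumes the style runs are indexed by character rather than by byte, so it swaps each character that ISO-8859-1 cannot hold for '?' instead of dropping it, which leaves every character offset where it was.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Downgrade {
    // Replace every char above U+00FF (outside ISO-8859-1) with '?'.
    // One char in, one char out, so character offsets do not move.
    static String toLatin1Safe(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c <= 0xFF ? c : '?');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the String returned by getText().
        String fromAtom = "Caf\u00e9 \u0416";
        String safe = toLatin1Safe(fromAtom);
        // Every remaining char fits in one ISO-8859-1 byte.
        byte[] latin1 = safe.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(safe);
        System.out.println(fromAtom.length() + " chars, " + latin1.length + " bytes");
    }
}
```

Characters in the U+0080 to U+00FF range, such as the accented e above, survive unchanged because ISO-8859-1 maps them to the same values.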
Nick



Yegor Kozlov wrote:

Hi,

Could you provide a test case?

As I understand it, you did something like this:

 - take a ppt file with a text.
 - programmatically change the text using HSLF API
 - save file
 - style information is wrong after save.

 Is it correct?
Yegor
TPN> Hi List,
TPN> When PowerPoint stores text in Unicode an unknown char (byte value = 0)
TPN> is placed between every "normal" char, making the text 2 times longer
TPN> than it really is. I can ignore these garbage chars, but I lose the text
TPN> style information, as its indexes are based on the original Unicode
TPN> text with all that Unicode trash. :(
TPN> Is there any way to keep the style information and get the text as a
TPN> TextByteAtom, instead of TextCharsAtom?
TPN> Thank you very much.
TPN> --
TPN> Tales Paiva


