Are you reading the Encoding information from the Font that is being used to 
show the text?

You MUST do so - you can't interpret the text in the content stream without it.

Leonard

-----Original Message-----
From: Jo Van der Snickt [mailto:[email protected]] 
Sent: Wednesday, August 04, 2010 10:54 AM
To: 'Dominik Seichter'; [email protected]
Subject: Re: [Podofo-users] Weird buffer returned by PdfString::GetString()

Hello Dom,

That's indeed what I'm doing. Based on the operator I determine how to extract 
the text. In one of the documents I noticed that the array associated with the 
show-text operator (TJ) contains both string-typed elements 
( textArray[i].IsString() returns true ) and hex-string-typed elements 
( textArray[i].IsHexString() returns true ).

The string-typed elements are no problem. These are simple UTF-8 texts. For the 
hex strings I get the weird string I mentioned below. So, I get something like 
(printed in hex): 
00 4c 00 51 00 57 00 55 00 52 00 47 00 58 00 46 00 46 00 4c

If I assume that it is a two-byte encoding, I still don't get what I want. But 
if I add 0x1d to every second byte (the low byte of each pair) and treat the 
result as a 16-bit encoding, then it is (printed in hex):

00 69 00 6e 00 74 00 72 00 6f 00 64 00 75 00 63 00 63 00 69
    i     n     t     r     o     d     u     c     c     i

And this is exactly the word I observe at that location when I open the PDF 
document in Acrobat reader.
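
For illustration, the experiment described above can be written as a small 
standalone C++ sketch. It does not use PoDoFo; the function name 
decodeWithOffset is made up for this example, and the offset 0x1d is only the 
value observed in this one document, not a general rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Experiment only: read big-endian 16-bit codes and add a fixed offset.
// The default offset 0x1d is an assumption taken from this one document.
std::string decodeWithOffset(const std::vector<std::uint8_t>& bytes,
                             std::uint16_t offset = 0x1d)
{
    // Two-byte codes require an even number of bytes.
    assert(bytes.size() % 2 == 0);

    std::string text;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Combine the high and low byte into one 16-bit code.
        const std::uint16_t code =
            static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]);
        const std::uint16_t shifted =
            static_cast<std::uint16_t>(code + offset);
        // Only printable ASCII is handled; anything else becomes '?'.
        text += (shifted >= 0x20 && shifted < 0x7f)
                    ? static_cast<char>(shifted)
                    : '?';
    }
    return text;
}
```

Fed with the twenty bytes shown above, this yields "introducci". A fixed 
offset like this is fragile, since nothing guarantees that another font or 
document uses the same shift.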

So, I was wondering whether anyone has already encountered something like that 
and how it could be solved. 

I investigated podofotxtextract and used it as a starting point for this 
application. But, I didn't find anything like that in it.

Best regards,
Jo


-----Original Message-----
From: Dominik Seichter [mailto:[email protected]] 
Sent: Wednesday, August 4, 2010 16:03
To: [email protected]; Jo Van der Snickt
Subject: Re: [Podofo-users] Weird buffer returned by PdfString::GetString()

Hello Jo,

So you are extracting strings from the contents stream? 
These depend on the actual encoding of the font being used, so PdfString 
does not know how to convert them into Unicode. You might want to look at the 
podofotxtextract example to see how to do that, but please note: 
PoDoFo does not support all possible encodings yet, so you will need to add the 
missing encodings yourself.
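
As a rough sketch of what honoring the font's encoding could look like: when a 
font carries a ToUnicode CMap, its bfrange entries map ranges of codes to 
consecutive Unicode values. The code below assumes those ranges have already 
been parsed out of the font (the PoDoFo calls for that are not shown). The 
names BfRange, appendUtf8 and decodeWithRanges are hypothetical, and this is 
not how podofotxtextract is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One "bfrange" entry: codes lo..hi map to consecutive Unicode code
// points starting at dst. (Hypothetical helper type for this sketch.)
struct BfRange {
    std::uint16_t lo;
    std::uint16_t hi;
    char32_t dst;
};

// Append one Unicode code point to a UTF-8 encoded string.
void appendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Decode big-endian two-byte codes through the given ranges.
// Codes that no range covers become U+FFFD (replacement character).
std::string decodeWithRanges(const std::vector<std::uint8_t>& bytes,
                             const std::vector<BfRange>& ranges)
{
    // Two-byte codes require an even number of bytes.
    assert(bytes.size() % 2 == 0);

    std::string text;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const std::uint16_t code =
            static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]);
        char32_t cp = 0xFFFD;
        for (const BfRange& r : ranges) {
            if (code >= r.lo && code <= r.hi) {
                cp = static_cast<char32_t>(r.dst + (code - r.lo));
                break;
            }
        }
        appendUtf8(text, cp);
    }
    return text;
}
```

With a made-up range <0003> <0061> <0020>, the bytes 00 4c 00 51 00 57 ... 
decode to "introducci". A range of that shape would account for a constant 
0x1d difference, but whether the document in question actually contains one is 
not confirmed in this thread.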

Best regards,
        Dom

On Monday, 02 August 2010, Jo Van der Snickt wrote:
> Hello,
> 
> I'm trying to parse a PDF document to extract all the text. For the 
> array type I check each element for its type and I only consider the 
> array elements that contain either a string or a hexstring.
> 
> For the strings I retrieve the value with 
> textArray[i].GetString().GetStringUtf8(), which works just fine. But for 
> the hexstring I get weird results in the buffer. To investigate the 
> content of the buffer I used the following piece of code:
> 
>   else if ( textArray[i].IsHexString() )
>   {
>     char * ptrHexString = static_cast<char *>(
>         malloc( sizeof(char) * ( textArray[i].GetString().GetLength() + 2 ) ) );
>     memcpy( ptrHexString, textArray[i].GetString().GetString(),
>             textArray[i].GetString().GetLength() );
> 
>     for ( int strIndex = 0;
>           strIndex < static_cast<int>(textArray[i].GetString().GetLength());
>           strIndex++ )
>     {
>        cout << setw(2) << setfill('0') << dec << strIndex << ": "
>             << hex << setw(2) << setfill('0')
>             << static_cast<int>(ptrHexString[strIndex]) << " "
>             << static_cast<char>(ptrHexString[strIndex] + 0x1d) << endl;
>     }
>     free( ptrHexString );
>   }
> 
> This displays something like:
> 
> 00: 00
> 01: 4c i
> 02: 00
> 03: 51 n
> 04: 00
> 05: 57 t
> 06: 00
> 07: 55 r
> 08: 00
> 09: 52 o
> 10: 00
> 11: 47 d
> 12: 00
> 13: 58 u
> 14: 00
> 15: 46 c
> 16: 00
> 17: 46 c
> 18: 00
> 19: 4c i
> 
> It looks like a two-byte encoding (first byte 0x00), but note that I 
> had to add 0x1d to the actual byte to get the character I'm expecting 
> (here the text "introducci").
> 
> Any idea what I could have done wrong?
> The document that I use to test displays correctly in Acrobat Reader.
> 
> - Jo
> 
> 
> 
> This e-mail and any attachments thereto may contain information which 
> is confidential and/or protected by intellectual property rights and 
> are intended for the sole use of the recipient(s) named above. Any 
> use of the information contained herein (including, but not limited 
> to, total or partial reproduction, communication or distribution in 
> any form) by persons other than the designated recipient(s) is 
> prohibited. If you have received this e-mail in error, please notify 
> the sender either by telephone or by e-mail and delete the material 
> from any computer. Thank you for your cooperation.
> 
> Dilys bvba
> Nieuwe Stationsstraat 23
> 9160 Lokeren
> 
> tel +32 9 356 97 13
> fax +32 9 353 90 11
> 
> mailto:[email protected]
> http://www.dilys.be
> 
> 
> 
> ------------------------------------------------------------------------------
> The Palm PDK Hot Apps Program offers developers who use the
> Plug-In Development Kit to bring their C/C++ apps to Palm for a share
> of $1 Million in cash or HP Products. Visit us here for more details:
> http://p.sf.net/sfu/dev2dev-palm
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users
> 


--
**********************************************************************
Dominik Seichter - [email protected]
KRename   - http://www.krename.net - Powerful batch renamer for KDE
KBarcode  - http://www.kbarcode.net - Barcode and label printing
PoDoFo    - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game, for KDE
Alan      - http://alan.sf.net - A Turing Machine in Java
**********************************************************************

