Hello Dom,
That's indeed what I'm doing. Based on the operator I determine how to extract
the text. In one of the documents I noticed that the array associated with the
show-text operator (TJ) contains both string-typed elements
( textArray[i].IsString() returns true ) and hex-string-typed elements
( textArray[i].IsHexString() returns true ).
The string-typed elements are not a problem: these are simple UTF-8 texts. For
the hex strings I get the weird buffer I mentioned below. So, I get something
like (printed in hex):
00 4c 00 51 00 57 00 55 00 52 00 47 00 58 00 46 00 46 00 4c
If I assume that it is a two-byte encoding, I still don't get what I want. But
if I add 0x1d to every second byte (the low byte of each pair) and treat the
result as a 16-bit encoding, then it is (printed in hex):
00 69 00 6e 00 74 00 72 00 6f 00 64 00 75 00 63 00 63 00 69
i n t r o d u c c i
And this is exactly the text I observe at that location when I open the PDF
document in Acrobat Reader.
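The arithmetic described above can be written down as a small stand-alone sketch. It only illustrates the observation and is not PoDoFo API: the helper name decodeWithOffset is hypothetical, and the constant 0x1d is an assumption taken from this one document. Another font will very likely need a different mapping.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of PoDoFo): treats the buffer as big-endian
// 16-bit codes and adds a fixed offset to each code. Results that fall in the
// printable ASCII range are emitted; everything else becomes '?'.
// An odd trailing byte is ignored.
std::string decodeWithOffset( const std::vector<unsigned char> & raw, unsigned offset )
{
    std::string text;
    for ( std::size_t i = 0; i + 1 < raw.size(); i += 2 )
    {
        // Combine high and low byte into one 16-bit code.
        const unsigned code = ( static_cast<unsigned>( raw[i] ) << 8 ) | raw[i + 1];
        const unsigned mapped = code + offset;
        text += ( mapped >= 0x20 && mapped < 0x7f ) ? static_cast<char>( mapped ) : '?';
    }
    return text;
}
```

Called with the twenty bytes quoted above and an offset of 0x1d, this returns "introducci". With an offset of 0 the same bytes give "LQWURGXFFL", which is the unusable result of reading them as plain 16-bit text.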
So, I was wondering whether someone has already encountered something like this
and how it could be solved.
I investigated podofotxtextract and used it as a starting point for this
application. But, I didn't find anything like that in it.
Best regards,
Jo
-----Original Message-----
From: Dominik Seichter [mailto:[email protected]]
Sent: Wednesday, 4 August 2010 16:03
To: [email protected]; Jo Van der Snickt
Subject: Re: [Podofo-users] Weird buffer returned by PdfString::GetString()
Hello Jo,
So you are extracting strings from the contents stream?
These are dependent on the actual encoding of the font being used. So PdfString
does not know how to convert them into Unicode. You might want to look at the
podofotxtextract example on how to do that, but please note:
PoDoFo does not support all possible encodings yet. So you will need to add the
missing encodings yourself.
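As a rough illustration of what handling such an encoding can involve for 2-byte codes: a PDF font may carry a /ToUnicode CMap whose bfrange entries map a range of character codes to consecutive Unicode values. The sketch below applies such ranges to a buffer of big-endian 2-byte codes. The BfRange struct, the function names and the sample range <0003> <005f> <0020> are assumptions made for illustration (the range was picked because it reproduces the 0x1d offset reported in this thread). The real entries have to be read from the font used in the document, and none of this is PoDoFo API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical representation of one "bfrange" entry of a ToUnicode CMap:
// codes first..last map to unicodeStart, unicodeStart + 1, ...
struct BfRange
{
    std::uint16_t first;
    std::uint16_t last;
    char32_t      unicodeStart;
};

// Returns the Unicode code point for a code, or U+FFFD if no range matches.
char32_t lookupCode( const std::vector<BfRange> & ranges, std::uint16_t code )
{
    for ( const BfRange & r : ranges )
    {
        if ( code >= r.first && code <= r.last )
            return r.unicodeStart + static_cast<char32_t>( code - r.first );
    }
    return 0xFFFD;
}

// Decodes big-endian 2-byte codes; an odd trailing byte is ignored.
std::u32string decodeCodes( const std::vector<BfRange> & ranges,
                            const std::vector<unsigned char> & raw )
{
    std::u32string text;
    for ( std::size_t i = 0; i + 1 < raw.size(); i += 2 )
    {
        const std::uint16_t code =
            static_cast<std::uint16_t>( ( raw[i] << 8 ) | raw[i + 1] );
        text += lookupCode( ranges, code );
    }
    return text;
}
```

With the sample range, the bytes 00 4c 00 51 00 57 decode to "int". A code outside every range yields U+FFFD, so missing mappings stay visible instead of being silently dropped.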
Best regards,
Dom
On Monday, 02 August 2010, Jo Van der Snickt wrote:
> Hello,
>
> I'm trying to parse a PDF document to extract all the text. For the
> array type I check each element for its type and I only consider the
> array elements that contain either a string or a hexstring.
>
> For the strings I retrieve the value with
> textArray[i].GetString().GetStringUtf8(), which works just fine. But for
> the hex strings I get weird results in the buffer. To investigate the
> content of the buffer I used the following piece of code:
>
> else if ( textArray[i].IsHexString() )
> {
>     // Raw bytes of the hex string, still in the font's own encoding.
>     const PdfString & hexString = textArray[i].GetString();
>     const char * ptrHexString = hexString.GetString();
>     const int length = static_cast<int>( hexString.GetLength() );
>
>     for ( int strIndex = 0; strIndex < length; strIndex++ )
>     {
>         // Go through unsigned char so bytes >= 0x80 are not sign-extended.
>         const unsigned char byte = static_cast<unsigned char>( ptrHexString[strIndex] );
>         // Print: index (decimal), byte (hex), byte + 0x1d as a character.
>         cout << setw(2) << setfill('0') << dec << strIndex << ": "
>              << hex << setw(2) << setfill('0') << static_cast<int>( byte ) << " "
>              << static_cast<char>( byte + 0x1d ) << endl;
>     }
> }
>
> This displays something like:
>
> 00: 00
> 01: 4c i
> 02: 00
> 03: 51 n
> 04: 00
> 05: 57 t
> 06: 00
> 07: 55 r
> 08: 00
> 09: 52 o
> 10: 00
> 11: 47 d
> 12: 00
> 13: 58 u
> 14: 00
> 15: 46 c
> 16: 00
> 17: 46 c
> 18: 00
> 19: 4c i
>
> It looks like a two-byte encoding (first byte 0x00), but note that I
> had to add 0x1d to the actual byte to get the character I'm expecting
> (here the text "introducci").
>
> Any idea what I could have done wrong?
> The document that I use to test displays correctly in Acrobat Reader.
>
> - Jo
>
>
>
> This e-mail and any attachments thereto may contain information which
> is confidential and/or protected by intellectual property rights and
> are intended for the sole use of the recipient(s) named above. Any
> use of the information contained herein (including, but not limited
> to, total or partial reproduction, communication or distribution in
> any form) by persons other then the designated recipient(s) is
> prohibited. If you have received this e-mail in error, please notify
> the sender either by telephone or by e-mail and delete the material
> from any computer. Thank you for your cooperation.
>
> Dilys bvba
> Nieuwe Stationsstraat 23
> 9160 Lokeren
>
> tel +32 9 356 97 13
> fax +32 9 353 90 11
>
> mailto:[email protected]
> http://www.dilys.be
>
>
>
> ---------------------------------------------------------------------------
> The Palm PDK Hot Apps Program offers developers who use the
> Plug-In Development Kit to bring their C/C++ apps to Palm for a share
> of $1 Million in cash or HP Products. Visit us here for more details:
> http://p.sf.net/sfu/dev2dev-palm
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users
>
--
**********************************************************************
Dominik Seichter - [email protected]
KRename - http://www.krename.net - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game, for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************