Hello Leonard,

No, I don't. And it looks like you're right: each time one of these hex strings occurs, it is preceded by a select-font (Tf) operator.
Thanks for the tip. Now I just have to figure out how to read the encoding information ;-)

Jo

-----Original Message-----
From: Leonard Rosenthol [mailto:[email protected]]
Sent: Wednesday, 4 August 2010 20:25
To: Jo Van der Snickt; 'Dominik Seichter'; [email protected]
Subject: RE: [Podofo-users] Weird buffer returned by PdfString::GetString()

Are you reading the Encoding information from the Font that is being used to show the text? You MUST do so - you can't treat the text in the content stream w/o it...

Leonard

-----Original Message-----
From: Jo Van der Snickt [mailto:[email protected]]
Sent: Wednesday, August 04, 2010 10:54 AM
To: 'Dominik Seichter'; [email protected]
Subject: Re: [Podofo-users] Weird buffer returned by PdfString::GetString()

Hello Dom,

That's indeed what I'm doing. Based on the operator I determine how to extract the text. In one of the documents I noticed that the array associated with the show-text operator (TJ) contains both string typed elements ( textArray[i].IsString() returns true ) and hex-string typed elements ( textArray[i].IsHexString() returns true ).

The string typed elements are not a problem. These are simple utf8 texts. For the hex strings I get the weird string I mentioned below. So, I get something like (printed in hex):

00 4c 00 51 00 57 00 55 00 52 00 47 00 58 00 46 00 46 00 4c

If I assume that it is a two-byte encoding, I still don't get what I want. But, if I add 0x1d to every second byte (the low byte of each pair) and I consider it a 16-bit encoding, then it is (printed in hex):

00 69 00 6e 00 74 00 72 00 6f 00 64 00 75 00 63 00 63 00 69
    i     n     t     r     o     d     u     c     c     i

And this is exactly the word I observe at that location when I open the PDF document in Acrobat Reader.

So, I was wondering whether someone already encountered something like that and how it could be solved. I investigated podofotxtextract and used it as a starting point for this application. But, I didn't find anything like that in it.
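To make the arithmetic above concrete, here is a minimal sketch, assuming the buffer really holds big-endian two-byte codes and that the shift of 0x1d seen in this one document applies to every code. DecodeWithOffset is a hypothetical helper, not part of PoDoFo, and the fixed offset is only an observation for this font, not how its encoding is actually defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not PoDoFo API): read the buffer as big-endian
// 16-bit codes and shift each code by a fixed offset.
// Assumption: the offset (0x1d in the example) was only observed for one
// font in one document; it is not a general rule.
std::string DecodeWithOffset(const std::vector<std::uint8_t>& bytes,
                             std::uint16_t offset)
{
    // Precondition of this sketch: the buffer holds whole two-byte codes.
    assert(bytes.size() % 2 == 0);

    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // High byte first, then low byte.
        const std::uint16_t code =
            static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1]);
        const std::uint16_t shifted =
            static_cast<std::uint16_t>(code + offset);
        // Only ASCII results are emitted; anything else becomes '?'.
        out.push_back(shifted < 0x80 ? static_cast<char>(shifted) : '?');
    }
    return out;
}
```

Fed the twenty bytes listed above with an offset of 0x1d, this yields "introducci".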
Best regards,
Jo

-----Original Message-----
From: Dominik Seichter [mailto:[email protected]]
Sent: Wednesday, 4 August 2010 16:03
To: [email protected]; Jo Van der Snickt
Subject: Re: [Podofo-users] Weird buffer returned by PdfString::GetString()

Hello Jo,

So you are extracting strings from the contents stream? These are dependent on the actual encoding of the font being used, so PdfString does not know how to convert them into Unicode.

You might want to look at the podofotxtextract example to see how to do that, but please note: PoDoFo does not support all possible encodings yet. So you will need to add the missing encodings yourself.

Best regards,
Dom

On Monday, 02 August 2010, Jo Van der Snickt wrote:
> Hello,
>
> I'm trying to parse a PDF document to extract all the text. For the
> array type I check each element for its type and I only consider the
> array elements that contain either a string or a hexstring.
>
> For the strings I retrieve the value with
> textArray.GetString().GetStringUtf8() which works just fine. But, for
> the hexstring I get weird results in the buffer.
> To investigate the content of the buffer I used the following piece
> of code:
>
> else if ( textArray[i].IsHexString() )
> {
>     char * ptrHexString = static_cast<char *>( malloc( sizeof(char) *
>         ( textArray[i].GetString().GetLength() + 2 ) ) );
>     memcpy( ptrHexString, textArray[i].GetString().GetString(),
>             textArray[i].GetString().GetLength() );
>
>     for ( int strIndex = 0;
>           strIndex < static_cast<int>(textArray[i].GetString().GetLength());
>           strIndex++ )
>     {
>         cout << setw(2) << setfill('0') << dec << strIndex << ": "
>              << hex << setw(2) << setfill('0')
>              << static_cast<int>(ptrHexString[strIndex]) << " "
>              << static_cast<char>(ptrHexString[strIndex] + 0x1d) << endl;
>     }
>     free( ptrHexString );
> }
>
> This displays something like:
>
> 00: 00
> 01: 4c i
> 02: 00
> 03: 51 n
> 04: 00
> 05: 57 t
> 06: 00
> 07: 55 r
> 08: 00
> 09: 52 o
> 10: 00
> 11: 47 d
> 12: 00
> 13: 58 u
> 14: 00
> 15: 46 c
> 16: 00
> 17: 46 c
> 18: 00
> 19: 4c i
>
> It looks like a two-byte encoding (first byte 0x00), but note that I
> had to add 0x1d to the actual byte to get the character I'm expecting
> (here the text "introducci").
>
> Any idea what I could have done wrong?
> The document that I use to test displays correctly in Acrobat Reader.
>
> - Jo
>
> This e-mail and any attachments thereto may contain information which
> is confidential and/or protected by intellectual property rights and
> are intended for the sole use of the recipient(s) named above. Any
> use of the information contained herein (including, but not limited
> to, total or partial reproduction, communication or distribution in
> any form) by persons other than the designated recipient(s) is
> prohibited. If you have received this e-mail in error, please notify
> the sender either by telephone or by e-mail and delete the material
> from any computer. Thank you for your cooperation.
>
> Dilys bvba
> Nieuwe Stationsstraat 23
> 9160 Lokeren
>
> tel +32 9 356 97 13
> fax +32 9 353 90 11
>
> mailto:[email protected]
> http://www.dilys.be
>
> ------------------------------------------------------------------------------
> The Palm PDK Hot Apps Program offers developers who use the
> Plug-In Development Kit to bring their C/C++ apps to Palm for a share
> of $1 Million in cash or HP Products. Visit us here for more details:
> http://p.sf.net/sfu/dev2dev-palm
> _______________________________________________
> Podofo-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/podofo-users
>

--
**********************************************************************
Dominik Seichter - [email protected]
KRename - http://www.krename.net - Powerful batch renamer for KDE
KBarcode - http://www.kbarcode.net - Barcode and label printing
PoDoFo - http://podofo.sf.net - PDF generation and parsing library
SchafKopf - http://schafkopf.berlios.de - Schafkopf, a card game, for KDE
Alan - http://alan.sf.net - A Turing Machine in Java
**********************************************************************
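Following the advice in the thread, the sketch below shows one way the pieces could fit together: remember the font selected by the most recent Tf operator, and decode the strings of each TJ with that font's own table. It is only a sketch under stated assumptions. TextOp, CodeMap and ExtractText are hypothetical names, not PoDoFo API; the table is filled in by hand, whereas real code would have to build it from the font's encoding information; and every font is assumed to use two-byte codes, which is not true in general.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-font table: two-byte character code -> Unicode code point.
using CodeMap = std::map<std::uint16_t, char32_t>;

// Simplified stand-in for one content-stream operation (not a PoDoFo type).
struct TextOp {
    std::string op;                   // "Tf" or "TJ"
    std::string fontName;             // used by Tf: font resource name, e.g. "F1"
    std::vector<std::uint8_t> bytes;  // used by TJ: raw bytes of the string
};

// Append one Unicode code point to a UTF-8 encoded string.
void AppendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Walk the operations in order; Tf switches the active table, TJ is decoded
// with whatever table is active. Unknown codes become U+FFFD.
std::string ExtractText(const std::vector<TextOp>& ops,
                        const std::map<std::string, CodeMap>& fonts)
{
    std::string out;
    const CodeMap* current = nullptr;  // no font selected yet
    for (const TextOp& op : ops) {
        if (op.op == "Tf") {
            const auto it = fonts.find(op.fontName);
            current = (it != fonts.end()) ? &it->second : nullptr;
        } else if (op.op == "TJ") {
            // Assumption of this sketch: whole two-byte codes only.
            assert(op.bytes.size() % 2 == 0);
            for (std::size_t i = 0; i + 1 < op.bytes.size(); i += 2) {
                const std::uint16_t code = static_cast<std::uint16_t>(
                    (op.bytes[i] << 8) | op.bytes[i + 1]);
                char32_t cp = 0xFFFD;  // replacement character by default
                if (current != nullptr) {
                    const auto hit = current->find(code);
                    if (hit != current->end()) {
                        cp = hit->second;
                    }
                }
                AppendUtf8(out, cp);
            }
        }
    }
    return out;
}
```

The point of keeping one table per font is that the same two bytes can stand for different characters under different fonts, so a single fixed offset such as 0x1d cannot be expected to hold across a whole document.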
