Re: [iText-questions] Differences btw text extraction from iText and Acrobat Reader?

wwkloo Wed, 20 Mar 2013 01:38:35 -0700

wwkloo wrote
> 
> mkl wrote
>> wwkloo,
>> wwkloo wrote
>>> I have a PDF with Asian font
>>> iTextExtract_W.pdf
>>> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf>
>>>   
>>> 
>>> When I extract the text from it through TextRenderInfo.GetText() inside
>>> RenderText() of an implemented ITextExtractionStrategy by calling
>>> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character
>>> for character 1 and correct for character 2. However, when I extract it
>>> using Acrobat Reader XI by copy and paste, both characters are extracted
>>> correctly.
>>> 
>>> 1 is U+20547
>>> 2 is U+92DB
>> You seem to have forgotten to register with the mailing list. Thus, your
>> question was only visible to those who follow the mailing list on nabble
>> which is a small minority.
>> 
>> That being said, let's look at your issue.
>> 
>> Considering the method names you use you seem to be working with
>> iTextSharp in .Net, not with iText in Java. I'm on the Java side, though,
>> thus I inspected your file using Java.
> Yes, I am working with iTextSharp in .Net.
> mkl wrote
>> In Java char is a 16bit type; thus, one cannot expect text extraction to
>> return that first character as 0x20547; instead the UTF16 representation
>> might be expected, i.e. 0xD841 0xDD47.
>> 
>> Thus, I applied iText text extraction to your file:
>> 
>>         PdfReader reader = new PdfReader(TEST_FILE.toString());
>>         String text = PdfTextExtractor.getTextFromPage(reader, 1);
>>         for (char c: text.toCharArray())
>>         {
>>             int i = c; // char is unsigned in Java, so no sign correction is needed
>>             System.out.print("\\u");
>>             System.out.print(Integer.toHexString(i));
>>         }
>> 
>> and retrieved:
>> 
>>         \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20
>> 
>> I.e. "\ud841\udd47" and "\u92db" for your Asian characters.
>> 
>> So everything seems ok in Java. Does the situation differ in .Net?
>> 
>> Regards,   Michael
> Thanks for the code and for trying it.
> I tried something similar in .Net. With UTF16, the 1st Asian
> character returned is 0xFFFD. The 2nd is correct.
> 
> === C# ===
> PdfReader rdr = new PdfReader(ofdFile.FileName);
> ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
> String txt = PdfTextExtractor.GetTextFromPage(rdr, 1, strategy);
> byte[] bs16 = Encoding.Unicode.GetBytes(txt);
> foreach (byte b in bs16)
> {
>       Console.Write("{0:X2} ", b);
> }
> Console.Write("\n");
> === C# ===
> 
> === OUTPUT ===
> 31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00 
> === OUTPUT ===
> 
> Regards
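
To make the byte dump above easier to read, here is a minimal Java sketch that decodes it as little-endian UTF-16 (the byte order that Encoding.Unicode produces in .Net) and lists the code points. The class and method names are hypothetical and only for illustration; nothing here uses iText.

```java
import java.nio.charset.StandardCharsets;

public class HexDumpDecoder {
    // Parses space-separated hex bytes, decodes them as UTF-16LE,
    // and returns the resulting code points formatted as "U+XXXX".
    static String decode(String hexDump) {
        String[] parts = hexDump.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        String text = new String(bytes, StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes printed by the C# snippet above.
        String dump = "31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00";
        System.out.println(decode(dump));
    }
}
```

The third code point comes out as U+FFFD, the Unicode replacement character, where U+20547 was expected. The string returned by the extraction therefore already contains the replacement character; it is not introduced by the byte conversion.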


Additional information:
When I create the PDF with another program, the text is extracted correctly
by both iText and Acrobat Reader XI.
- 1: 0xD841 0xDD47
- 2: 0x92DB

However, the character is not displayed correctly. :(
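
For reference, the two code units listed above for character 1 are the UTF-16 surrogate pair for U+20547. The following is a small sketch using only standard Java library calls; the class and method names are made up for illustration.

```java
public class SurrogateDemo {
    // Formats a code point as the UTF-16 code units used to store it.
    static String toUtf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) unit));
        }
        return sb.toString();
    }

    // Recombines a high/low surrogate pair into the code point it encodes.
    static int fromSurrogates(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Hex(0x20547)); // character 1, outside the BMP
        System.out.println(toUtf16Hex(0x92DB));  // character 2, a single code unit
        System.out.printf("U+%X%n", fromSurrogates((char) 0xD841, (char) 0xDD47));
    }
}
```

Character 2 fits in a single 16-bit unit, which may be why only character 1 is affected: it is the only one that needs a surrogate pair.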

iTextExtract_O.pdf
<http://itext-general.2136553.n4.nabble.com/file/n4657858/iTextExtract_O.pdf>  

Please help!


Regards
wwkloo


