Re: [iText-questions] Differences btw text extraction from iText and Acrobat Reader?

wwkloo Tue, 19 Mar 2013 19:45:58 -0700

mkl wrote
> wwkloo,
> wwkloo wrote
>> I have a PDF with an Asian font
>> iTextExtract_W.pdf
>> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf>
>>   
>> 
>> When I extract the text from it through TextRenderInfo.GetText() inside
>> RenderText() of an implemented ITextExtractionStrategy by calling
>> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character for
>> character 1 and correct for character 2. However, when I extract it using
>> Acrobat Reader XI by copy and paste, both characters are extracted
>> correctly.
>> 
>> 1 is U+20547
>> 2 is U+92DB
> You seem to have forgotten to register with the mailing list. Thus, your
> question was only visible to those who follow the mailing list on nabble
> which is a small minority.
> 
> That being said, let's look at your issue.
> 
> Considering the method names you use you seem to be working with
> iTextSharp in .Net, not with iText in Java. I'm on the Java side, though,
> thus I inspected your file using Java.


Yes, I am working with iTextSharp in .Net.


mkl wrote
> In Java, char is a 16-bit type; thus, one cannot expect text extraction to
> return that first character as 0x20547; instead, its UTF-16 representation
> is to be expected, i.e. the surrogate pair 0xD841 0xDD47.
> 
> Thus, I applied iText text extraction to your file:
> 
>         PdfReader reader = new PdfReader(TEST_FILE.toString());
>         String text = PdfTextExtractor.getTextFromPage(reader, 1);
>         for (char c: text.toCharArray())
>         {
>             int i = c<0 ? Integer.MAX_VALUE + c : c;
>             System.out.print("\\u");
>             System.out.print(Integer.toHexString(i));
>         }
> 
> and retrieved:
> 
>         \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20
> 
> I.e. "\ud841\udd47" and "\u92db" for your Asian characters.
> 
> So everything seems ok in Java. Does the situation differ in .Net?
> 
> Regards,   Michael
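
For reference, the surrogate pair mentioned above can be checked by hand. The following is only a minimal sketch; the class name SurrogateSketch and the helper toSurrogates are made up for illustration, while Character.toChars and Character.isBmpCodePoint are standard Java. It splits U+20547 into its two UTF-16 code units and compares the result with what the JDK produces.

```java
public class SurrogateSketch {
    // Hypothetical helper: splits a supplementary code point (above U+FFFF)
    // into its UTF-16 high and low surrogate.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x20547);
        char[] library = Character.toChars(0x20547);
        System.out.printf("manual:  %04x %04x%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04x %04x%n", (int) library[0], (int) library[1]);
        // U+92DB lies in the Basic Multilingual Plane and needs no surrogates.
        System.out.println("U+92DB in BMP: " + Character.isBmpCodePoint(0x92DB));
    }
}
```

So character 1 needs two 16-bit code units, while character 2 fits into one. That may be why only character 1 is affected.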

Thanks for the code and for trying it.
I tried the equivalent in .Net. With UTF-16, the first Asian character
comes back as 0xFFFD; the second is correct.

=== C# ===
PdfReader rdr = new PdfReader(ofdFile.FileName);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String txt = PdfTextExtractor.GetTextFromPage(rdr, 1, strategy);
byte[] bs16 = Encoding.Unicode.GetBytes(txt);
foreach (byte b in bs16)
{
        Console.Write("{0:X2} ", b);
}
Console.Write("\n");
=== C# ===

=== OUTPUT ===
31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00 
=== OUTPUT ===
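
To read the dump: Encoding.Unicode is little-endian UTF-16, so "FD FF" is the single code unit U+FFFD (the Unicode replacement character), where the surrogate pair D841 DD47 would have appeared as "41 D8 47 DD". The sketch below decodes such a dump into code points. It is written in Java only to match the snippet quoted above, and the class and method names (Utf16LeDumpSketch, parseHex, codePoints) are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeDumpSketch {
    // Parses a space-separated hex dump such as "31 00 20 00" into bytes.
    static byte[] parseHex(String dump) {
        String[] parts = dump.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    // Decodes the bytes as UTF-16LE and lists the code points as U+XXXX.
    static String codePoints(byte[] utf16le) {
        String s = new String(utf16le, StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Bytes as printed by the C# snippet.
        String actual = "31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00";
        // Bytes a correctly extracted U+20547 would produce.
        String expectedFirstChar = "41 D8 47 DD";
        System.out.println(codePoints(parseHex(actual)));
        System.out.println(codePoints(parseHex(expectedFirstChar)));
    }
}
```

The dump alone does not show whether the extracted string already contained U+FFFD or whether it held an unpaired surrogate that GetBytes replaced during encoding.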

Regards



