wwkloo,

wwkloo wrote
> I have a PDF with Asian font
> iTextExtract_W.pdf
> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf> 
>  
> 
> When I extract the text from it through TextRenderInfo.GetText() inside
> RenderText() of an implemented ITextExtractionStrategy by calling
> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character for
> character 1 and correct for character 2. However, when I extract it using
> Acrobat Reader XI by copy and paste, both characters are extracted
> correctly.
> 
> 1 is U+20547
> 2 is U+92DB

You seem to have forgotten to register with the mailing list. Thus, your
question was only visible to those who follow the mailing list on Nabble,
which is a small minority.

That being said, let's look at your issue.

Considering the method names you use, you seem to be working with iTextSharp
in .NET, not with iText in Java. I'm on the Java side, though, so I
inspected your file using Java.

In Java, char is a 16-bit type; thus, one cannot expect text extraction to
return that first character as the single value 0x20547; instead, its UTF-16
representation is to be expected, i.e. the surrogate pair 0xD841 0xDD47.
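
To make the surrogate pair arithmetic concrete, here is a minimal sketch
using only the standard library. The class and method names (SurrogateDemo,
utf16Hex) are made up for this illustration and are not part of iText.

```java
class SurrogateDemo {
    // Returns the UTF-16 code units of a code point as lowercase hex,
    // separated by single spaces.
    static String utf16Hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one char for BMP code points and a
        // high/low surrogate pair for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(0x20547)); // prints "d841 dd47"
        System.out.println(utf16Hex(0x92DB));  // prints "92db"
    }
}
```

So your character 1 needs two chars in a Java String, while character 2
fits into one.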

Thus, I applied iText text extraction to your file:

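        // Assuming iText 5.x: com.itextpdf.text.pdf.PdfReader and
        // com.itextpdf.text.pdf.parser.PdfTextExtractor
        PdfReader reader = new PdfReader(TEST_FILE.toString());
        String text = PdfTextExtractor.getTextFromPage(reader, 1);
        // Print every UTF-16 code unit in hex; char is unsigned in Java,
        // so no sign correction is needed.
        for (char c : text.toCharArray())
        {
            System.out.print("\\u");
            System.out.print(Integer.toHexString(c));
        }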

and retrieved:

        \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20

I.e. "\ud841\udd47" and "\u92db" for your Asian characters.
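
To compare this with the code points you listed, the chars can be combined
back into code points. The following is only a sketch with standard library
calls; the class name CodePointDump is hypothetical, and the string literal
in main is simply the output above typed in by hand, not a fresh extraction.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointDump {
    // Walks the string by code point instead of by char, so that a
    // surrogate pair is reported as one supplementary code point.
    static List<String> codePoints(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            result.add(String.format("U+%04X", cp));
            i += Character.charCount(cp); // 2 for a surrogate pair, else 1
        }
        return result;
    }

    public static void main(String[] args) {
        // The extracted text from above, written as a literal.
        String extracted = "1 \ud841\udd47 \n2 \u92db ";
        System.out.println(codePoints(extracted));
    }
}
```

Read this way, the text contains U+20547 and U+92DB, matching the values
you expect for characters 1 and 2.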

So everything seems OK in Java. Does the situation differ in .NET?

Regards,   Michael



