mkl wrote
> wwkloo,
> wwkloo wrote
>> I have a PDF with Asian font
>> iTextExtract_W.pdf
>> <http://itext-general.2136553.n4.nabble.com/file/n4657836/iTextExtract_W.pdf>
>>
>>
>> When I extract the text from it through TextRenderInfo.GetText() inside
>> RenderText() of an implemented ITextExtractionStrategy by calling
>> PdfTextExtractor.GetTextFromPage(), it returns an incorrect character for
>> character 1 and correct for character 2. However, when I extract it using
>> Acrobat Reader XI by copy and paste, both characters are extracted
>> correctly.
>>
>> 1 is U+20547
>> 2 is U+92DB
> You seem to have forgotten to register with the mailing list. Thus, your
> question was only visible to those who follow the mailing list on nabble
> which is a small minority.
>
> That being said, let's look at your issue.
>
> Considering the method names you use you seem to be working with
> iTextSharp in .Net, not with iText in Java. I'm on the Java side, though,
> thus I inspected your file using Java.
Yes, I am working with iTextSharp in .Net.
mkl wrote
> In Java char is a 16-bit type; thus, one cannot expect text extraction to
> return that first character as 0x20547; instead the UTF-16 representation
> might be expected, i.e. 0xD841 0xDD47.
>
> Thus, I applied iText text extraction to your file:
>
> PdfReader reader = new PdfReader(TEST_FILE.toString());
> String text = PdfTextExtractor.getTextFromPage(reader, 1);
> for (char c : text.toCharArray())
> {
>     int i = c<0 ? Integer.MAX_VALUE + c : c;
>     System.out.print("\\u");
>     System.out.print(Integer.toHexString(i));
> }
>
> and retrieved:
>
> \u31\u20\ud841\udd47\u20\ua\u32\u20\u92db\u20
>
> I.e. "\ud841\udd47" and "\u92db" for your Asian characters.
>
> So everything seems ok in Java. Does the situation differ in .Net?
>
> Regards, Michael
Thanks for the code and for trying it out.
I tried the same thing in .Net. Encoded as UTF-16, the first Asian
character comes back as 0xFFFD; the second is correct.
=== C# ===
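// needs: using System; using System.Text;
//        using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader rdr = new PdfReader(ofdFile.FileName);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
String txt = PdfTextExtractor.GetTextFromPage(rdr, 1, strategy);
// Encoding.Unicode is UTF-16 little-endian: each char is dumped low byte first
byte[] bs16 = Encoding.Unicode.GetBytes(txt);
foreach (byte b in bs16)
{
    Console.Write("{0:X2} ", b);
}
Console.Write("\n");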
=== C# ===
=== OUTPUT ===
31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00
=== OUTPUT ===
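
To make the byte dump easier to compare with the Java result, here is a
small stand-alone Java sketch (not part of iText or iTextSharp; the class and
method names are made up for illustration). It decodes the dump above as
UTF-16LE and prints the code points, then prints the bytes that U+20547
would be expected to produce as a surrogate pair in the same encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, only for checking the dump above.
class Utf16DumpCheck {

    // Parses a space-separated hex dump such as "31 00 20 00" into bytes.
    static byte[] parseHex(String dump) {
        String[] parts = dump.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    // Formats every code point of a string as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    // Formats the UTF-16LE encoding of a single code point as hex bytes.
    static String utf16leHex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_16LE);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Bytes exactly as printed by the C# snippet above.
        String observed = "31 00 20 00 FD FF 20 00 0A 00 32 00 20 00 DB 92 20 00";
        String text = new String(parseHex(observed), StandardCharsets.UTF_16LE);
        System.out.println(codePoints(text));
        // prints: U+0031 U+0020 U+FFFD U+0020 U+000A U+0032 U+0020 U+92DB U+0020

        // What the first character should look like in the same dump format.
        System.out.println(utf16leHex(0x20547));
        // prints: 41 D8 47 DD
    }
}
```

So in place of the expected 41 D8 47 DD the .Net dump has the single code
unit FD FF, i.e. U+FFFD, the Unicode replacement character. As far as I
understand, Encoding.Unicode may itself substitute U+FFFD for an unpaired
surrogate, so the dump alone probably does not show whether the extracted
string already contains U+FFFD or only one half of the pair.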
Regards