Frederic Laruelle,

Frederic Laruelle wrote
> Any idea why doing a text parse of the following doc in Java (Groovy):
>
> def url =
> "http://www.perspecsys.com/wp-content/uploads/2013/02/Java-Developer.pdf"
> def reader = new PdfReader(new URL(url))
> PdfTextExtractor.getTextFromPage(reader, 1)
>
> returns text that seems mixed up:
> "...Are you n aoutstanding Java Developer looking for an exciting company
> where you can contribute to today's hottest information technology? We are
> currently lookingfo fourr (4) developers for projects in distributed
> networking, secure servers, and database management
> You we ilwl obrking with some of the world’s leading cloud solutions, and
> inventing the next generation of cloud s ecurity
> Our elite engineering team has immediate openings for experien cored
> junior software engineers with expertise inen terprise server software and
> web application development along withth e capability to become an elite
> member of the te
> a mWe are loongki for creative, out-‐of-‐the-‐box developers eager to
> tackle difficult problems..."

This is due to the /ToUnicode mapping of the font in question, which maps a single character (glyph) code to multiple codes (to several whitespace characters in the case at hand). This seems to be done to offer multiple possible interpretations of the code.

I'm not completely sure, but I think that this is not what the PDF specification intends when it talks about mapping a source code to a string of destination codes. Instead, I think the specification intended this mechanism to map a single glyph to a string. This, at least, is how iText interprets this structure, and this is why it sometimes mixes up the text.

For example, in your PDF you see

    Are you an outstanding Java Developer

iText's LocationTextExtractionStrategy parses this as

    Are you n aoutstanding Java Developer

The content stream here contains (somewhat beautified):

[first this for "Are you a"]

    q 0.24 0 0 0.24 72 575.76 cm
    BT
    0.0103 Tc
    45 0 0 45 0 0 Tm
    /F1.1 1 Tf
    [ (:*) 4 (&) 2 (!) 6 (;) 2 (\(7!) 6 (#) ] TJ
    ET
    Q

[followed by this for "n outstanding"]

    q 0.24 0 0 0.24 114.4746 575.76 cm
    BT
    0.0101 Tc
    45 0 0 45 0 0 Tm
    /F1.1 1 Tf
    [ (2!) 6 (\(75) 4 (/) 3 (5) 4 (#) 1 (239) 5 (2<) ] TJ
    ET
    Q

The questionable mapping in /ToUnicode is:

    1 beginbfchar
    <21>< 0009 000d 0020 00a0 >
    endbfchar

This makes iText map the character code 0x21 (displayed as '!' in the stream above) to the sequence of horizontal tab, carriage return, space, and non-breaking space. When iText calculates the width of the strings, this makes it think the spaces in "Are you a" are wider than they really are. As the following "n outstanding" is positioned absolutely, iText concludes that the two strings overlap, i.e. that the trailing "a" (displayed as '#') of the former is located after the "n " (displayed as '2!') of the latter.
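
To make the effect of the mapping above more tangible, here is a minimal, self-contained Java sketch. It is not iText code: the class and method names are made up for illustration, and the widths (250 units for the glyph with code 0x21, 250 units each for space and non-breaking space, 0 for tab and carriage return) are invented numbers, not values read from the font in your PDF. It merely shows that the destination string decodes to four characters and that a width summed over the decoded text can differ from the width of the one glyph that was actually drawn.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

class ToUnicodeWidthSketch {

    /** Decodes the hex digits of a bfchar destination string as UTF-16BE text. */
    static String decodeDst(String hex) {
        String digits = hex.replaceAll("\\s+", "");
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    /** Width looked up by the original glyph code: one code, one width. */
    static int widthByCode(int code, Map<Integer, Integer> widthsByCode) {
        return widthsByCode.getOrDefault(code, 0);
    }

    /** Width summed over the decoded text: every decoded character contributes. */
    static int widthByText(String text, Map<Character, Integer> widthsByChar) {
        int sum = 0;
        for (char c : text.toCharArray()) {
            sum += widthsByChar.getOrDefault(c, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Destination string of the bfchar entry for source code <21>.
        String mapped = decodeDst("0009 000d 0020 00a0");
        System.out.println("decoded length: " + mapped.length()); // 4

        // Hypothetical widths, chosen only to illustrate the mismatch.
        Map<Integer, Integer> byCode = Map.of(0x21, 250);
        Map<Character, Integer> byChar = Map.of(' ', 250, '\u00a0', 250);

        System.out.println("width by glyph code:   " + widthByCode(0x21, byCode));   // 250
        System.out.println("width by decoded text: " + widthByText(mapped, byChar)); // 500
    }
}
```

With these assumed numbers the text-based calculation reports twice the width of the glyph. A discrepancy of that kind, accumulated over the spaces in "Are you a", would be enough to push the computed end of that string past the start of the absolutely positioned "n outstanding".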

The PDF specification says on this topic:

    To support mappings from a source code to a string of destination codes, this extension has been made to the ranges defined after a beginbfchar operator:

        n beginbfchar
        srcCode dstString
        endbfchar

    where dstString may be a string of up to 512 bytes.

referencing the sample

    1 beginbfchar
    <3A51> <D840DC3E>
    endbfchar

    [...] the character code <3A 51> is mapped to the Unicode value U+2003E, which is expressed by the byte sequence <D840DC3E> in UTF-16BE encoding.

Thus, I think iText is right to assume that in the situation above '!' is to be interpreted as a four-character string, while the document is wrong to offer alternative interpretations that way.

That being said, iText is wrong when it uses the text resulting from the /ToUnicode mapping to calculate the width of the displayed glyphs: first mapping glyph codes to Unicode characters using /ToUnicode and then mapping back to glyphs using the font encoding need not result in the same glyph it started with. Thus, using the widths of the glyphs resulting from that double mapping is wrong. Instead, the TextRenderInfo objects should also transport the original glyph codes and use them for the width calculation (and also for splitting up using getCharacterRenderInfos, which only makes sense when done glyph-wise).

Regards, Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Mixed-up-text-tp4657916p4657987.html