Re: [iText-questions] [SPAM] Re: Mixed up text

Frederic Laruelle Tue, 09 Apr 2013 07:40:13 -0700

Wow, that's quite an explanation ;-)
Thanks, Michael.
So it seems there is no way to "fix this" with the current version of iText,
correct?


On Tue, Apr 9, 2013 at 1:30 AM, mkl <[email protected]> wrote:

> Frederic Laruelle,
>
> Frederic Laruelle wrote
> > Any idea why doing a text parse of the following doc in Java (Groovy):
> >
> > import com.itextpdf.text.pdf.PdfReader
> > import com.itextpdf.text.pdf.parser.PdfTextExtractor
> >
> > def url = "http://www.perspecsys.com/wp-content/uploads/2013/02/Java-Developer.pdf"
> > def reader = new PdfReader(new URL(url))
> > PdfTextExtractor.getTextFromPage(reader, 1)
> >
> > returns text that seems mixed up:
> > "...Are you n aoutstanding Java Developer looking for an exciting company
> > where you can contribute to today's hottest information technology? We are
> > currently lookingfo fourr  (4) developers for projects in distributed
> > networking, secure servers, and database management
> >  You we ilwl obrking with some of the world’s leading cloud solutions, and
> > inventing the next generation of cloud s ecurity
> >  Our elite engineering team has immediate openings for experien cored
> > junior software engineers with expertise inen terprise server software and
> > web application development along withth e capability to become an elite
> > member of the te
> >  a mWe are loongki for creative, out-‐of-‐the-‐box developers eager to
> > tackle difficult problems..."
>
> This is due to the /ToUnicode mapping of the font in question, which maps a
> single character (glyph) code to multiple Unicode code points (several
> whitespace characters in the case at hand). This seems to be done to offer
> multiple possible interpretations of the code.
>
> I'm not completely sure, but I think this is not what the PDF specification
> intends when it talks about mapping a source code to a string of
> destination codes. Instead, I think the specification intended this
> mechanism to indeed map a single glyph to a string.
>
> This at least is how iText interprets this structure and, therefore,
> sometimes mixes up the text.
>
> For example:
>
> In your PDF you see
>
>     Are you an outstanding Java Developer
>
> iText's LocationTextExtractionStrategy parses this as
>
>     Are you n aoutstanding Java Developer
>
> The content stream here contains (somewhat beautified):
>
> [first this for "Are you a"]
>     q
>     0.24 0 0 0.24 72 575.76 cm
>     BT
>     0.0103 Tc
>     45 0 0 45 0 0 Tm
>     /F1.1 1 Tf
>     [ (:*) 4 (&) 2 (!) 6 (;) 2 (\(7!) 6 (#) ] TJ
>     ET
>     Q
>
> [followed by this for "n outstanding"]
>     q
>     0.24 0 0 0.24 114.4746 575.76 cm
>     BT
>     0.0101 Tc
>     45 0 0 45 0 0 Tm
>     /F1.1 1 Tf
>     [ (2!) 6 (\(75) 4 (/) 3 (5) 4 (#) 1 (239) 5 (2<) ] TJ
>     ET
>     Q
>
> The questionable mapping in /ToUnicode is:
>
>     1 beginbfchar
>     <21>< 0009 000d 0020 00a0 >
>     endbfchar
>
> This makes iText map the character code 0x21 (displayed as '!' in the
> streams above) to the sequence of horizontal tab, carriage return, space,
> and non-breaking space. When calculating the width of the strings, this
> makes iText think the spaces in "Are you a" are wider than they really are.
> As the following "n outstanding" is positioned absolutely, iText thinks
> that those strings overlap, i.e. that the trailing "a" (displayed as '#')
> of the former is located after the "n " (displayed as '2!') of the latter.
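
A minimal Java sketch of the reading described above: the destination of the
quoted bfchar entry is decoded as one UTF-16BE string, so the single code 0x21
yields four characters. The class and method names are made up for
illustration; this is not iText code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of iText: decodes the hex destination of a
// bfchar entry as a single UTF-16BE string.
final class BfCharDemo {
    static String decodeDestination(String hex) {
        // Drop the whitespace between the hex digits, e.g. "0009 000d" -> "0009000d".
        String digits = hex.replaceAll("\\s+", "");
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Destination of the <21> entry quoted above.
        String mapped = decodeDestination("0009 000d 0020 00a0");
        // One glyph code becomes four characters: tab, CR, space, NBSP.
        System.out.println(mapped.length());
    }
}
```

Read this way, a text extractor that measures the mapped string instead of the
single glyph sees four characters where the page shows one space.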
>
> The PDF specification says on this topic:
>
>     To support mappings from a source code to a string of destination codes,
>     this extension has been made to the ranges defined after a beginbfchar
>     operator:
>
>     n beginbfchar
>     srcCode dstString
>     endbfchar
>
>     where dstString may be a string of up to 512 bytes.
>
> referencing the sample
>
>     1 beginbfchar
>     <3A51> <D840DC3E>
>     endbfchar
>
>     [...] the character code <3A 51> is mapped to the Unicode value U+2003E,
>     which is expressed by the byte sequence <D840DC3E> in UTF-16BE encoding.
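
The specification's sample can be checked with a few lines of Java. This is
only a sketch with a made-up class name: it decodes the four bytes of
<D840DC3E> as UTF-16BE and reads back the resulting code point.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: shows that the byte sequence D8 40 DC 3E is a UTF-16BE
// surrogate pair encoding a single code point.
final class SurrogatePairDemo {
    static int firstCodePoint(byte[] utf16be) {
        String s = new String(utf16be, StandardCharsets.UTF_16BE);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] dst = { (byte) 0xD8, 0x40, (byte) 0xDC, 0x3E };
        // Prints 2003e, i.e. U+2003E.
        System.out.println(Integer.toHexString(firstCodePoint(dst)));
    }
}
```

Here the destination string holds two UTF-16 code units but only one
character, which is the kind of mapping the sample in the specification shows.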
>
> Thus, I think iText is right to assume that in the situation above '!' is
> to be interpreted as a four-character string, while the document is wrong
> to offer alternative interpretations that way.
>
> That said, iText is wrong to use the text resulting from the /ToUnicode
> mapping for calculating the width of the displayed glyphs. Mapping glyph
> codes to Unicode characters using /ToUnicode and then back to glyphs using
> the font encoding need not yield the glyph it started with; thus, using the
> widths of the glyphs resulting from that double mapping is wrong. Instead,
> the TextRenderInfo objects should also carry the original glyph codes and
> use them for width calculation (and also for splitting via
> getCharacterRenderInfos, which only makes sense when done glyph-wise).
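
The difference between the two ways of measuring can be sketched in Java. This
is a toy model, not iText's implementation: the class name, the glyph widths,
and the reverse encoding table are all invented for illustration. Only the
mapping of code 0x21 to four whitespace characters comes from the PDF
discussed above.

```java
import java.util.Map;

// Toy model, not iText code. Widths and the reverse encoding are invented.
final class WidthDemo {
    // Hypothetical glyph widths, keyed by character code.
    static final Map<Integer, Integer> WIDTHS = Map.of(0x21, 250, 0x23, 500);
    // /ToUnicode as in the thread: 0x21 maps to tab, CR, space, NBSP.
    static final Map<Integer, String> TO_UNICODE =
            Map.of(0x21, "\t\r \u00a0", 0x23, "a");
    // Hypothetical font encoding used to map text back to character codes.
    static final Map<Character, Integer> ENCODING =
            Map.of(' ', 0x21, '\u00a0', 0x21, 'a', 0x23);

    // Width taken directly from the codes found in the content stream.
    static int widthFromCodes(int[] codes) {
        int total = 0;
        for (int code : codes) {
            total += WIDTHS.getOrDefault(code, 0);
        }
        return total;
    }

    // Width taken after the double mapping: code -> text -> code.
    static int widthViaText(int[] codes) {
        int total = 0;
        for (int code : codes) {
            String text = TO_UNICODE.getOrDefault(code, "");
            for (char c : text.toCharArray()) {
                Integer back = ENCODING.get(c);
                if (back != null) {
                    total += WIDTHS.getOrDefault(back, 0);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] codes = { 0x21, 0x23 };                 // "!#" in the stream, shown as " a"
        System.out.println(widthFromCodes(codes));    // 750
        System.out.println(widthViaText(codes));      // 1000
    }
}
```

With these invented numbers the double mapping counts the space glyph twice,
so the string is measured wider than it is drawn, which is the kind of error
that makes consecutive chunks appear to overlap.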
>
> Regards,   Michael
>
>
>
> --
> View this message in context:
> http://itext-general.2136553.n4.nabble.com/Mixed-up-text-tp4657916p4657987.html
> Sent from the iText - General mailing list archive at Nabble.com.
>
>
> _______________________________________________
> iText-questions mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/itext-questions
>
> iText(R) is a registered trademark of 1T3XT BVBA.
> Many questions posted to this list can (and will) be answered with a
> reference to the iText book: http://www.itextpdf.com/book/
> Please check the keywords list before you ask for examples:
> http://itextpdf.com/themes/keywords.php

