Frederic Laruelle,

Frederic Laruelle wrote
> Any idea why doing a text parse of the following doc in Java (Groovy):
> 
> def url =
> "http://www.perspecsys.com/wp-content/uploads/2013/02/Java-Developer.pdf";
> def reader = new PdfReader(new URL(url))
> PdfTextExtractor.getTextFromPage(reader, 1)
> 
> returns text that seems mixed up:
> "...Are you n aoutstanding Java Developer looking for an exciting company
> where you can contribute to today's hottest information technology? We are
> currently lookingfo fourr  (4) developers for projects in distributed
> networking, secure servers, and database management
>  You we ilwl obrking with some of the world’s leading cloud solutions, and
> inventing the next generation of cloud s ecurity 
>  Our elite engineering team has immediate openings for experien cored
> junior software engineers with expertise inen terprise server software and
> web application development along withth e capability to become an elite
> member of the te
>  a mWe are loongki for creative, out-­‐of-­‐the-­‐box developers eager to
> tackle difficult problems..."

This is caused by the /ToUnicode map of the font in question, which maps a
single character (glyph) code to multiple Unicode code points (to several
whitespace characters in the case at hand). Presumably this is meant to
offer multiple possible interpretations of that code.

I'm not completely sure but I think that this is not intended by the PDF
specification when it talks about mapping a source code to a string of
destination codes. Instead I think the specification intended this mechanism
to indeed map a single glyph to a string.

This at least is how iText interprets this structure and, therefore,
sometimes mixes up the text.

For example:

In your PDF you see

    Are you an outstanding Java Developer

iText's LocationTextExtractionStrategy parses this as

    Are you n aoutstanding Java Developer

The content stream here contains (somewhat beautified):

[first this for "Are you a"]
    q 
    0.24 0 0 0.24 72 575.76 cm 
    BT 
    0.0103 Tc 
    45 0 0 45 0 0 Tm 
    /F1.1 1 Tf 
    [ (:*) 4 (&) 2 (!) 6 (;) 2 (\(7!) 6 (#) ] TJ 
    ET
    Q 
    
[followed by this for "n outstanding"]
    q 
    0.24 0 0 0.24 114.4746 575.76 cm 
    BT 
    0.0101 Tc 
    45 0 0 45 0 0 Tm 
    /F1.1 1 Tf 
    [ (2!) 6 (\(75) 4 (/) 3 (5) 4 (#) 1 (239) 5 (2<) ] TJ 
    ET 
    Q

The questionable mapping in /ToUnicode is:

    1 beginbfchar
    <21>< 0009 000d 0020 00a0 >
    endbfchar

This makes iText map the character code 0x21 (displayed as '!' in the
stream above) to the sequence of horizontal tab, carriage return, space, and
non-breaking space. When calculating the width of the strings, this makes
iText think the spaces in "Are you a" are wider than they really are. As the
following "n outstanding" is positioned absolutely, iText concludes that the
two strings overlap, i.e. that the trailing "a" (displayed as '#') of the
former is located after the "n " (displayed as '2!') of the latter.
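
To illustrate the effect, here is a rough Java sketch. The start positions
72 and 114.4746 are taken from the two cm operators above; the glyph
advances (5.0 per letter, 2.7 per space) are made-up values, and charging
one space advance per mapped character is only my simplified model of the
miscalculation, not iText's actual code.

```java
final class OverlapSketch {
    // End x of a chunk: start plus letter advances plus space advances.
    // charsPerSpace models how many mapped characters each space glyph is charged for.
    static double chunkEnd(double startX, int letters, int spaces,
                           double letterAdvance, double spaceAdvance, int charsPerSpace) {
        return startX + letters * letterAdvance + spaces * charsPerSpace * spaceAdvance;
    }

    public static void main(String[] args) {
        double firstStart = 72.0;       // x offset of "Are you a" (first cm operator)
        double secondStart = 114.4746;  // x offset of "n outstanding" (second cm operator)
        double letter = 5.0;            // hypothetical letter advance
        double space = 2.7;             // hypothetical space advance

        // "Are you a" consists of 7 letters and 2 spaces.
        double realEnd = chunkEnd(firstStart, 7, 2, letter, space, 1);
        double inflatedEnd = chunkEnd(firstStart, 7, 2, letter, space, 4);

        System.out.println("real end overlaps next chunk:     " + (realEnd > secondStart));
        System.out.println("inflated end overlaps next chunk: " + (inflatedEnd > secondStart));
    }
}
```

With these assumed numbers only the inflated end lies beyond the start of
the second chunk, which is the kind of false overlap described above.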

The PDF specification says on this topic:

    To support mappings from a source code to a string of destination
    codes, this extension has been made to the ranges defined after a
    beginbfchar operator:

    n beginbfchar
    srcCode dstString
    endbfchar

    where dstString may be a string of up to 512 bytes.

referencing the sample

    1 beginbfchar
    <3A51> <D840DC3E>
    endbfchar

    [...] the character code <3A 51> is mapped to the Unicode value
    U+2003E, which is expressed by the byte sequence <D840DC3E> in
    UTF-16BE encoding.
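
To make the difference concrete, the following small Java sketch decodes
both destination strings as UTF-16BE. The class and method names are my own
and have nothing to do with iText; the sketch only shows that the sample
from the specification yields a single code point while the mapping from
the PDF at hand yields four.

```java
import java.nio.charset.StandardCharsets;

final class DstStringDecode {
    // Decodes the hex digits of a bfchar dstString as UTF-16BE, ignoring whitespace.
    static String decode(String hex) {
        String digits = hex.replaceAll("\\s+", "");
        byte[] bytes = new byte[digits.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String sample = decode("D840DC3E");             // sample from the specification
        String fromPdf = decode("0009 000d 0020 00a0"); // mapping of <21> in the PDF at hand

        System.out.println(Integer.toHexString(sample.codePointAt(0)));   // 2003e
        System.out.println(sample.codePointCount(0, sample.length()));    // 1
        System.out.println(fromPdf.codePointCount(0, fromPdf.length()));  // 4
    }
}
```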

Thus, I think iText is right to assume that in the situation above '!' is to
be interpreted as a four-character string, while the document is wrong to
offer alternative interpretations that way.

That said, iText is wrong to use the text resulting from the /ToUnicode
mapping for calculating the width of the displayed glyphs. Mapping glyph
codes to Unicode characters using /ToUnicode and then back to glyphs using
the font encoding need not yield the glyphs it started with, so the widths
obtained from that double mapping are unreliable. Instead, the
TextRenderInfo objects should also carry the original glyph codes and use
them for width calculation (and also for splitting in
getCharacterRenderInfos, which only makes sense when done glyph by glyph).
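
As a rough illustration of that idea (not iText code; the class, the tables
and the width values are hypothetical), a chunk could keep its original
character codes, derive its text via /ToUnicode, and look up its width per
code:

```java
import java.util.Map;

final class GlyphChunkSketch {
    // Hypothetical /ToUnicode table: character code -> Unicode string.
    static final Map<Integer, String> TO_UNICODE = Map.of(
            0x21, "\t\r \u00a0", // the four-character mapping discussed above
            0x23, "a");

    // Hypothetical glyph widths in 1/1000 text space units, keyed by character code.
    static final Map<Integer, Integer> GLYPH_WIDTHS = Map.of(
            0x21, 250,
            0x23, 500);

    // The text is derived from the codes via /ToUnicode.
    static String text(int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(TO_UNICODE.get(code));
        }
        return sb.toString();
    }

    // The width is summed per original code, independent of the mapped text.
    static int width(int[] codes) {
        int total = 0;
        for (int code : codes) {
            total += GLYPH_WIDTHS.get(code);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] codes = {0x21, 0x23}; // '!' and '#', i.e. a space glyph and "a"
        System.out.println(text(codes).length()); // 5 characters of text
        System.out.println(width(codes));         // 750, the width of two glyphs
    }
}
```

Here the text has five characters but the width is still that of two
glyphs, because it never goes through the Unicode text.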

Regards,   Michael 


