[
https://issues.apache.org/jira/browse/PDFBOX-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
John Hewson updated PDFBOX-816:
-------------------------------
Component/s: Utilities (was: Parsing)
> 1.2.1 - PDFTextStripper* uses different Y values when cropbox has non-zero Y:
> not so for X coordinates.
> -------------------------------------------------------------------------------------------------------
>
> Key: PDFBOX-816
> URL: https://issues.apache.org/jira/browse/PDFBOX-816
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 1.2.1
> Environment: Mac OS X 10.6.4 + JDK 1.6.0_20, and Ubuntu 10.04 (kernel
> 2.6.32-24) with Sun Java 1.6.0_20.
> Reporter: Larry West
> Attachments: TaxReturn-1.pdf
>
> Original Estimate: 31h
> Remaining Estimate: 31h
>
> [First off, kudos to the folks who work on PDFBox. It's got some great
> functionality.]
> The issue is that a cropbox with a non-zero "lower-left-corner" changes the
> positions that PDFTextStripper reports for text, but in the Y coordinate only.
> See page 5 of the attached PDF (which is a phony tax return, not real data).
> As an example, near the top is the tax year, "2009". Using a program such
> as Apple's Preview, one would estimate that a snug bounding rectangle for
> that text would be x=300, y=54, w=41, and h=18.
> And on other PDFs, that would be fine with PDFTextStripperByArea. But this
> PDF has a non-zero-origin cropbox set, one with the alleged lower-left-corner
> at [-24.0, -24.0]. So the region coordinates that PDFTextStripperByArea
> wants to see need to be offset by subtracting -24 from x and y, i.e.,
> yielding x=324, y=78.
> Or so you would think. It turns out that the X coordinate stays the same;
> only the Y coordinate is affected by the cropbox setting.
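As a rough illustration of the arithmetic described above, the sketch below shifts the viewer-estimated rectangle (x=300, y=54, w=41, h=18) by the cropbox lower-left corner [-24.0, -24.0], once on both axes (what the reporter expected) and once on Y only (what reportedly works). The class and method names are hypothetical and it uses only java.awt.geom, not PDFBox itself.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox: illustrates the offset arithmetic only.
public class CropBoxOffsetSketch {

    /** Subtracts the cropbox lower-left corner from the region origin on the chosen axes. */
    public static Rectangle2D shift(Rectangle2D region, Point2D cropLowerLeft,
                                    boolean shiftX, boolean shiftY) {
        double x = region.getX() - (shiftX ? cropLowerLeft.getX() : 0.0);
        double y = region.getY() - (shiftY ? cropLowerLeft.getY() : 0.0);
        return new Rectangle2D.Double(x, y, region.getWidth(), region.getHeight());
    }

    public static void main(String[] args) {
        // Values taken from the report: viewer estimate and cropbox lower-left corner.
        Rectangle2D viewerEstimate = new Rectangle2D.Double(300, 54, 41, 18);
        Point2D cropLowerLeft = new Point2D.Double(-24.0, -24.0);

        // Expected by the reporter: both axes shifted, giving x=324, y=78.
        Rectangle2D expected = shift(viewerEstimate, cropLowerLeft, true, true);
        // Reported to work in practice: only Y shifted, giving x=300, y=78.
        Rectangle2D observed = shift(viewerEstimate, cropLowerLeft, false, true);

        System.out.println("expected origin: " + expected.getX() + ", " + expected.getY());
        System.out.println("observed origin: " + observed.getX() + ", " + observed.getY());
    }
}
```

The two results differ only in X, which is the asymmetry this issue is about.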
> The sample program PrintTextLocations, which, like PDFTextStripperByArea,
> derives from PDFTextStripper, reports both coordinates as being offset by 24
> in its processTextPosition():
> ...
> String[92.0,94.0 fs=12.0 xscale=1.0 height=9.0720005 space=3.3360004
> width=186.71997]U.S. Individual Income Tax Retur
> String[278.71997,94.0 fs=12.0 xscale=1.0 height=9.0720005 space=3.3360004
> width=7.3320007]n
> String[301.0,94.0 fs=18.0 xscale=1.0 height=13.122001 space=5.0040007
> width=30.023987]200
> String[331.024,94.0 fs=18.0 xscale=1.0 height=13.122001 space=5.0040007
> width=10.007996]9
> String[368.0,94.0 fs=8.0 xscale=1.0 height=7.5200005 space=2.2240002
> width=11.559998](99
> String[379.56,94.0 fs=8.0 xscale=1.0 height=7.5200005 space=2.2240002
> width=2.6640015])
> String[399.0,94.0 fs=6.0 xscale=1.0 height=4.5360003 space=1.6680002
> width=36.34201]IRS Use Only
> ...
> (Only the third and fourth entries, "200" and "9", are key; the others are
> for comparison.)
> To make sense of this: 301.0 is close enough to 300 for the X coordinate. If
> I move the region to x=324, I don't get the "200", just the "9".
> Also, the y coordinate of 94 is the bottom of the text, with a height of 18
> that roughly extends up to 76, but 78 works as far as extractRegions() is
> concerned (I think it only cares about the lower-left corner of each
> character).
> So the bounding rectangle reported above for "2009" is lower-left corner
> (301.0, 94.0) to upper-right corner approx (341.03, 76).
> In those coordinates, the region that works with extractRegions() is LL (300,
> 96) to UR (341, 78).
> (Or, the exact Rectangle2D I pass to extractRegions: x=300, y=78, w=41, h=18).
> This applies to any field you choose on this page.
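The observations above can be checked with a small sketch. It assumes, as the reporter guesses, that a run of text is picked up when its reported (x, y) position falls inside the region. That is an assumption about extractRegions(), not confirmed behavior. The class name is hypothetical, the positions are copied from the PrintTextLocations output, and only java.awt.geom is used.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical check, not PDFBox code: tests reported text positions against a region.
public class RegionHitSketch {

    /** Assumption: a text run is "in" the region if its reported (x, y) lies inside it. */
    public static boolean hits(Rectangle2D region, double x, double y) {
        return region.contains(x, y);
    }

    public static void main(String[] args) {
        // Region that works according to the report, and the one shifted in X as well.
        Rectangle2D working = new Rectangle2D.Double(300, 78, 41, 18);
        Rectangle2D shiftedInX = new Rectangle2D.Double(324, 78, 41, 18);

        // Reported positions of the two runs that make up "2009".
        String[] labels = {"200", "9"};
        double[][] positions = {{301.0, 94.0}, {331.024, 94.0}};

        for (int i = 0; i < labels.length; i++) {
            double x = positions[i][0];
            double y = positions[i][1];
            System.out.println(labels[i]
                    + " in working region: " + hits(working, x, y)
                    + ", in X-shifted region: " + hits(shiftedInX, x, y));
        }
    }
}
```

Under that assumption the region at x=300 catches both runs, while the region at x=324 catches only the "9", which matches what the report describes.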
> So:
> (a) it doesn't seem to me that a cropbox has any business changing the
> coordinates. But I could be wrong.
> (b) if it does make sense for a cropbox to affect the coordinates, it should
> do so in both X and Y dimensions, shouldn't it?
> (c) I suppose it would be too much to ask for notes explaining the
> coordinates used for each method, but it's a nice thought.
> I tried looking through PDFTextStripper* but I'm not sufficiently familiar
> with the code to determine where the coordinate perturbation occurs. It
> might be in how PDFStreamEngine.processEncodedText() is using the
> graphicsState (initialized with the cropbox) to transform textMatrixStDisp,
> but that seems to be initialized with a Dimension, so I don't see how an
> offset would affect it.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)