[
https://issues.apache.org/jira/browse/PDFBOX-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994085#comment-13994085
]
Joel Hirsh commented on PDFBOX-2069:
------------------------------------
One more comment:
Using setSpacingTolerance and setAverageCharTolerance does improve the results
with the original PDFBox code. However, even unreasonably large values such as
1.5f and 0.9f respectively do not fix the spacing for all of the text, and they
introduce a new problem: spaces that are visible when Acrobat displays this
file are discarded.
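
The trade-off described above can be illustrated with a toy model. This is a
minimal sketch, not PDFBox's actual algorithm: it assumes, as a simplification,
that a word break is emitted whenever the gap between adjacent glyphs exceeds a
single threshold. The class name GapThresholdSketch, the gap values, and the
thresholds are all invented for illustration and are not measured from the
attached file.

```java
/** Toy model of gap-threshold word splitting; NOT PDFBox's real implementation. */
final class GapThresholdSketch {

    /** Inserts a space after glyph i whenever gaps[i] exceeds the threshold. */
    static String join(String glyphs, float[] gaps, float threshold) {
        if (gaps.length != glyphs.length() - 1) {
            throw new IllegalArgumentException("need one gap per adjacent glyph pair");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length(); i++) {
            out.append(glyphs.charAt(i));
            if (i < gaps.length && gaps[i] > threshold) {
                out.append(' ');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical gaps for the glyphs of "BILL PAYMT". The letter gaps are
        // inflated (2.4 to 3.1) so that they overlap the one genuine word gap
        // (3.0, between the second "L" and "P").
        String glyphs = "BILLPAYMT";
        float[] gaps = {2.4f, 3.1f, 2.6f, 3.0f, 2.5f, 2.8f, 2.7f, 2.6f};

        System.out.println(join(glyphs, gaps, 1.0f));  // B I L L P A Y M T
        System.out.println(join(glyphs, gaps, 3.05f)); // BI LLPAYMT
        System.out.println(join(glyphs, gaps, 4.0f));  // BILLPAYMT
    }
}
```

With these made-up numbers no single threshold recovers "BILL PAYMT": a low
threshold splits every letter, while a raised one still splits "BI" from "LL"
and drops the genuine space. That is the same pattern as the two symptoms
reported in this comment.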
> PDF's with Tc before Tm are getting incorrect spacing in PDFTextArea
> --------------------------------------------------------------------
>
> Key: PDFBOX-2069
> URL: https://issues.apache.org/jira/browse/PDFBOX-2069
> Project: PDFBox
> Issue Type: Bug
> Components: Utilities
> Affects Versions: 1.8.5
> Environment: Windows
> Reporter: Joel Hirsh
> Labels: pdfbox
> Fix For: 2.0.0
>
> Attachments: PDFBOX-2609.pdf, PDFBox-2609-patch.zip
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The attached PDF gets incorrect spacing when run through the example program
> ExtractTextByArea.java, as follows:
> Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
> Transaction Activity
> Date D e s c r i p t i o n Deposits W i t h d r a w a l s
> 0 4 / 0 8 B E G I N N I N G BALANCE
> 04 / 0 8 W I THDRAWAL - ATM 3 1 1 7 3 0 0 . 0 0 -
> 62 M I L L H I L L ROAD WOODSTOCK N Y
> 04 / 1 0 W I THDRAWAL - ACH 2 0 0 . 0 0 -
> HUMAN RIGHTS WAT-B I L L PAYMT
> 04 / 12 C K # 1 2 7 3 11 0 . 0 0 -
> 0 4 / 1 5 W I THDRAWAL - ACH 2 0 2 . 5 7 -
> NEW SOUTH INSURA -B I LL PAYMT
> 04 / 1 5 W I THDRAWAL - ACH 3 6 . 2 6 -
> WASTE CONNECTION-BILL PAYMT
> 04 / 1 7 W I THDRAWAL - ACH 71 2 . 0 0 -
> N PYMT T
> 04 / 1 8 W I THDRAWAL - ACH 2958 9 . 0 0 3
> N PYMT T
> 04 / 1 9 W I THDRAWAL - ACH 76 8 . 1 2 -
> I believe this is because PDF streams with Tc before Tm are having the matrix
> applied to the Tc, which is contrary to my experience with graphics pipelines.
> Most PDF streams seem to have Tc after Tm, and thus do not hit this
> situation.
> I have attached a patch to two files that corrects the problem for this file,
> and also works correctly on my test suite of about 40 files from other
> sources.
> The result for the attached file now becomes:
> Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
> Transaction Activity
> Date Description Deposits Withdrawals
> 04/08 BEGINNING BALANCE
> 04/08 WITHDRAWAL-ATM 3 117 300.00-
> 62 MILL HILL ROAD WOODSTOCK NY
> 04/10 WITHDRAWAL-ACH 200.00-
> HUMAN RIGHTS WAT-BILL PAYMT
> 04/12 CK# 1273 110.00-
> 04/15 WITHDRAWAL-ACH 202.57-
> NEW SOUTH INSURA-BILL PAYMT
> 04/15 WITHDRAWAL-ACH 36.26-
> WASTE CONNECTION-BILL PAYMT
> 04/17 WITHDRAWAL-ACH 712.00-
> N PYMT T
> 04/18 WITHDRAWAL-ACH 29589.00 3
> N PYMT T
> 04/19 WITHDRAWAL-ACH 768.12-