[ 
https://issues.apache.org/jira/browse/PDFBOX-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2069:
------------------------------------
    Description: 
Attached PDF is getting incorrect spacing using example program 
ExtractTextByArea.java as follows:

Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
Transaction Activity
Date D e s c r i p t i o n Deposits W i t h d r a w a l s
0 4 / 0 8  B E G I N N I N G  BALANCE
04 / 0 8  W I THDRAWAL - ATM  3 1 1 7 3 0 0 . 0 0 -
62 M I L L  H I L L  ROAD WOODSTOCK N Y
04 / 1 0  W I THDRAWAL - ACH 2 0 0 . 0 0 -
HUMAN RIGHTS WAT-B I L L  PAYMT
04 / 12  C K #  1 2 7 3 11 0 . 0 0 -
0 4 / 1 5  W I THDRAWAL - ACH 2 0 2 . 5 7 -
NEW SOUTH INSURA -B I LL PAYMT
04 / 1 5  W I THDRAWAL - ACH 3 6 . 2 6 -
WASTE CONNECTION-BILL PAYMT
04 / 1 7  W I THDRAWAL - ACH 71 2 . 0 0 -
N  PYMT T
04 / 1 8  W I THDRAWAL - ACH 2958 9 . 0 0 3
N  PYMT T
04 / 1 9  W I THDRAWAL - ACH 76 8 . 1 2 -

I believe this is because PDF streams with Tc before Tm are having the matrix 
applied to the Tc, which is contrary to my experience with graphic pipelines.  
Most PDF streams seem to have Tc after Tm, and thus do not hit this 
situation.
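
For reference, the glyph advance model in the PDF specification treats Tc as 
a text-state parameter measured in unscaled text space, so the order of the 
Tc and Tm operators should not change the result. The sketch below 
illustrates that model only. It is not PDFBox code and not the attached 
patch. The class and method names are hypothetical, the text matrix is 
reduced to a single horizontal scale factor, and word spacing and horizontal 
scaling are left at their defaults.

```java
// Hypothetical sketch of the PDF text-state model; not taken from PDFBox.
final class TextStateSketch {
    private double charSpacing = 0.0; // Tc, in unscaled text space units
    private double fontSize = 1.0;    // Tfs
    private double tmScaleX = 1.0;    // horizontal scale of Tm (simplified)

    // "Tc" operator: only stores the value, no matrix is applied here.
    void setCharSpacing(double tc) { charSpacing = tc; }

    // "Tm" operator, reduced to its horizontal scale for this sketch.
    void setTextMatrixScaleX(double a) { tmScaleX = a; }

    void setFontSize(double size) { fontSize = size; }

    // Advance in text space: w0 * Tfs + Tc (Tw = 0, Th = 1 assumed).
    double advanceInTextSpace(double glyphWidth) {
        return glyphWidth * fontSize + charSpacing;
    }

    // The matrix is applied once, when the glyph is shown.
    double advanceOnPage(double glyphWidth) {
        return advanceInTextSpace(glyphWidth) * tmScaleX;
    }

    public static void main(String[] args) {
        TextStateSketch tcFirst = new TextStateSketch();
        tcFirst.setCharSpacing(0.25);       // Tc before Tm
        tcFirst.setTextMatrixScaleX(8.0);

        TextStateSketch tmFirst = new TextStateSketch();
        tmFirst.setTextMatrixScaleX(8.0);   // Tm before Tc
        tmFirst.setCharSpacing(0.25);

        // Both orders are expected to give the same advance.
        System.out.println(tcFirst.advanceOnPage(0.5));
        System.out.println(tmFirst.advanceOnPage(0.5));
    }
}
```

In this model a glyph of width 0.5 advances by the same amount whichever 
operator comes first, because Tc is only combined with the matrix at the 
moment the glyph is shown.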

I have attached a patch to two files that corrects the problem for this file, 
and also works correctly on my test suite of about 40 files from other sources. 
 

The result for the attached file now becomes:
Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
Transaction  Activity
Date  Description Deposits  Withdrawals
04/08  BEGINNING  BALANCE
04/08  WITHDRAWAL-ATM  3 117 300.00-
62 MILL  HILL  ROAD  WOODSTOCK  NY
04/10  WITHDRAWAL-ACH 200.00-
HUMAN RIGHTS  WAT-BILL  PAYMT
04/12  CK#  1273 110.00-
04/15  WITHDRAWAL-ACH 202.57-
NEW SOUTH  INSURA-BILL  PAYMT
04/15  WITHDRAWAL-ACH 36.26-
WASTE CONNECTION-BILL  PAYMT
04/17  WITHDRAWAL-ACH 712.00-
N  PYMT T
04/18  WITHDRAWAL-ACH 29589.00 3
N  PYMT T
04/19  WITHDRAWAL-ACH 768.12-

  was:
Attached PDF is getting incorrect spacing using example program 
ExtractTrextByArea.java as follows:

Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
Transaction Activity
Date D e s c r i p t i o n Deposits W i t h d r a w a l s
0 4 / 0 8  B E G I N N I N G  BALANCE
04 / 0 8  W I THDRAWAL - ATM  3 1 1 7 3 0 0 . 0 0 -
62 M I L L  H I L L  ROAD WOODSTOCK N Y
04 / 1 0  W I THDRAWAL - ACH 2 0 0 . 0 0 -
HUMAN RIGHTS WAT-B I L L  PAYMT
04 / 12  C K #  1 2 7 3 11 0 . 0 0 -
0 4 / 1 5  W I THDRAWAL - ACH 2 0 2 . 5 7 -
NEW SOUTH INSURA -B I LL PAYMT
04 / 1 5  W I THDRAWAL - ACH 3 6 . 2 6 -
WASTE CONNECTION-BILL PAYMT
04 / 1 7  W I THDRAWAL - ACH 71 2 . 0 0 -
N  PYMT T
04 / 1 8  W I THDRAWAL - ACH 2958 9 . 0 0 3
N  PYMT T
04 / 1 9  W I THDRAWAL - ACH 76 8 . 1 2 -

I believe this is because PDF streams with Tc before Tm are having the matrix 
applied to the Tc, which is contrary to my experience with graphic pipelines.  
Most PDF streams seem to have Tc after Tm, and thus do not hit this 
situation.

I have attached a patch to two files that corrects the problem for this file, 
and also works correctly on my test suite of about 40 files from other sources. 
 

The result for the attached file now becomes:
Text in the area:java.awt.Rectangle[x=10,y=500,width=600,height=200]
Transaction  Activity
Date  Description Deposits  Withdrawals
04/08  BEGINNING  BALANCE
04/08  WITHDRAWAL-ATM  3 117 300.00-
62 MILL  HILL  ROAD  WOODSTOCK  NY
04/10  WITHDRAWAL-ACH 200.00-
HUMAN RIGHTS  WAT-BILL  PAYMT
04/12  CK#  1273 110.00-
04/15  WITHDRAWAL-ACH 202.57-
NEW SOUTH  INSURA-BILL  PAYMT
04/15  WITHDRAWAL-ACH 36.26-
WASTE CONNECTION-BILL  PAYMT
04/17  WITHDRAWAL-ACH 712.00-
N  PYMT T
04/18  WITHDRAWAL-ACH 29589.00 3
N  PYMT T
04/19  WITHDRAWAL-ACH 768.12-


> PDF's with Tc before Tm are getting incorrect spacing in PDFTextArea
> --------------------------------------------------------------------
>
>                 Key: PDFBOX-2069
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2069
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.5
>         Environment: Windows
>            Reporter: Joel Hirsh
>              Labels: pdfbox
>         Attachments: PDFBOX-2609.pdf, PDFBox-2609-patch.zip
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
