[
https://issues.apache.org/jira/browse/PDFBOX-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Fred Yu updated PDFBOX-5492:
----------------------------
Description:
With PDFBox version 2.0.6, the following code extracts the text from the
PDF file attached to this issue:
import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Extract the text of page 1 only; the document is closed in all cases.
def getTextFromPdf(filename: String): Option[String] = {
  val doc: PDDocument = PDDocument.load(new File(filename))
  try {
    val stripper = new PDFTextStripper
    stripper.setStartPage(1)
    stripper.setEndPage(1)
    Some(stripper.getText(doc))
  } finally {
    doc.close()
  }
}
Output:
...........
* (1) Written Premium Collected by the Bank 0.00US$ 0.00US$ 0.00US$ 0.00US$ 0.00US$ 0.00US$
(2) Increase (Decrease) in Uearned Premium Reserve 0.00US$ (72.04)US$ (72.04)US$ 0.00US$ (272.31)US$ (272.31)US$
(3) Earned Premium ((Reinsurance Premium) (1)- (2)) 0.00US$ 72.04US$ 72.04US$ 0.00US$ 272.31US$ 272.31US$
(4) Currency Tax (Impuesto Divisas) [2% of (3)] 0.00US$ 1.44US$ 1.44US$ 0.00US$ 5.45US$ 5.45US$
(5) Ceding Allowance [5.8% of (3)] $ 0.00 0.00US$ 4.18US$ 4.18US$ 0.00US$ 15.79US$ 15.79US$
.........
Expected: all monetary fields should be extracted in the correct order, with
the currency symbol before the amount, for example:
* Written Premium Collected by the Bank US$ 0.00 US$0.00 US$0.00 US$0.00 US$0.00 US$0.00
> The order of text extracted from PDF by PDFTextStripper is incorrect.
> ---------------------------------------------------------------------
>
> Key: PDFBOX-5492
> URL: https://issues.apache.org/jira/browse/PDFBOX-5492
> Project: PDFBox
> Issue Type: Bug
> Components: Documentation, PDModel
> Affects Versions: 2.0.26
> Environment: Windows 11 + Intellij + Spark3.12 + scala2.12
> Reporter: Fred Yu
> Priority: Major
> Attachments: sample.pdf
>
>
> With PDFBox version 2.0.6, the following code extracts the text from the
> PDF file attached to this issue:
> import java.io.File
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.text.PDFTextStripper
>
> // Extract the text of page 1 only; the document is closed in all cases.
> def getTextFromPdf(filename: String): Option[String] = {
>   val doc: PDDocument = PDDocument.load(new File(filename))
>   try {
>     val stripper = new PDFTextStripper
>     stripper.setStartPage(1)
>     stripper.setEndPage(1)
>     Some(stripper.getText(doc))
>   } finally {
>     doc.close()
>   }
> }
>
> Output:
> ...........
> * (1) Written Premium Collected by the Bank 0.00US$ 0.00US$ 0.00US$ 0.00US$ 0.00US$ 0.00US$
> (2) Increase (Decrease) in Uearned Premium Reserve 0.00US$ (72.04)US$ (72.04)US$ 0.00US$ (272.31)US$ (272.31)US$
> (3) Earned Premium ((Reinsurance Premium) (1)- (2)) 0.00US$ 72.04US$ 72.04US$ 0.00US$ 272.31US$ 272.31US$
> (4) Currency Tax (Impuesto Divisas) [2% of (3)] 0.00US$ 1.44US$ 1.44US$ 0.00US$ 5.45US$ 5.45US$
> (5) Ceding Allowance [5.8% of (3)] $ 0.00 0.00US$ 4.18US$ 4.18US$ 0.00US$ 15.79US$ 15.79US$
> .........
> Expected: all monetary fields should be extracted in the correct order, with
> the currency symbol before the amount, for example:
> * Written Premium Collected by the Bank US$ 0.00 US$0.00 US$0.00 US$0.00 US$0.00 US$0.00