[ 
https://issues.apache.org/jira/browse/PDFBOX-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578947#comment-17578947
 ] 

Tilman Hausherr commented on PDFBOX-5492:
-----------------------------------------

The extracted order is the order in which the text is stored in the PDF, which 
can be quite weird. You can get sorted text by enabling the sort flag (not sure 
if this output is useful): 
[^sample-sorted.txt] 
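
A minimal sketch of what enabling sorting might look like in Scala, assuming the flag meant here is PDFTextStripper.setSortByPosition in PDFBox 2.0.x. The function name getSortedTextFromPdf is hypothetical and only for illustration; the rest mirrors the reporter's snippet.

```scala
import java.io.File
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Hypothetical helper: extracts page 1 with position-based sorting enabled.
// Assumes PDFBox 2.0.x on the classpath (PDDocument.load is the 2.0.x API).
def getSortedTextFromPdf(filename: String): Option[String] = {
  val doc = PDDocument.load(new File(filename))
  try {
    val stripper = new PDFTextStripper
    // Assumption: this is "the flag". It orders text by its position on the
    // page instead of by its order in the PDF content stream.
    stripper.setSortByPosition(true)
    stripper.setStartPage(1)
    stripper.setEndPage(1)
    Some(stripper.getText(doc))
  } finally {
    doc.close() // release the document even if extraction fails
  }
}
```

Sorting by position changes only the order of the extracted text; it does not reconstruct columns or cells.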

For tables you should try a tool like Tabula.

> The order of text extracted from PDF by PDFTextStripper is incorrect.
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-5492
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5492
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Documentation, PDModel
>    Affects Versions: 2.0.26
>         Environment: Windows 11  + Intellij +  Spark3.12 + scala2.12
>            Reporter: Fred Yu
>            Priority: Major
>         Attachments: sample-sorted.txt, sample-unsorted.txt, sample.pdf
>
>
>  With PDFBox version 2.0.6, the following code extracts the text from the PDF 
> file attached to this issue:
> import java.io.File
> import org.apache.pdfbox.pdmodel.{PDDocument, PDDocumentInformation}
> import org.apache.pdfbox.text.PDFTextStripper
>
> // Extracts the text of page 1 of the given PDF.
> def getTextFromPdf(filename: String): Option[String] = {
>   val doc: PDDocument = PDDocument.load(new File(filename))
>   try {
>     val docInfo: PDDocumentInformation = doc.getDocumentInformation()
>     val stripper = new PDFTextStripper
>     stripper.setStartPage(1)
>     stripper.setEndPage(1)
>     Some(stripper.getText(doc))
>   } finally {
>     doc.close() // release the document even if extraction fails
>   }
> }
>  
> Output:
>     ...........
>  * (1) Written Premium Collected by the Bank    0.00US$    0.00US$    0.00US$    0.00US$    0.00US$    0.00US$
> (2) Increase (Decrease) in Uearned Premium Reserve    0.00US$    (72.04)US$    (72.04)US$    0.00US$    (272.31)US$    (272.31)US$
> (3) Earned Premium ((Reinsurance Premium) (1)- (2))    0.00US$    72.04US$    72.04US$    0.00US$    272.31US$    272.31US$
> (4) Currency Tax (Impuesto Divisas) [2% of (3)]    0.00US$    1.44US$    1.44US$    0.00US$    5.45US$    5.45US$
> (5) Ceding Allowance [5.8% of (3)]    $ 0.00 0.00US$    4.18US$    4.18US$    0.00US$    15.79US$    15.79US$
> .........
> Expected: all the money fields should be in the correct order, like:
>  * Written Premium Collected by the Bank    US$ 0.00    US$0.00    US$0.00    US$0.00    US$0.00    US$0.00


