[ 
https://issues.apache.org/jira/browse/PDFBOX-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578960#comment-17578960
 ] 

Tilman Hausherr commented on PDFBOX-5492:
-----------------------------------------

I used the command-line tool, but you can do the same in your own code by 
calling {{stripper.setSortByPosition(true)}} before extracting the text.
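
For illustration, here is a minimal Scala sketch of where that call could go. It assumes PDFBox 2.0.x is on the classpath; the helper name {{extractSortedText}} is hypothetical and not part of PDFBox, and whether sorting fully fixes the field order for a given PDF still has to be checked against the actual output.

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper

// Hypothetical helper (not part of PDFBox): extracts the text of a single
// page with position-based sorting enabled.
def extractSortedText(doc: PDDocument, page: Int): String = {
  val stripper = new PDFTextStripper
  // Sort text by its position on the page instead of using the order
  // in which it appears in the content stream.
  stripper.setSortByPosition(true)
  stripper.setStartPage(page)
  stripper.setEndPage(page)
  stripper.getText(doc)
}
```

A caller would load the document with {{PDDocument.load(new File(filename))}}, pass it to the helper together with the page number, and close the document afterwards.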

> The order of text extracted from PDF by PDFTextStripper is incorrect.
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-5492
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5492
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Documentation, PDModel
>    Affects Versions: 2.0.26
>         Environment: Windows 11  + Intellij +  Spark3.12 + scala2.12
>            Reporter: Fred Yu
>            Priority: Major
>         Attachments: sample-sorted.txt, sample-unsorted.txt, sample.pdf
>
>
> With pdfbox version 2.0.6, the following code extracts the text from the 
> attached PDF file:
> import java.io.File
> import org.apache.pdfbox.pdmodel.{PDDocument, PDDocumentInformation}
> import org.apache.pdfbox.text.PDFTextStripper
>
> def getTextFromPdf(filename: String): Option[String] = {
>     // Load the document and make sure it is closed afterwards
>     val doc: PDDocument = PDDocument.load(new File(filename))
>     try {
>         val docInfo: PDDocumentInformation = doc.getDocumentInformation()
>         val stripper = new PDFTextStripper
>         // Restrict extraction to the first page
>         stripper.setStartPage(1)
>         stripper.setEndPage(1)
>         Some(stripper.getText(doc))
>     } finally {
>         doc.close()
>     }
> }
>  
> Output:
>     ...........
>  * (1) Written Premium Collected by the Bank 0.00US$  0.00US$  0.00US$  0.00US$  0.00US$  0.00US$
> (2) Increase (Decrease) in Uearned Premium Reserve 0.00US$  (72.04)US$  (72.04)US$  0.00US$  (272.31)US$  (272.31)US$
> (3) Earned Premium ((Reinsurance Premium) (1)- (2)) 0.00US$  72.04US$  72.04US$  0.00US$  272.31US$  272.31US$
> (4) Currency Tax (Impuesto Divisas) [2% of (3)] 0.00US$  1.44US$  1.44US$  0.00US$  5.45US$  5.45US$
> (5) Ceding Allowance [5.8% of (3)] $ 0.00 0.00US$  4.18US$  4.18US$  0.00US$  15.79US$  15.79US$
> .........
> Expected: all money fields should be in the correct order, for example:
>  * Written Premium Collected by the Bank US$ 0.00  US$0.00  US$0.00  US$0.00  US$0.00  US$0.00


