David created PDFBOX-5532:
-----------------------------

             Summary: COSString field non-ascii characters
                 Key: PDFBOX-5532
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5532
             Project: PDFBox
          Issue Type: Bug
            Reporter: David


 
Hello,

I am reading a PDF document, but the COSString tokens come back with non-ASCII 
characters. What could be the cause? I am using pdfbox-2.0.24.jar.

This is an example of the tokens parsed from the PDF document:

COSInt{50}
COSInt{0}
PDFOperator{Td}
COSString{åÅÕãÁâ@}
PDFOperator{Tj}
COSFloat{770.18}
COSInt{0}
PDFOperator{Td}
COSString{×–Ž–©@}
PDFOperator{Tj}
COSFloat{520.21}
COSInt{0}
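
One possible reading of the dump above, offered as a sketch and not as a diagnosis: the operand of Tj is a sequence of character codes that only the current font's encoding can turn into text, so decoding the raw bytes as a plain string may print unrelated characters. The small program below uses only the JDK. It takes the seven bytes that produce the first COSString when read as ISO-8859-1 (used here as a stand-in, since PDFDocEncoding agrees with it for these byte values) and maps them through a hypothetical font encoding table. The table values follow the EBCDIC (IBM037) layout; that this PDF's font uses such an encoding is purely an assumption, and the names TjBytesDemo, ASSUMED_FONT_ENCODING, asSingleByteText and viaFontEncoding are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class TjBytesDemo {
    // Hypothetical excerpt of a font encoding: character code -> text.
    // ASSUMPTION: the values follow the EBCDIC (IBM037) layout; nothing in the
    // report confirms that the PDF's font really uses it.
    static final Map<Integer, Character> ASSUMED_FONT_ENCODING = new HashMap<>();
    static {
        ASSUMED_FONT_ENCODING.put(0x40, ' ');
        ASSUMED_FONT_ENCODING.put(0xC1, 'A');
        ASSUMED_FONT_ENCODING.put(0xC5, 'E');
        ASSUMED_FONT_ENCODING.put(0xD5, 'N');
        ASSUMED_FONT_ENCODING.put(0xE2, 'S');
        ASSUMED_FONT_ENCODING.put(0xE3, 'T');
        ASSUMED_FONT_ENCODING.put(0xE5, 'V');
    }

    // Reads each byte as one ISO-8859-1 character, ignoring any font encoding.
    static String asSingleByteText(byte[] codes) {
        return new String(codes, StandardCharsets.ISO_8859_1);
    }

    // Looks each character code up in the assumed font encoding table;
    // codes missing from the table are shown as '?'.
    static String viaFontEncoding(byte[] codes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            Character c = ASSUMED_FONT_ENCODING.get(b & 0xFF);
            sb.append(c != null ? c : '?');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bytes behind the first COSString in the dump, when read as ISO-8859-1.
        byte[] codes = {(byte) 0xE5, (byte) 0xC5, (byte) 0xD5, (byte) 0xE3,
                        (byte) 0xC1, (byte) 0xE2, 0x40};
        System.out.println(asSingleByteText(codes)); // the non-ASCII text from the dump
        System.out.println(viaFontEncoding(codes));  // "VENTAS " under the assumed table
    }
}
```

If this reading is right, comparing previous.getString() with a search string would only match when the font's encoding happens to coincide with the string decoding, which would also affect the replacement bytes written back.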


Java function:

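import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.common.PDStream;

    public static PDDocument replaceText(PDDocument document, String searchString,
            String replacement) throws IOException {
        PDPageTree pages = document.getDocumentCatalog().getPages();
        for (PDPage page : pages) {
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            // Start at 1 so that tokens.get(j - 1) is always a valid index.
            for (int j = 1; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                // Tj takes a single string operand: the token just before it.
                if (next instanceof Operator
                        && "Tj".equals(((Operator) next).getName())
                        && tokens.get(j - 1) instanceof COSString) {
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    System.out.println("previous:=" + string);
                    if (string.equals(searchString)) {
                        previous.setValue(new COSString(replacement).getBytes());
                    }
                }
            }
            // Now that the tokens are updated, replace the page content stream.
            PDStream updatedStream = new PDStream(document);
            // Close the stream before attaching it so the content is fully written.
            try (OutputStream out = updatedStream.createOutputStream()) {
                new ContentStreamWriter(out).writeTokens(tokens);
            }
            page.setContents(updatedStream);
        }
        return document;
    }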
         



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
