[jira] [Commented] (PDFBOX-5532) COSString field non-ascii characters

David (Jira) Mon, 24 Oct 2022 14:41:07 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623432#comment-17623432
 ]


David commented on PDFBOX-5532:
-------------------------------

So what steps are needed to correctly decode the text held in a COSString? 
Do I have to look up the current font, or obtain the font's TTF file and load 
it before parsing? 

I need to edit the original PDF file to change the value of some fields.

Thanks

> COSString field non-ascii characters
> ------------------------------------
>
>                 Key: PDFBOX-5532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5532
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: David
>            Priority: Major
>
>  
> Hello,
> I am reading a PDF document, but the COSString tokens come back containing 
> unexpected non-ASCII characters. What could be the cause? I am using 
> pdfbox-2.0.24.jar.
> Here is an example of the parsed tokens:
> COSInt{50}
> COSInt{0}
> PDFOperator{Td}
> COSString{åÅÕãÁâ@}
> PDFOperator{Tj}
> COSFloat{770.18}
> COSInt{0}
> PDFOperator{Td}
> COSString{×–Ž–©@}
> PDFOperator{Tj}
> COSFloat{520.21}
> COSInt{0}
> Java function:
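> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.List;
> import org.apache.pdfbox.contentstream.operator.Operator;
> import org.apache.pdfbox.cos.COSString;
> import org.apache.pdfbox.pdfparser.PDFStreamParser;
> import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.common.PDStream;
>
> public static PDDocument replaceText(PDDocument document, String searchString,
>         String replacement) throws IOException {
>     for (PDPage page : document.getDocumentCatalog().getPages()) {
>         // Tokenize the page content stream into operands and operators.
>         PDFStreamParser parser = new PDFStreamParser(page);
>         parser.parse();
>         List<Object> tokens = parser.getTokens();
>         // Start at 1 so that tokens.get(j - 1) is always valid.
>         for (int j = 1; j < tokens.size(); j++) {
>             Object next = tokens.get(j);
>             Object operand = tokens.get(j - 1);
>             // Only handle "Tj" operators whose operand really is a string.
>             if (next instanceof Operator
>                     && "Tj".equals(((Operator) next).getName())
>                     && operand instanceof COSString) {
>                 // getString() does not apply the font's encoding, so the
>                 // result may differ from the text shown on the page.
>                 String string = ((COSString) operand).getString();
>                 System.out.println("previous:=" + string);
>                 if (string.equals(searchString)) {
>                     // Swap in a new operand instead of mutating the old one.
>                     tokens.set(j - 1, new COSString(replacement));
>                 }
>             }
>         }
>         // Now that the tokens are updated, replace the page content stream.
>         PDStream updatedStream = new PDStream(document);
>         try (OutputStream out = updatedStream.createOutputStream()) {
>             new ContentStreamWriter(out).writeTokens(tokens);
>         }
>         page.setContents(updatedStream);
>     }
>     return document;
> }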
>        
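
As a hedged illustration of the question about fonts: the bytes of a Tj
operand are character codes, and what they mean depends on the mapping used
to read them. The sketch below is not PDFBox code and not a confirmed
diagnosis of this file. It uses a hypothetical toy table (mapCode) laid out
like the common EBCDIC uppercase letter ranges, chosen only because the first
sample string happens to read as a plausible word under it. It also assumes
the printed characters map back to the same byte values, which holds for
ISO-8859-1 for this particular string.

```java
import java.nio.charset.StandardCharsets;

public class CodeMappingSketch {

    // Hypothetical stand-in for a font's code-to-character table.
    // The ranges follow the usual EBCDIC uppercase layout; a real PDF
    // font may use a completely different mapping.
    public static char mapCode(int code) {
        if (code == 0x40) return ' ';
        if (code >= 0xC1 && code <= 0xC9) return (char) ('A' + code - 0xC1);
        if (code >= 0xD1 && code <= 0xD9) return (char) ('J' + code - 0xD1);
        if (code >= 0xE2 && code <= 0xE9) return (char) ('S' + code - 0xE2);
        return '?'; // code not covered by this toy table
    }

    // Decodes raw string bytes one code at a time through the table above.
    public static String decode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            sb.append(mapCode(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First string from the token dump, written with Unicode escapes.
        String asPrinted = "\u00E5\u00C5\u00D5\u00E3\u00C1\u00E2@";
        // Every character is below U+0100, so ISO-8859-1 gives back the
        // underlying byte values (an assumption about how it was printed).
        byte[] raw = asPrinted.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("[" + decode(raw) + "]"); // prints [VENTAS ]
    }
}
```

In real code the raw bytes would come from COSString.getBytes() rather than
from a printed string, and the mapping would come from the font selected by
the preceding Tf operator (PDFont.toUnicode(int) in PDFBox 2.0), not from a
hard-coded table. A replacement string would need to be encoded with that
same font before being written back.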



