[jira] [Commented] (PDFBOX-5532) COSString field non-ascii characters

Jira Tue, 25 Oct 2022 23:25:51 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624210#comment-17624210
 ]


Andreas Lehmkühler commented on PDFBOX-5532:
--------------------------------------------

Also keep the issue with font subsets in mind. A subsetted font contains only 
the characters that are actually used in the text; all unused characters are 
removed. In such cases you can't replace an existing text with a new one that 
needs any of the stripped characters. For example, if your text is "Test", the 
subsetted font is limited to the characters "T", "e", "s" and "t". You can't 
replace "Test" with "Hello" because the characters "H", "l" and "o" are missing.
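
The "Test"/"Hello" case can be sketched in plain Java. This is only a model of 
the idea, not PDFBox code: the class name SubsetCheck and the method 
missingCharacters are hypothetical, and the sketch assumes the subset holds 
exactly the characters of the existing text. In a real PDF the available 
glyphs would have to be determined from the embedded font itself.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox: models a subsetted font as the
// set of characters that occur in the text it was subsetted for.
class SubsetCheck {

    // Returns the characters of 'replacement' that a font subsetted for
    // 'existingText' would not contain, in order of first appearance.
    static Set<Character> missingCharacters(String existingText, String replacement) {
        // Assumption: the subset contains exactly the characters of existingText.
        Set<Character> available = new HashSet<>();
        for (char c : existingText.toCharArray()) {
            available.add(c);
        }
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : replacement.toCharArray()) {
            if (!available.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "Test" only provides T, e, s and t, so H, l and o are reported.
        System.out.println(missingCharacters("Test", "Hello")); // prints [H, l, o]
    }
}
```

An empty result only means the replacement stays within the subset in this 
simplified model; it says nothing about the other problems with replacing text.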

All these issues, and maybe some additional ones, are the reason why we removed 
the replace-text code from the samples project. One might come to the 
conclusion that it is easy to replace some text, but most likely it isn't.

> COSString field non-ascii characters
> ------------------------------------
>
>                 Key: PDFBOX-5532
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5532
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: David
>            Priority: Major
>
>  
> Hello,
> I am reading a PDF document, but the COSString tokens that are retrieved 
> contain non-ASCII characters. What could be the cause? I am using 
> pdfbox-2.0.24.jar.
> This is an example of the tokens parsed from the PDF document:
> {code}
> COSInt{50}
> COSInt{0}
> PDFOperator{Td}
> COSString{åÅÕãÁâ@}
> PDFOperator{Tj}
> COSFloat{770.18}
> COSInt{0}
> PDFOperator{Td}
> COSString{×–Ž–©@}
> PDFOperator{Tj}
> COSFloat{520.21}
> COSInt{0}
> {code}
> Java function:
> {code}
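> import java.io.IOException;
> import java.io.OutputStream;
> import java.util.List;
> import org.apache.pdfbox.contentstream.operator.Operator;
> import org.apache.pdfbox.cos.COSString;
> import org.apache.pdfbox.pdfparser.PDFStreamParser;
> import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.common.PDStream;
>
> public static PDDocument replaceText(PDDocument document, String searchString,
>         String replacement) throws IOException {
>     for (PDPage page : document.getDocumentCatalog().getPages()) {
>         // Parse the page content stream into operands and operators.
>         PDFStreamParser parser = new PDFStreamParser(page);
>         parser.parse();
>         List<Object> tokens = parser.getTokens();
>         for (int j = 1; j < tokens.size(); j++) {
>             Object next = tokens.get(j);
>             // Only handle "Tj" operators that directly follow a string operand.
>             if (next instanceof Operator
>                     && "Tj".equals(((Operator) next).getName())
>                     && tokens.get(j - 1) instanceof COSString) {
>                 COSString previous = (COSString) tokens.get(j - 1);
>                 // getString() does not apply the encoding of the current font,
>                 // so the result can differ from the text shown on the page.
>                 String string = previous.getString();
>                 System.out.println("previous:=" + string);
>                 if (string.equals(searchString)) {
>                     previous.setValue(new COSString(replacement).getBytes());
>                 }
>             }
>         }
>         // Now that the tokens are updated, replace the page content stream.
>         PDStream updatedStream = new PDStream(document);
>         try (OutputStream out = updatedStream.createOutputStream()) {
>             new ContentStreamWriter(out).writeTokens(tokens);
>         }
>         page.setContents(updatedStream);
>     }
>     return document;
> }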
> {code}



