[ 
https://issues.apache.org/jira/browse/PDFBOX-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074819#comment-18074819
 ] 

HABA commented on PDFBOX-6194:
------------------------------

Sorry, I said Konica Minolta earlier but it's actually a Canon scanner. Here's 
the document info:
!image-2026-04-20-12-33-11-057.png|width=491,height=737!
I hope that helps. Also, the OCR engine is not tess4j; I'm using AWS Textract:


{code:java}
// Build a Textract client with static credentials; try-with-resources closes it.
try (TextractClient textractClient = TextractClient.builder()
        .region(region)
        .credentialsProvider(StaticCredentialsProvider.create(
                AwsBasicCredentials.create(accessKey, secretKey)))
        .build()) {
  // Wrap the rendered page image bytes and run text detection on them.
  Document document = Document.builder().bytes(SdkBytes.fromByteArray(image)).build();
  DetectDocumentTextRequest request = DetectDocumentTextRequest.builder().document(document).build();
  DetectDocumentTextResponse response = textractClient.detectDocumentText(request);
  return response.blocks();
}
{code}

I render each page with PDFRenderer.renderImageWithDPI(page, 300, 
ImageType.GRAY), send the PNG to Textract, get words back, then write them onto 
the page with PDPageContentStream in AppendMode.APPEND using a Courier font.
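
To make the placement step concrete, here is a rough sketch of the arithmetic involved, not the exact code from my project. It assumes Textract's bounding boxes are ratios of the image size with the origin at the top-left, that the page is unrotated with its lower-left corner at (0, 0), and that the rendered image covers the whole page. Because the coordinates are ratios, the render DPI drops out. The class and method names (OcrPlacement, pdfX, pdfY, courierFontSize) are hypothetical, and using the bottom edge of the box as the baseline is only an approximation.

```java
final class OcrPlacement {
    /** Courier is monospaced: every glyph advances 600/1000 of the font size. */
    static final double COURIER_ADVANCE = 0.6;

    /** Left edge of the word in PDF points, from the normalized left ratio. */
    static double pdfX(double left, double pageWidthPt) {
        return left * pageWidthPt;
    }

    /**
     * Bottom edge of the word box in PDF points. The OCR box is measured
     * from the top of the image, PDF user space from the bottom, so flip it.
     */
    static double pdfY(double top, double height, double pageHeightPt) {
        return (1.0 - top - height) * pageHeightPt;
    }

    /** Font size at which the text, set in Courier, spans the box width. */
    static double courierFontSize(String text, double width, double pageWidthPt) {
        if (text.isEmpty()) {
            throw new IllegalArgumentException("text must not be empty");
        }
        return (width * pageWidthPt) / (COURIER_ADVANCE * text.length());
    }
}
```

The resulting values would then go to PDPageContentStream.setFont(font, size) and newLineAtOffset(x, y) before showText(word).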

> COSStream becomes COSDictionary after save — shared XObject reference 
> replaced by Font
> --------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6194
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6194
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 3.0.7 PDFBox
>         Environment: Windows Server 2016, Java 21, PDFBox 3.0.7
>            Reporter: HABA
>            Priority: Major
>         Attachments: image-2026-04-20-12-33-11-057.png, screenshot-1.png
>
>
> Hi,
> `document.save()` corrupts an `/XObject` on page 3 of a 3-page PDF.
> Before save:
> - `Obj5` = `COSStream` (ImageMask)
> After save:
> - `Obj5` = `COSDictionary` (Courier font)
> Pages 1–2 are unaffected. All pages share the same indirect XObject refs 
> (`Obj4`, `Obj5`).
> Flow:
> - load PDF
> - render pages via `PDFRenderer.renderImageWithDPI()`
> - append invisible OCR text using `PDPageContentStream` (AppendMode.APPEND, 
> Courier)
> - save document → corruption occurs
> Result:
> java.io.IOException: Unexpected object type: COSDictionary
>  
> Reproduced consistently on:
>  * Windows Server 2016, Java 21, PDFBox 3.0.7
> Not reproducible on:
>  * Windows 11, Java 21 (same code + input)
> Likely related to shared indirect XObject being overwritten during save.
> Cannot share original PDF (confidential), but can test with synthetic 
> reproducer if needed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
