[ https://issues.apache.org/jira/browse/PDFBOX-6194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074785#comment-18074785 ]
HABA commented on PDFBOX-6194:
------------------------------
Hello [~tilman],
I tried creating a synthetic PDF with shared XObjects but couldn't get it to
reproduce. The original is from a Konica Minolta scanner and contains
confidential data, so I can't share it. With the same code and the same input,
the output is sometimes corrupted and sometimes not. It only ever happens on our
server (Windows Server 2016) and never on my dev machine (Windows 11).
Here's the code that produces the corrupted output:
{code:java}
public static final PDType1Font font = new PDType1Font(Standard14Fonts.FontName.COURIER);

public static byte[] makePdfSearchable(byte[] pdf) throws Exception {
    try (PDDocument document = Loader.loadPDF(pdf)) {
        PDFRenderer pdfRenderer = new PDFRenderer(document);
        for (int page = 0; page < document.getNumberOfPages(); page++) {
            BufferedImage image = pdfRenderer.renderImageWithDPI(page, 300, ImageType.GRAY);
            // image gets sent to AWS Textract for OCR
            List<Block> blocks = extractText(image);
            try (PDPageContentStream contentStream = new PDPageContentStream(
                    document, document.getPage(page),
                    PDPageContentStream.AppendMode.APPEND, true, true)) {
                contentStream.setRenderingMode(RenderingMode.NEITHER);
                for (Block block : blocks) {
                    if (block.blockType() == BlockType.WORD) {
                        String text = block.text();
                        float fontSize = calculateFontSize(text,
                                block.geometry().boundingBox().width()
                                        * document.getPage(page).getMediaBox().getWidth());
                        contentStream.beginText();
                        contentStream.setFont(font, fontSize);
                        contentStream.newLineAtOffset(
                                block.geometry().boundingBox().left()
                                        * document.getPage(page).getMediaBox().getWidth(),
                                document.getPage(page).getMediaBox().getHeight()
                                        - document.getPage(page).getMediaBox().getHeight()
                                        * block.geometry().boundingBox().top());
                        contentStream.showText(text);
                        contentStream.endText();
                    }
                }
            }
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        document.save(baos);
        return baos.toByteArray();
    }
}
{code}
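For reference, the coordinate mapping used in the newLineAtOffset() call above boils down to the following. This is only a standalone sketch and not part of our code: `BoxToPdf`, `toPdfX` and `toPdfY` are illustrative names, and it assumes the MediaBox origin is at (0, 0) and that the OCR bounding box values are normalized to 0..1 with the origin at the top left.

```java
public final class BoxToPdf {

    /** Horizontal position in PDF user space for a normalized left edge (0..1). */
    static float toPdfX(float left, float pageWidth) {
        return left * pageWidth;
    }

    /**
     * Vertical position in PDF user space for a normalized top edge (0..1).
     * PDF user space has its origin at the bottom left, so the value is flipped.
     */
    static float toPdfY(float top, float pageHeight) {
        return pageHeight - pageHeight * top;
    }

    public static void main(String[] args) {
        // Example values only: a 612 x 792 pt page (US Letter).
        float x = toPdfX(0.25f, 612f);
        float y = toPdfY(0.5f, 792f);
        System.out.println(x + ", " + y); // prints "153.0, 396.0"
    }
}
```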
Input is a 3-page scanned PDF where all pages reference the same indirect
XObjects (Obj4 = color image, Obj5 = ImageMask). After save(), Obj5 on page 3
turns into this:
{code:java}
COSDictionary{Type:Font, Subtype:Type1, BaseFont:Courier,
Encoding:WinAnsiEncoding} {code}
Pages 1 and 2 stay fine. Happy to test any patches or run diagnostics if you
point me in the right direction.
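As a starting point for diagnostics, something like the following could log the COS type behind every XObject entry per page, so the output before and after save() can be compared. It is a rough sketch that I have not run against the problem file; `XObjectTypeDump` and `describe` are made-up names, and it only looks at the page-level /XObject dictionary.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;

public final class XObjectTypeDump {

    /** Returns one line per XObject entry: page number, resource name, resolved COS class. */
    static List<String> describe(PDDocument document) {
        List<String> lines = new ArrayList<>();
        int pageNumber = 0;
        for (PDPage page : document.getPages()) {
            pageNumber++;
            PDResources resources = page.getResources();
            if (resources == null) {
                continue; // page has no resources at all
            }
            COSDictionary xobjects =
                    resources.getCOSObject().getCOSDictionary(COSName.XOBJECT);
            if (xobjects == null) {
                continue; // no /XObject dictionary on this page
            }
            for (COSName name : xobjects.keySet()) {
                // getDictionaryObject() resolves indirect references
                COSBase resolved = xobjects.getDictionaryObject(name);
                String type = resolved == null
                        ? "null" : resolved.getClass().getSimpleName();
                lines.add("page " + pageNumber + " /" + name.getName() + " -> " + type);
            }
        }
        return lines;
    }
}
```

The idea would be to call `describe(document)` right after Loader.loadPDF(), again after the page loop, and once more on the reloaded output bytes. An image entry should show up as COSStream each time; a COSDictionary would point at the step where it gets replaced.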
> COSStream becomes COSDictionary after save — shared XObject reference
> replaced by Font
> --------------------------------------------------------------------------------------
>
> Key: PDFBOX-6194
> URL: https://issues.apache.org/jira/browse/PDFBOX-6194
> Project: PDFBox
> Issue Type: Bug
> Components: PDModel
> Affects Versions: 3.0.7 PDFBox
> Environment: Windows Server 2016, Java 21, PDFBox 3.0.7
> Reporter: HABA
> Priority: Major
>
> Hi,
> `document.save()` corrupts an `/XObject` on page 3 of a 3-page PDF.
> Before save:
> - `Obj5` = `COSStream` (ImageMask)
> After save:
> - `Obj5` = `COSDictionary` (Courier font)
> Pages 1–2 are unaffected. All pages share the same indirect XObject refs
> (`Obj4`, `Obj5`).
> Flow:
> - load PDF
> - render pages via `PDFRenderer.renderImageWithDPI()`
> - append invisible OCR text using `PDPageContentStream` (AppendMode.APPEND,
> Courier)
> - save document → corruption occurs
> Result:
> java.io.IOException: Unexpected object type: COSDictionary
>
> Reproduced consistently on:
> * Windows Server 2016, Java 21, PDFBox 3.0.7
> Not reproducible on:
> * Windows 11, Java 21 (same code + input)
> Likely related to shared indirect XObject being overwritten during save.
> Cannot share original PDF (confidential), but can test with synthetic
> reproducer if needed.