Zer Jun Eng created PDFBOX-6031:
-----------------------------------

             Summary: PDFStreamEngine: inconsistent processPage behaviour in multithreading
                 Key: PDFBOX-6031
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6031
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 3.0.5 PDFBox
            Reporter: Zer Jun Eng
         Attachments: Catalogo_Egitto_2025.pdf, image-2025-07-07-22-35-15-823.png

Dear PDFBox developers,

I modified the [PrintImageLocations.java|https://github.com/apache/pdfbox/blob/3.0.5/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java] example to count the number of unique images in a PDF document. The minimal reproducible code is below:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObjectKey;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

/**
 * Adapted from
 * https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
 */
public class CountUniqueImages
{
    private final Set<COSObjectKey> uniqueImageKeys = ConcurrentHashMap.newKeySet();

    public int countUniqueImages(File file, int nThreads) throws IOException, InterruptedException
    {
        try (PDDocument document = Loader.loadPDF(file);
             ExecutorService executor = Executors.newFixedThreadPool(nThreads))
        {
            for (PDPage page : document.getPages())
            {
                ImageEngine imageEngine = new ImageEngine(page);
                executor.submit(imageEngine);
            }
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.MINUTES);
            return uniqueImageKeys.size();
        }
    }

    final class ImageEngine extends PDFStreamEngine implements Callable<Object>
    {
        private static final Object DONE = new Object();

        private final PDPage page;

        public ImageEngine(PDPage page)
        {
            this.page = page;
            addOperator(new Concatenate(this));
            addOperator(new DrawObject(this));
            addOperator(new SetGraphicsStateParameters(this));
            addOperator(new Save(this));
            addOperator(new Restore(this));
            addOperator(new SetMatrix(this));
        }

        @Override
        protected void processOperator(Operator operator, List<COSBase> operands) throws IOException
        {
            String operation = operator.getName();
            if (OperatorName.DRAW_OBJECT.equals(operation))
            {
                COSName objectName = (COSName) operands.get(0);
                PDXObject xobject = getResources().getXObject(objectName);
                if (xobject instanceof PDImageXObject)
                {
                    PDImageXObject imageXObj = (PDImageXObject) xobject;
                    COSObjectKey key = imageXObj.getCOSObject().getKey();
                    uniqueImageKeys.add(key);
                }
                else if (xobject instanceof PDFormXObject)
                {
                    PDFormXObject form = (PDFormXObject) xobject;
                    showForm(form);
                }
            }
            else
            {
                super.processOperator(operator, operands);
            }
        }

        @Override
        public Object call() throws Exception
        {
            processPage(page);
            return DONE;
        }
    }
}
{code}

Below is the JUnit test to verify the correctness of the multithreaded implementation.
I have also attached the PDF file used for testing:

{code:java}
import static org.junit.jupiter.api.Assertions.*;

import java.io.File;
import java.io.IOException;

import org.junit.jupiter.api.Test;

class CountUniqueImagesTest
{
    @Test
    void testSingleThreaded() throws IOException, InterruptedException
    {
        CountUniqueImages counter = new CountUniqueImages();
        int count = counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 1);
        assertEquals(122, count);
    }

    @Test
    void testMultiThreaded() throws IOException, InterruptedException
    {
        CountUniqueImages counter = new CountUniqueImages();
        int count = counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 4);
        assertEquals(122, count);
    }
}
{code}

I am getting inconsistent results when using multithreading. The PDF file is expected to contain 122 unique images. Out of 100 test runs, the multithreaded test case fails 19 times. In those cases, the code does not correctly count the number of unique images.

!image-2025-07-07-22-35-15-823.png!

I have read the [FAQ|https://pdfbox.apache.org/3.0/faq.html#is-pdfbox-thread-safe%3F] and I understand that PDFBox is not thread-safe. Therefore, this issue might be related to or a duplicate of https://issues.apache.org/jira/browse/PDFBOX-5541 or https://issues.apache.org/jira/browse/PDFBOX-5542. However, I'm still wondering if this might be a bug, since my code only performs read-only operations.
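
A side note on the reproducer above: the Future returned by executor.submit(...) is discarded, so an exception thrown inside call() on a worker thread would not be reported and would only show up as a lower count. The following is a minimal, PDFBox-free sketch of how the futures could be collected and checked to find out whether failing runs involve a hidden exception. The class name FutureCheck and the method countFailures are hypothetical helpers for illustration only, and the failing task is simulated.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper (not part of PDFBox): runs tasks and reports those that threw. */
public class FutureCheck
{
    /** Submits every task, waits for each Future and returns the number of failed tasks. */
    public static int countFailures(List<Callable<Object>> tasks, int nThreads)
            throws InterruptedException
    {
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        try
        {
            List<Future<Object>> futures = new ArrayList<>();
            for (Callable<Object> task : tasks)
            {
                futures.add(executor.submit(task));
            }
            int failures = 0;
            for (Future<Object> future : futures)
            {
                try
                {
                    // Blocks until the task is done; a task exception is rethrown wrapped.
                    future.get();
                }
                catch (ExecutionException e)
                {
                    failures++;
                    System.err.println("Task failed: " + e.getCause());
                }
            }
            return failures;
        }
        finally
        {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        List<Callable<Object>> tasks = new ArrayList<>();
        tasks.add(() -> "ok");
        // Stands in for a processPage call that throws on a worker thread.
        tasks.add(() -> { throw new IOException("simulated page failure"); });
        tasks.add(() -> "ok");
        System.out.println(countFailures(tasks, 4)); // prints 1
    }
}
```

In the reproducer, the same pattern would mean keeping the Future of each submitted ImageEngine and calling get() on it before reading uniqueImageKeys.size(). This only makes hidden exceptions visible; it does not by itself make concurrent access to a single PDDocument safe.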