Zer Jun Eng created PDFBOX-6031:
-----------------------------------

             Summary: PDFStreamEngine: inconsistent processPage behaviour in multithreading
                 Key: PDFBOX-6031
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6031
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 3.0.5 PDFBox
            Reporter: Zer Jun Eng
         Attachments: Catalogo_Egitto_2025.pdf, image-2025-07-07-22-35-15-823.png

Dear PDFBox developers,

I modified the
[PrintImageLocations.java|https://github.com/apache/pdfbox/blob/3.0.5/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java]
example to count the number of unique images in a PDF document. A minimal
reproducible example follows:

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObjectKey;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

/**
 * Adapted from
 * https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
 */
public class CountUniqueImages {

  private final Set<COSObjectKey> uniqueImageKeys = ConcurrentHashMap.newKeySet();

  public int countUniqueImages(File file, int nThreads)
      throws IOException, InterruptedException {

    try (PDDocument document = Loader.loadPDF(file);
        ExecutorService executor = Executors.newFixedThreadPool(nThreads)) {

      for (PDPage page : document.getPages()) {
        ImageEngine imageEngine = new ImageEngine(page);
        executor.submit(imageEngine);
      }

      executor.shutdown();
      executor.awaitTermination(1, TimeUnit.MINUTES);

      return uniqueImageKeys.size();
    }
  }

  final class ImageEngine extends PDFStreamEngine implements Callable<Object> {

    private static final Object DONE = new Object();
    private final PDPage page;

    public ImageEngine(PDPage page) {
      this.page = page;

      addOperator(new Concatenate(this));
      addOperator(new DrawObject(this));
      addOperator(new SetGraphicsStateParameters(this));
      addOperator(new Save(this));
      addOperator(new Restore(this));
      addOperator(new SetMatrix(this));
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands)
        throws IOException {
      String operation = operator.getName();
      if (OperatorName.DRAW_OBJECT.equals(operation)) {
        COSName objectName = (COSName) operands.get(0);
        PDXObject xobject = getResources().getXObject(objectName);

        if (xobject instanceof PDImageXObject) {
          PDImageXObject imageXObj = (PDImageXObject) xobject;
          COSObjectKey key = imageXObj.getCOSObject().getKey();
          uniqueImageKeys.add(key);
        } else if (xobject instanceof PDFormXObject) {
          PDFormXObject form = (PDFormXObject) xobject;
          showForm(form);
        }
      } else {
        super.processOperator(operator, operands);
      }
    }

    @Override
    public Object call() throws Exception {
      processPage(page);
      return DONE;
    }
  }
}
{code}

Below is the JUnit test to verify the correctness of the multithreaded 
implementation. I have also attached the PDF file used for testing:

{code:java}
import static org.junit.jupiter.api.Assertions.*;

import java.io.File;
import java.io.IOException;
import org.junit.jupiter.api.Test;

class CountUniqueImagesTest {

  @Test
  void testSingleThreaded() throws IOException, InterruptedException {
    CountUniqueImages counter = new CountUniqueImages();
    int count =
        counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 1);
    assertEquals(122, count);
  }

  @Test
  void testMultiThreaded() throws IOException, InterruptedException {
    CountUniqueImages counter = new CountUniqueImages();
    int count =
        counter.countUniqueImages(new File("src/test/resources/Catalogo_Egitto_2025.pdf"), 4);
    assertEquals(122, count);
  }
}
{code}

I get inconsistent results when using multithreading. The PDF file is expected
to contain 122 unique images. Out of 100 test runs, the multithreaded test case
failed 19 times. In those runs, the reported number of unique images was not
122.

!image-2025-07-07-22-35-15-823.png!

I have read the 
[FAQ|https://pdfbox.apache.org/3.0/faq.html#is-pdfbox-thread-safe%3F] and I 
understand that PDFBox is not thread-safe. Therefore, this issue might be 
related to or a duplicate of https://issues.apache.org/jira/browse/PDFBOX-5541 
or https://issues.apache.org/jira/browse/PDFBOX-5542. However, I'm still 
wondering if this might be a bug, since my code only performs read-only 
operations.
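
One thing worth ruling out before attributing the miscount to PDFBox internals: the reproduction code discards the Future returned by executor.submit(...), so any exception thrown inside processPage would be lost silently and that page's images would simply not be counted. The JDK-only sketch below shows one way to surface such failures. The class and method names (SurfaceTaskFailures, runAndCollectFailures) are hypothetical and not part of PDFBox, and the failing task is simulated rather than taken from the attached PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of PDFBox: runs tasks on a fixed pool and
// reports every exception instead of letting it vanish with the Future.
public class SurfaceTaskFailures {

  static List<Throwable> runAndCollectFailures(
      List<? extends Callable<?>> tasks, int nThreads) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(nThreads);
    List<Throwable> failures = new ArrayList<>();
    try {
      // Keep every Future so that no task result is dropped.
      List<Future<?>> futures = new ArrayList<>();
      for (Callable<?> task : tasks) {
        futures.add(executor.submit(task));
      }
      for (Future<?> future : futures) {
        try {
          // get() blocks until the task finishes and rethrows its failure
          // wrapped in an ExecutionException.
          future.get();
        } catch (ExecutionException e) {
          failures.add(e.getCause());
        }
      }
    } finally {
      executor.shutdown();
    }
    return failures;
  }

  public static void main(String[] args) throws InterruptedException {
    List<Callable<Object>> tasks = new ArrayList<>();
    tasks.add(() -> "ok");
    // Simulated failure standing in for an exception thrown by processPage.
    tasks.add(() -> { throw new IllegalStateException("simulated page failure"); });

    List<Throwable> failures = runAndCollectFailures(tasks, 2);
    System.out.println(failures.size() + " task(s) failed: " + failures);
  }
}
```

Applied to the reproduction, the ImageEngine instances would be passed as the task list. An empty failure list on a run with a wrong count would suggest the pages were processed without errors but produced different results, while a non-empty list would point to the exception that caused a page to be skipped.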



