[ https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler updated PDFBOX-1109:
---------------------------------------
Component/s: PDModel, Parsing
> Data corruption related to scratch file use
> -------------------------------------------
>
> Key: PDFBOX-1109
> URL: https://issues.apache.org/jira/browse/PDFBOX-1109
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, PDModel
> Affects Versions: 1.8.7, 2.0.0
> Reporter: Stefan Mücke
> Assignee: Andreas Lehmkühler
> Priority: Critical
> Fix For: 2.0.0
>
> Attachments: COSDocument.java, PagedMultiRandomAccessFile.java,
> PagedMultiRandomAccessFileTest.java
>
>
> PDFBox uses a scratch file to reduce memory consumption. However, there is no
> mechanism that prevents two PDStreams from writing to the scratch file at the
> same time. When this happens, the resulting PDF contains garbage in some
> streams. I have run into this problem several times (e.g. when writing to an
> image stream while constructing a page).
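
A toy model may help illustrate the failure mode described above. It is only a sketch under assumptions: NaiveScratch and NaiveStream are hypothetical names, not PDFBox classes, and the model simply assumes that each stream remembers a start offset plus a length and expects its bytes to be contiguous in a shared, append-only scratch area.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Toy model only: a shared append-only scratch area (hypothetical, not PDFBox code). */
class NaiveScratch {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();

    int size() { return data.size(); }

    void append(byte[] bytes) { data.write(bytes, 0, bytes.length); }

    byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        System.arraycopy(data.toByteArray(), offset, out, 0, length);
        return out;
    }
}

/** A stream that assumes its bytes are contiguous, starting at its start offset. */
class NaiveStream {
    private final NaiveScratch scratch;
    private final int start;
    private int length;

    NaiveStream(NaiveScratch scratch) {
        this.scratch = scratch;
        this.start = scratch.size(); // stream begins at the current end of the scratch area
    }

    void write(String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        scratch.append(b);           // always appended at the end, wherever that is now
        length += b.length;
    }

    String contents() {
        return new String(scratch.read(start, length), StandardCharsets.US_ASCII);
    }
}

class InterleavedWriteDemo {
    public static void main(String[] args) {
        NaiveScratch scratch = new NaiveScratch();
        NaiveStream page = new NaiveStream(scratch);
        page.write("PAGE");
        NaiveStream image = new NaiveStream(scratch); // opened while "page" is still open
        image.write("IMG1");
        page.write("MORE");
        image.write("IMG2");
        System.out.println(page.contents());  // prints PAGEIMG1 (expected PAGEMORE)
        System.out.println(image.contents()); // prints IMG1MORE (expected IMG1IMG2)
    }
}
```

In this model, each stream reads back bytes that belong to the other one as soon as the writes interleave, which matches the "garbage in some streams" symptom.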
> Reproducing the bug
> *******************
> One can easily reproduce the bug. Open file AddImageToPDF.java and move the
> following line:
>     PDPageContentStream contentStream =
>         new PDPageContentStream(doc, page, true, true);
> immediately after the line in which the PDPage object is fetched:
>     PDPage page =
>         (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
>
> With this modification, one will still get a PDF file, but Acrobat Reader
> will report that the image could not be processed. BTW, the files
> AddImageToPDF.java and ImageToPDF.java are almost identical. One of them
> should be deleted.
> Bug-Fix
> *******
> The problem can be solved by using a scratch file that is divided into pages
> (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a
> list of pages. This list grows as more data is written to the stream.
> The bug fix requires minimal changes to the existing code. The very nice
> RandomAccess interface made this very easy.
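
The paging idea described above can be sketched roughly as follows. This is not the attached PagedMultiRandomAccessFile: PagedScratch and PagedStream are hypothetical names, the pages live in memory rather than in a file, and only appending and reading back are modelled (no seeking). The 4 KB page size is taken from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a scratch area divided into fixed-size pages. */
class PagedScratch {
    static final int PAGE_SIZE = 4096;
    private final List<byte[]> pages = new ArrayList<>(); // stands in for the scratch file

    /** Allocates a fresh page and returns its index. */
    int allocatePage() {
        pages.add(new byte[PAGE_SIZE]);
        return pages.size() - 1;
    }

    byte[] page(int index) { return pages.get(index); }

    PagedStream newStream() { return new PagedStream(this); }
}

/** One logical stream; it owns a growing list of page indexes. */
class PagedStream {
    private final PagedScratch scratch;
    private final List<Integer> pageIndexes = new ArrayList<>();
    private int length;

    PagedStream(PagedScratch scratch) { this.scratch = scratch; }

    void write(byte[] src) {
        for (byte b : src) {
            int offsetInPage = length % PagedScratch.PAGE_SIZE;
            if (offsetInPage == 0) {
                // First byte, or the last page is full: claim a new page for this stream.
                pageIndexes.add(scratch.allocatePage());
            }
            int lastPage = pageIndexes.get(pageIndexes.size() - 1);
            scratch.page(lastPage)[offsetInPage] = b;
            length++;
        }
    }

    byte[] readAll() {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) {
            int pageIndex = pageIndexes.get(i / PagedScratch.PAGE_SIZE);
            out[i] = scratch.page(pageIndex)[i % PagedScratch.PAGE_SIZE];
        }
        return out;
    }
}

class PagedScratchDemo {
    public static void main(String[] args) {
        PagedScratch scratch = new PagedScratch();
        PagedStream page = scratch.newStream();
        PagedStream image = scratch.newStream();
        byte[] a = new byte[3000];
        byte[] b = new byte[3000];
        Arrays.fill(a, (byte) 'A');
        Arrays.fill(b, (byte) 'B');
        for (int i = 0; i < 3; i++) { // interleaved writes to both streams
            page.write(a);
            image.write(b);
        }
        System.out.println(new String(page.readAll()).matches("A{9000}"));  // prints true
        System.out.println(new String(image.readAll()).matches("B{9000}")); // prints true
    }
}
```

Because each stream only ever touches pages from its own list, interleaved writes cannot overwrite or shift another stream's data. The cost is that the last page of every stream may be partly unused.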
> Here is what needs to be changed:
> - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package
> - Change COSDocument.getScratchFile() to return a RandomAccess
>   instance provided by PagedMultiRandomAccessFile:
>     private PagedMultiRandomAccessFile scratchFile = null;
>     [...]
>     public COSDocument(File scratchDir) throws IOException {
>         tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
>         scratchFile = new PagedMultiRandomAccessFile(
>             new RandomAccessFile(tmpFile, "rw"));
>     }
>     public COSDocument(RandomAccess file) {
>         // scratchFile = file;
>         throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
>     }
>
>     [...]
>     /**
>      * Returns a new scratch file.
>      *
>      * @return the newly created scratch file
>      */
>     public RandomAccess getScratchFile() {
>         return scratchFile.getNewRandomAcess();
>     }
> One of the COSDocument constructors takes a RandomAccess file. This
> constructor is only called in a single location, namely, in method
> PDFParser.parse(). I am not sure if the RandomAccess parameter provided here
> is really a scratch file. Someone will have to decide what to do with this
> one.
> The code has been thoroughly tested and has been used in the production of
> several books without any problems.
> The code is attached, along with a JUnit test that I used to debug it. I
> have added an Apache license header and adopted PDFBox's code style. Feel
> free to make any desired changes.