Here's the issue with the attached code: https://issues.apache.org/jira/browse/PDFBOX-1109
> > License considerations severely limit what we can do with patches > > provided via the mailing list. > > > > Would you please create an issue at > > https://issues.apache.org/jira/browse/PDFBOX ? When you attach your > > patch files, please check the box that grants us the right to use the > > files. > > Okay, but I have trouble finding an "Add/New/Create/Report issue" button. > Do I need to have an account? I don't really want to create one. > > Stefan > > > > May sound silly ... but ... it keeps everything legal! > > > > Thanks! > > > > Daniel > > > > On Sat, Aug 27, 2011 at 5:04 PM, Stefan Mücke <[email protected]> > > wrote: > > > Hi PDFBox comitters, > > > > > > I would like to contribute a bug fix for a long-standing, major > > > problem in PDFBox. > > > > > > PDFBox uses a scratch file to reduce memory consumption. However, > > > there is no mechanism that prevents two PDStreams from writing to > > > the scratch file at the same time. When this happens, the resulting > > > PDF contains garbage in some streams. This problem occurred to me > > > several times (e.g. when writing to an image stream while > > >constructing a page). > > > Reproducing the bug > > > ******************* > > > > > > One can easily reproduce the bug. Open file AddImageToPDF.java and > > > move the following line: > > > > > > PDPageContentStream contentStream = > > > new PDPageContentStream(doc, page, true, true); > > > > > > immediately after the line in which the PDPage object is fetched: > > > > > > PDPage page = > > > (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 ); > > > > > > With this modification, one will still get a PDF file, but Acrobat > > > Reader will report that the image could not be processed. BTW, the > > > files AddImageToPDF.java and ImageToPDF.java are almost identical. > > > One of them should be deleted. > > > > > > Bug-Fix > > > ******* > > > > > > The problem can be solved by using a scratch file that is divided > > > into pages (e.g. of 4 KB). Each PDStream in the scratch file is > > > then associated with a list of pages. This list grows as more data > > >is written to the stream. > > > The bug fix requires minimal changes to the existing code. The very > > > nice RandomAccess interface made this very easy. > > > > > > Here is what needs to be changed: > > > > > > - Add the attached "PagedMultiRandomAccessFile.java" to the I/O > > > package - Change COSDocument.getScratchFile() to return a > > > RandomAccess instance provided by PagedMultiRandomAccessFile: > > > > > > private PagedMultiRandomAccessFile scratchFile = null; > > > > > > [...] > > > > > > public COSDocument(File scratchDir) throws IOException { > > > tmpFile = File.createTempFile("pdfbox", "tmp", > > > scratchDir); scratchFile = new > > > PagedMultiRandomAccessFile( new > > > RandomAccessFile(tmpFile, "rw")); } > > > > > > public COSDocument(RandomAccess file) { > > > // scratchFile = file; > > > throw new RuntimeException("Not yet implemented."); > > > //$NON-NLS-1$ > > > } > > > > > > [...] > > > > > > /** > > > * Returns a new scratch file. > > > * > > > * @return the newly created scratch file > > > */ > > > public RandomAccess getScratchFile() { > > > return scratchFile.getNewRandomAcess(); > > > } > > > > > > One of the COSDocument constructors takes a RandomAccess file. This > > > constructor is only called in a single location, namely, in method > > > PDFParser.parse(). I am not sure if the RandomAccess parameter > > > provided here is really a scratch file. Someone will have to decide > > > what to do with this one. > > > > > > The code has been throughly tested and has been used in the > > > production of several books without any problems. > > > > > > In the attachment please find the code. There is also a JUnit test > > > that was used to debug my code. I have added an Apache license > > > header and adopted PDFBox's code style. Feel free to make any > > >desired changes. > > > Best regards, > > > > > > Stefan Mücke > > > > > > > > > > > >
