> License considerations severely limit what we can do with patches
> provided via the mailing list.
>
> Would you please create an issue at
> https://issues.apache.org/jira/browse/PDFBOX ? When you attach your
> patch files, please check the box that grants us the right to use the
> files.

Okay, but I have trouble finding an "Add/New/Create/Report issue"
button. Do I need to have an account? I don't really want to create one.

Stefan

> May sound silly ... but ... it keeps everything legal!
>
> Thanks!
>
> Daniel
>
> On Sat, Aug 27, 2011 at 5:04 PM, Stefan Mücke <[email protected]> wrote:
>
> > Hi PDFBox committers,
> >
> > I would like to contribute a bug fix for a long-standing, major problem
> > in PDFBox.
> >
> > PDFBox uses a scratch file to reduce memory consumption. However, there
> > is no mechanism that prevents two PDStreams from writing to the
> > scratch file at the same time. When this happens, the resulting PDF
> > contains garbage in some streams. I have run into this problem several
> > times (e.g. when writing to an image stream while constructing a
> > page).
> >
> > Reproducing the bug
> > *******************
> >
> > One can easily reproduce the bug. Open file AddImageToPDF.java and move
> > the following line:
> >
> >     PDPageContentStream contentStream =
> >         new PDPageContentStream(doc, page, true, true);
> >
> > immediately after the line in which the PDPage object is fetched:
> >
> >     PDPage page =
> >         (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
> >
> > With this modification, one will still get a PDF file, but Acrobat
> > Reader will report that the image could not be processed. BTW, the
> > files AddImageToPDF.java and ImageToPDF.java are almost identical. One
> > of them should be deleted.
> >
> > Bug-Fix
> > *******
> >
> > The problem can be solved by using a scratch file that is divided into
> > pages (e.g. of 4 KB). Each PDStream in the scratch file is then
> > associated with a list of pages. This list grows as more data is
> > written to the stream.
> >
> > The bug fix requires minimal changes to the existing code. The very
> > nice RandomAccess interface made this very easy.
> >
> > Here is what needs to be changed:
> >
> > - Add the attached "PagedMultiRandomAccessFile.java" to the I/O
> >   package
> > - Change COSDocument.getScratchFile() to return a
> >   RandomAccess instance provided by PagedMultiRandomAccessFile:
> >
> >     private PagedMultiRandomAccessFile scratchFile = null;
> >
> >     [...]
> >
> >     public COSDocument(File scratchDir) throws IOException {
> >         tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
> >         scratchFile = new PagedMultiRandomAccessFile(
> >             new RandomAccessFile(tmpFile, "rw"));
> >     }
> >
> >     public COSDocument(RandomAccess file) {
> >         // scratchFile = file;
> >         throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
> >     }
> >
> >     [...]
> >
> >     /**
> >      * Returns a new scratch file.
> >      *
> >      * @return the newly created scratch file
> >      */
> >     public RandomAccess getScratchFile() {
> >         return scratchFile.getNewRandomAcess();
> >     }
> >
> > One of the COSDocument constructors takes a RandomAccess file. This
> > constructor is only called in a single location, namely, in method
> > PDFParser.parse(). I am not sure if the RandomAccess parameter provided
> > here is really a scratch file. Someone will have to decide what to do
> > with this one.
> >
> > The code has been thoroughly tested and has been used in the production
> > of several books without any problems.
> >
> > In the attachment please find the code. There is also a JUnit test that
> > was used to debug my code. I have added an Apache license header and
> > adopted PDFBox's code style. Feel free to make any desired changes.
> >
> > Best regards,
> >
> > Stefan Mücke
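
For readers without the attachment, here is a rough sketch of the paging idea described under "Bug-Fix": one shared store divided into 4 KB pages, with each stream keeping its own list of pages. This is not the attached PagedMultiRandomAccessFile. The names PagedScratch, Stream, newStream, write and readAll are made up for illustration, and an in-memory byte array stands in for the real scratch file so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one backing store shared by several streams, split into fixed-size pages. */
class PagedScratch {
    static final int PAGE_SIZE = 4096;   // 4 KB pages, as suggested in the mail

    private byte[] store = new byte[0];  // in-memory stand-in for the scratch file (assumption)
    private int pageCount = 0;           // number of pages handed out so far

    /** Reserves the next free page at the end of the store and returns its index. */
    private int allocatePage() {
        store = Arrays.copyOf(store, (pageCount + 1) * PAGE_SIZE);
        return pageCount++;
    }

    /** Returns a new, independent stream that lives inside the shared store. */
    Stream newStream() {
        return new Stream();
    }

    /** One logical stream: an ordered list of the pages it owns, plus its length. */
    class Stream {
        private final List<Integer> pages = new ArrayList<>();
        private int length = 0;

        /** Appends data, claiming a fresh page whenever the current one is full. */
        void write(byte[] data) {
            for (byte b : data) {
                int offsetInPage = length % PAGE_SIZE;
                if (offsetInPage == 0) {
                    pages.add(allocatePage());
                }
                int page = pages.get(length / PAGE_SIZE);
                store[page * PAGE_SIZE + offsetInPage] = b;
                length++;
            }
        }

        /** Copies the whole logical stream back out, following its page list. */
        byte[] readAll() {
            byte[] out = new byte[length];
            for (int pos = 0; pos < length; pos += PAGE_SIZE) {
                int page = pages.get(pos / PAGE_SIZE);
                int n = Math.min(PAGE_SIZE, length - pos);
                System.arraycopy(store, page * PAGE_SIZE, out, pos, n);
            }
            return out;
        }

        /** The indices of the pages this stream owns, in logical order. */
        List<Integer> pages() {
            return pages;
        }
    }
}
```

If two streams take turns writing 3000-byte chunks three times each, the first ends up owning pages 0, 2 and 4 and the second pages 1, 3 and 5. Each stream reads back only its own bytes, because a stream never writes outside the pages in its own list.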
