[jira] [Commented] (PDFBOX-2883) Unify memory handling

Tilman Hausherr (JIRA) Tue, 29 Sep 2015 10:04:40 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935447#comment-14935447
 ]


Tilman Hausherr commented on PDFBOX-2883:
-----------------------------------------

Consider this code:
{code}
    @Test
    public void DeleteTest() throws FileNotFoundException
    {
        // Create a file that is not a valid PDF, so that parsing must fail.
        File f = new File("test.pdf");
        PrintWriter pw = new PrintWriter(new FileOutputStream(f));
        pw.write("<script language='JavaScript'>");
        pw.close();
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(f);
            Assert.fail("parsing should fail");
        }
        catch (IOException ex)
        {
            // expected
        }
        finally
        {
            // No document may be returned for an unparseable file.
            Assert.assertNull(doc);
        }
        try
        {
            // The file should no longer be in use once load() has failed.
            Files.delete(f.toPath());
        }
        catch (IOException ex)
        {
            Assert.fail("delete file after failed load() failed");
        }
    }
{code}
The test fails at Files.delete(): after load() has thrown, the file cannot be 
deleted because it is still in use.

One solution would be to change PDDocument.load(File file, ...) like this:
{code}
    public static PDDocument load(File file, String password, InputStream keyStore, String alias,
                                  MemoryUsageSetting memUsageSetting) throws IOException
    {
        RandomAccessBufferedFileInputStream raFile = new RandomAccessBufferedFileInputStream(file);
        PDFParser parser = new PDFParser(raFile, password, keyStore, alias,
                                         new ScratchFile(memUsageSetting));
        try
        {
            parser.parse();
        }
        finally
        {
            raFile.close();
        }
        return parser.getPDDocument();
    }
{code}
Surprisingly, this works for me. But is it correct, i.e. is the original file 
no longer used even after parsing succeeds?
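
If the answer to that question turns out to be "the file is still needed", an 
alternative would be to close the source only when parsing fails and to leave 
it open on success. The following is a minimal, self-contained sketch of that 
pattern. It uses only the JDK and no PDFBox classes: CloseOnFailureSketch, 
ParsedDoc and the simplified parse() are hypothetical stand-ins, not the real 
PDFParser or PDDocument.

```java
import java.io.Closeable;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of "close the source only if parsing fails".
// None of these names exist in PDFBox.
final class CloseOnFailureSketch
{
    /** Stand-in for a document that keeps reading from its source after parsing. */
    static final class ParsedDoc implements Closeable
    {
        private final RandomAccessFile source;

        ParsedDoc(RandomAccessFile source)
        {
            this.source = source;
        }

        long sourceLength() throws IOException
        {
            return source.length();
        }

        @Override
        public void close() throws IOException
        {
            source.close();
        }
    }

    static ParsedDoc load(File file) throws IOException
    {
        RandomAccessFile source = new RandomAccessFile(file, "r");
        boolean success = false;
        try
        {
            ParsedDoc doc = parse(source);
            success = true;
            return doc;
        }
        finally
        {
            // On failure nobody else will ever close the source, so do it here.
            // On success the returned document owns the source and closes it later.
            if (!success)
            {
                source.close();
            }
        }
    }

    // Simplified stand-in for parsing: only checks for a "%PDF-" header.
    private static ParsedDoc parse(RandomAccessFile source) throws IOException
    {
        byte[] header = new byte[5];
        if (source.length() < header.length)
        {
            throw new IOException("file too short for a header");
        }
        source.readFully(header);
        if (!"%PDF-".equals(new String(header, StandardCharsets.US_ASCII)))
        {
            throw new IOException("missing %PDF- header");
        }
        return new ParsedDoc(source);
    }
}
```

With this variant the caller has to close the returned document to release the 
file, whereas the unconditional raFile.close() above is only safe if nothing 
reads from the file after parse() returns.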

> Unify memory handling
> ---------------------
>
>                 Key: PDFBOX-2883
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2883
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 2.0.0
>            Reporter: Timo Boehme
>            Assignee: Timo Boehme
>             Fix For: 2.0.0
>
>         Attachments: MemoryUsage.java
>
>
> PDFBox now has at least two different mechanisms for choosing between main 
> memory and a temporary file for large data: when an input stream is 
> provided, the stream is copied to a temporary file, and all PDF streams that 
> are read are handled by RandomAccessBuffer/ScratchFile.
> In PDFBOX-2882 I re-implemented ScratchFile. The new implementation is quite 
> fast and allows setting a maximum amount of memory to be used for its pages 
> before it starts using the scratch file. It could be used as the general 
> 'backend' for all buffered streams and even for the file input stream copy. 
> As long as the PDF fits into the allowed maximum memory, it should be equally 
> fast as RandomAccessBuffer, while still allowing good control of memory usage 
> by going to the scratch file if needed. This prevents OOM in case of large 
> files.
> In order to use this, the PDDocument methods should be changed so that they 
> no longer have a 'useScratchFile' parameter but instead take a MemoryHandling 
> object that describes the buffering strategy (whether to use ScratchFile, 
> what amount of main memory can be used, ...).
> I've opened this issue for discussion. Since we need API changes in 
> PDDocument, it should be done before the 2.0 release.
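
The buffering strategy described in the issue (stay in main memory up to a 
configured limit, then fall back to a scratch file) can be illustrated with a 
small JDK-only sketch. SpillBufferSketch and its maxMainMemoryBytes parameter 
are hypothetical names chosen for this example. This is not the ScratchFile 
implementation from PDFBOX-2882, which works on pages rather than on a single 
growing buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the PDFBox ScratchFile implementation.
final class SpillBufferSketch implements Closeable
{
    private final long maxMainMemoryBytes;   // assumed limit, in bytes
    private ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private Path tempFile;                   // null while everything fits in memory
    private OutputStream fileOut;
    private long size;

    SpillBufferSketch(long maxMainMemoryBytes)
    {
        this.maxMainMemoryBytes = maxMainMemoryBytes;
    }

    void write(byte[] data) throws IOException
    {
        if (tempFile == null && size + data.length > maxMainMemoryBytes)
        {
            // Limit would be exceeded: move what is buffered so far to a temporary file.
            tempFile = Files.createTempFile("scratch", ".tmp");
            fileOut = Files.newOutputStream(tempFile);
            memory.writeTo(fileOut);
            memory = null;
        }
        if (tempFile == null)
        {
            memory.write(data, 0, data.length);
        }
        else
        {
            fileOut.write(data);
        }
        size += data.length;
    }

    boolean isOnDisk()
    {
        return tempFile != null;
    }

    long size()
    {
        return size;
    }

    byte[] toByteArray() throws IOException
    {
        if (tempFile == null)
        {
            return memory.toByteArray();
        }
        fileOut.flush();
        return Files.readAllBytes(tempFile);
    }

    @Override
    public void close() throws IOException
    {
        // Release the temporary file, if one was ever created.
        if (fileOut != null)
        {
            fileOut.close();
            Files.deleteIfExists(tempFile);
        }
    }
}
```

A caller would only pick the limit; whether the data ends up in memory or on 
disk is then an internal detail, which is the kind of control the proposed 
MemoryHandling object is meant to expose.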



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
