[jira] [Comment Edited] (PDFBOX-2200) Memory leak with org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects

John Hewson (JIRA) Thu, 13 Nov 2014 11:50:55 -0800

    [ https://issues.apache.org/jira/browse/PDFBOX-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211204#comment-14211204 ]


John Hewson edited comment on PDFBOX-2200 at 11/13/14 7:49 PM:
---------------------------------------------------------------

Yes, though it's not obvious from the code in PDSimpleFont and PDCIDFont. 
The put() and clear() calls in PDFont are fine, but the get() calls in those 
two subclasses are trickier. I've included the relevant code below. What you'll 
see is that cmapObjects is used as a cache, so entries are expected to be 
missing at times. Each call to cmapObjects.get() is checked for null and is 
eventually followed by a call to parseCmap, so it's safe, even though it's not 
pretty.

h6. PDCIDFont#determineEncoding
{code}
cmap = cmapObjects.get( cidSystemInfo );
if (cmap == null)
{
    InputStream cmapStream = null;
    try
    {
        cmapStream = ResourceLoader.loadResource(resourceRootCMAP + cidSystemInfo);
        if (cmapStream != null)
        {
            cmap = parseCmap(resourceRootCMAP, cmapStream);
            if (cmap == null)
            {
                log.error("Error: Could not parse predefined CMAP file for '" + cidSystemInfo + "'");
            }
            ...
{code}

h6. PDCIDFont#determineEncoding
{code}
if (encoding instanceof COSName) 
{
    if (cmap == null)
    {
        encodingName = (COSName)encoding;
        cmap = cmapObjects.get( encodingName.getName() );
        if (cmap == null) 
        {
            cmapName = encodingName.getName();
        }
    }
    ...
{code}
Much later in the same method is this code:
{code}
if (cmap == null && cmapName != null) 
{
    InputStream cmapStream = null;
    try 
    {
        // look for a predefined CMap with the given name
        cmapStream = ResourceLoader.loadResource(resourceRootCMAP + cmapName);
        if (cmapStream != null)
        {
            cmap = parseCmap(resourceRootCMAP, cmapStream);
            ...
{code}

h6. PDCIDFont#extractToUnicodeEncoding
{code}
else if ( toUnicode instanceof COSName)
{
    encodingName = (COSName)toUnicode;
    toUnicodeCmap = cmapObjects.get( encodingName.getName() );
    if (toUnicodeCmap == null) 
    {
        cmapName = encodingName.getName();
        String resourceName = resourceRootCMAP + cmapName;
        try 
        {
            toUnicodeCmap = parseCmap( resourceRootCMAP, ResourceLoader.loadResource( resourceName ));
        }
        catch(IOException exception) 
        {
            ...
{code}
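
The lookups above share one pattern: read the cache, treat null as a miss, and fall back to parsing. A minimal Java sketch of that pattern follows. It assumes the parsed result is stored back into the map, since the excerpts do not show where put() is called. CmapCacheSketch, lookup, parseCount and the String-valued map are hypothetical stand-ins, not PDFBox code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the cache lookups quoted above; not PDFBox code.
final class CmapCacheSketch
{
    // Plays the role of PDFont#cmapObjects: an entry may be absent at any time.
    private final Map<String, String> cache = new HashMap<>();

    // Counts parser invocations so the effect of a miss is observable.
    int parseCount = 0;

    // Look up first; on a miss (null), parse and remember the result.
    String lookup(String name, Function<String, String> parser)
    {
        String cmap = cache.get(name);
        if (cmap == null)
        {
            cmap = parser.apply(name);
            parseCount++;
            if (cmap != null)
            {
                // Assumption: the result is cached here; the excerpts do not show the put().
                cache.put(name, cmap);
            }
        }
        return cmap;
    }

    // Emptying the cache is harmless because every reader handles a miss.
    void clear()
    {
        cache.clear();
    }

    public static void main(String[] args)
    {
        CmapCacheSketch sketch = new CmapCacheSketch();
        Function<String, String> parser = name -> "parsed:" + name;
        System.out.println(sketch.lookup("Identity-H", parser));
        sketch.clear();
        System.out.println(sketch.lookup("Identity-H", parser));
        System.out.println("parses: " + sketch.parseCount);
    }
}
```

Because every reader tolerates a missing entry, clear() can run at any point; the only cost in this sketch is a repeated parse.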


> Memory leak with org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-2200
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2200
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Matthew Buckett
>             Fix For: 2.0.0
>
>
> We use Tika to extract text from a large number (10,000+) of PDFs in a 
> long-running JVM. After doing this for a while we started running short of 
> heap space. A heap dump shows that about 717MB of heap is retained through 
> org.apache.pdfbox.pdmodel.font.PDFont#cmapObjects and the hashmap has 18001 
> entries.
> PDFBOX-1009 looked to partially address this but it appears the symptoms are 
> still present. As a workaround I'm going to manually call 
> PDFont.clearResources() after indexing each document to prevent this 
> happening, but it would be better if I didn't have to.
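
The workaround described in the report, clearing the static cache after each document, can be sketched in self-contained Java. FontCacheStandIn is a hypothetical stand-in for PDFont and its static cmapObjects map, and extractText is a placeholder for the real Tika/PDFBox call. Only the name clearResources() comes from the report; everything else is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a class holding a static cache, as PDFont does with cmapObjects.
final class FontCacheStandIn
{
    static final Map<String, byte[]> CACHE = new HashMap<>();

    // Same name as the method mentioned in the report; here it only empties the stand-in map.
    static void clearResources()
    {
        CACHE.clear();
    }
}

final class ClearAfterEachDocument
{
    // Placeholder for text extraction: each document leaves one entry in the static cache.
    static String extractText(String documentId)
    {
        FontCacheStandIn.CACHE.put("cmap-for-" + documentId, new byte[1024]);
        return "text of " + documentId;
    }

    // Processes every document and returns the largest cache size observed between documents.
    static int indexAll(List<String> documentIds)
    {
        int maxEntries = 0;
        for (String id : documentIds)
        {
            try
            {
                extractText(id);
            }
            finally
            {
                // Runs even if extraction throws, so a failed document cannot leave entries behind.
                FontCacheStandIn.clearResources();
            }
            maxEntries = Math.max(maxEntries, FontCacheStandIn.CACHE.size());
        }
        return maxEntries;
    }

    public static void main(String[] args)
    {
        System.out.println(indexAll(Arrays.asList("a.pdf", "b.pdf", "c.pdf")));
    }
}
```

Putting the clear in a finally block keeps the cache bounded to one document's worth of entries in this sketch, at the cost of re-parsing any CMap that consecutive documents share.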



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
