PDFBox performance issue:  TextPosition performance tweak
---------------------------------------------------------

                 Key: PDFBOX-599
                 URL: https://issues.apache.org/jira/browse/PDFBOX-599
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
    Affects Versions: 0.8.0-incubator, 1.0.0
         Environment: All
            Reporter: Mel Martinez


During text extraction, the TextPosition.getX() and TextPosition.getY() methods 
are invoked multiple times on each TextPosition object.

The current code recalculates these values each time an accessor is invoked, 
even though the underlying state from which the values are derived has not 
changed.

This repeated work is slow.

The getters (getX() and getY()) should be changed to cache the X and Y 
values in instance fields and calculate them only once.

Specifically, the following two fields should be added:

    private float x = Float.NEGATIVE_INFINITY;
    private float y = Float.NEGATIVE_INFINITY;

The two methods should then be changed to look like so:

    public float getX()
    {
        if (x == Float.NEGATIVE_INFINITY)
        {
            x = getXRot(rot);
        }
        return x;
    }

    public float getY()
    {
        if (y == Float.NEGATIVE_INFINITY)
        {
            if ((rot == 0) || (rot == 180))
            {
                y = pageHeight - getYLowerLeftRot(rot);
            }
            else
            {
                y = pageWidth - getYLowerLeftRot(rot);
            }
        }
        return y;
    }

This provides a very noticeable speedup in the text extraction.
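
To illustrate the caching pattern in isolation, here is a minimal, self-contained sketch. It is not the PDFBox class: LazyPositionSketch, its constructor arguments, computeX(), computeY() and the computation counter are all hypothetical names made up for this example, and the arithmetic is only a stand-in for the real getXRot()/getYLowerLeftRot() calculations. It assumes that negative infinity is never a legitimate coordinate and that the underlying state does not change after construction.

```java
public class LazyPositionSketch
{
    // Sentinel meaning "not computed yet" (assumes no real coordinate is -Infinity).
    private static final float UNSET = Float.NEGATIVE_INFINITY;

    // Hypothetical underlying state; fixed after construction.
    private final float rawX;
    private final float rawY;
    private final float pageHeight;

    // Cached derived values, filled in on first access.
    private float x = UNSET;
    private float y = UNSET;

    // Counts how often the derived values are really computed (demonstration only).
    private int computations = 0;

    public LazyPositionSketch(float rawX, float rawY, float pageHeight)
    {
        this.rawX = rawX;
        this.rawY = rawY;
        this.pageHeight = pageHeight;
    }

    public float getX()
    {
        if (x == UNSET)
        {
            x = computeX();
        }
        return x;
    }

    public float getY()
    {
        if (y == UNSET)
        {
            y = computeY();
        }
        return y;
    }

    // Stand-in for the real, more expensive X calculation.
    private float computeX()
    {
        computations++;
        return rawX;
    }

    // Stand-in for the real Y calculation (flips the origin to the top of the page).
    private float computeY()
    {
        computations++;
        return pageHeight - rawY;
    }

    public int getComputations()
    {
        return computations;
    }

    public static void main(String[] args)
    {
        LazyPositionSketch p = new LazyPositionSketch(72f, 100f, 792f);
        // Call the getters many times, as text extraction would.
        for (int i = 0; i < 1000; i++)
        {
            p.getX();
            p.getY();
        }
        System.out.println("x=" + p.getX() + " y=" + p.getY()
                + " computations=" + p.getComputations());
    }
}
```

However many times the getters are called, each value is computed once, so the counter in this sketch ends at 2. If the underlying state could change after the first call, the cached fields would have to be reset to the sentinel at that point.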

I'll attach a version of the TextPosition.java class that includes this mod.
