[jira] Created: (PDFBOX-591) PDFBox performance issue: BaseParser.readUntilEndStream() rewrite

Mel Martinez (JIRA) Wed, 06 Jan 2010 15:03:16 -0800

PDFBox performance issue:  BaseParser.readUntilEndStream() rewrite
------------------------------------------------------------------


                 Key: PDFBOX-591
                 URL: https://issues.apache.org/jira/browse/PDFBOX-591
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
         Environment: all
            Reporter: Mel Martinez


Loading documents into PDFBox (PDDocument) is too slow.

One culprit is the method:
org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out)

The current implementation of this method uses a very slow test for the
end-of-stream condition. A profile of readUntilEndStream() shows that a huge
chunk of the method's processing time is consumed in the cmpCircularBuffer()
call, which is purely part of the test for the end-of-stream marker. In other
words, readUntilEndStream() is spending twice as much time testing for the
end-of-stream marker as it is reading bytes from the stream.

A better solution is a simpler, direct fail-fast conditional structure that
compares primitive byte values. I strongly recommend that the current method
be replaced with the code below. This speeds up readUntilEndStream() by a
little over a factor of 3 (a ratio of 113/37 = 3.05, to be more precise),
which in turn improves the overall performance of PDDocument.parse() by about
a factor of 2.7.

Note the addition of some constants (ASCII byte values) used to make the code
readable.

-----------------------------------------------------------------
    private static final int E = 101;
    private static final int N = 110;
    private static final int D = 100;
    
    private static final int S = 115;
    private static final int T = 116;
    private static final int R = 114;
    private static final int A = 97;
    private static final int M = 109;
    
    private static final int O = 111;
    private static final int B = 98;
    private static final int J = 106;
    
    private static boolean flag = true;
    
    /**
     * This method will read through the current stream object until
     * we find the keyword "endstream", meaning we're at the end of this
     * object. Some PDF files, however, omit the "endstream" keyword
     * and just close off objects with "endobj", so we have to handle
     * this case as well.
     * @param out The stream we write out to.
     * @throws IOException If there is an error reading from the stream.
     */
    private void readUntilEndStream( OutputStream out ) throws IOException{
        int byteRead;
        do{ //use a fail fast test for end of stream markers
            byteRead = pdfSource.read();
            if(byteRead==E){//only branch if "e"
                byteRead = pdfSource.read();
                if(byteRead==N){ //only continue branch if "en"
                    byteRead = pdfSource.read();
                    if(byteRead==D){//up to "end" now
                        byteRead = pdfSource.read();
                        if(byteRead==S){
                            byteRead = pdfSource.read();
                            if(byteRead==T){
                                byteRead = pdfSource.read();
                                if(byteRead==R){
                                    byteRead = pdfSource.read();
                                    if(byteRead==E){
                                        byteRead = pdfSource.read();
                                        if(byteRead==A){
                                            byteRead = pdfSource.read();
                                            if(byteRead==M){
                                                //found the whole marker
                                                pdfSource.unread( ENDSTREAM );
                                                return;
                                            }else{
                                                //not a marker: flush "endstrea"
                                                out.write(ENDSTREAM, 0, 8);
                                            }
                                        }else{
                                            out.write(ENDSTREAM, 0, 7);
                                        }
                                    }else{
                                        out.write(ENDSTREAM, 0, 6);
                                    }
                                }else{
                                    out.write(ENDSTREAM, 0, 5);
                                }
                            }else{
                                out.write(ENDSTREAM, 0, 4);
                            }
                        }else if(byteRead==O){
                            byteRead = pdfSource.read();
                            if(byteRead==B){
                                byteRead = pdfSource.read();
                                if(byteRead==J){
                                    //found whole marker
                                    pdfSource.unread( ENDOBJ );
                                    return;
                                }else{
                                    //not a marker: flush "endob"
                                    out.write(ENDOBJ, 0, 5);
                                }
                            }else{
                                out.write(ENDOBJ, 0, 4);
                            }
                        }else{
                            out.write(E);
                            out.write(N);
                            out.write(D);
                        }
                    }else{
                        out.write(E);
                        out.write(N);
                    }
                }else{
                    out.write(E);
                }
            }
            if(byteRead==E){
                //a mismatching "e" may itself start a marker, so push it
                //back and let the next pass of the loop examine it
                pdfSource.unread(byteRead);
            }else if(byteRead!=-1){
                out.write(byteRead);
            }

        }while(byteRead!=-1);
    }
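
For experimenting with the fail-fast idea outside of PDFBox, here is a minimal,
self-contained sketch. It is not the PDFBox implementation: the class name
EndStreamScanner and the table-driven inner loop are assumptions made for
illustration, and it uses a plain java.io.PushbackInputStream in place of
pdfSource. Most bytes fail the first comparison against "e" and are copied
straight through. Only after an "e" does the code try to match the rest of a
marker, and a mismatching byte is pushed back so that it can start a new match.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone class for illustration; not part of PDFBox.
public class EndStreamScanner {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDOBJ =
            "endobj".getBytes(StandardCharsets.US_ASCII);

    /**
     * Copies bytes from in to out until "endstream" or "endobj" is found.
     * The marker is pushed back onto in, so the pushback buffer must hold
     * at least ENDSTREAM.length bytes.
     */
    public static void readUntilEndStream(PushbackInputStream in, OutputStream out)
            throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b != 'e') {          // fail fast: the common case
                out.write(b);
                continue;
            }
            byte[] marker = ENDSTREAM;
            int matched = 1;         // the leading "e" is already matched
            while (matched < marker.length) {
                b = in.read();
                if (matched == 3 && b == 'o') {
                    marker = ENDOBJ; // both markers share the prefix "end"
                }
                if (b != marker[matched]) {
                    break;           // also taken at end of input (b == -1)
                }
                matched++;
            }
            if (matched == marker.length) {
                in.unread(marker);   // leave the marker for the caller
                return;
            }
            out.write(marker, 0, matched); // flush the partial match
            if (b != -1) {
                in.unread(b);        // re-examine the mismatching byte
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "some eendata endstream rest"
                .getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 16);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        readUntilEndStream(in, out);
        System.out.println(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

The pushback buffer size of 16 in main() is an arbitrary choice; anything of at
least 9 bytes (the length of "endstream") would do for this sketch.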


