[jira] Updated: (PDFBOX-591) PDFBox performance issue: BaseParser.readUntilEndStream() rewrite

Mel Martinez (JIRA) Thu, 14 Jan 2010 12:16:18 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mel Martinez updated PDFBOX-591:
--------------------------------

    Description: 
Loading documents into PDFBox (PDDocument) is too slow.

One culprit is the method:  
org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out)

The current implementation of this method uses a very slow test for end of 
stream conditions.  A profile of the readUntilEndStream() method shows that a 
huge chunk of the method's processing time is consumed in the 
cmpCircularBuffer() call, which is purely part of the test for the end of 
stream marker.  In other words, readUntilEndStream() is spending twice as 
much time testing for the end of stream marker as it is reading bytes from the 
stream.

A better solution is a simpler, direct fail-fast conditional structure that 
compares byte primitives: each byte is checked against the next expected 
character of the marker, and the test is abandoned at the first mismatch.  I 
strongly recommend that the current method be removed and replaced with the 
code below.  This speeds up readUntilEndStream() by a little over a factor of 
3 (a ratio of 113/37 = 3.05 if you want to be more precise), which in turn 
improves the overall performance of PDDocument.parse() by about a factor of 
2.7.

Note the addition of some byte constants used to make the code readable.

-----------------------------------------------------------------
    private static final int E = 101;
    private static final int N = 110;
    private static final int D = 100;
    
    private static final int S = 115;
    private static final int T = 116;
    private static final int R = 114;
    private static final int A = 97;
    private static final int M = 109;
    
    private static final int O = 111;
    private static final int B = 98;
    private static final int J = 106;
    
    
    /**
     * This method will read through the current stream object until
     * we find the keyword "endstream" meaning we're at the end of this
     * object. Some pdf files, however, forget to write some endstream tags
     * and just close off objects with an "endobj" tag so we have to handle
     * this case as well.
     * @param out The stream we write out to. 
     * @throws IOException
     */
    private void readUntilEndStream( OutputStream out ) throws IOException{
        int byteRead;
        do{ //use a fail fast test for end of stream markers
            byteRead = pdfSource.read();
            if(byteRead==E){//only branch if "e"
                byteRead = pdfSource.read();
                if(byteRead==N){ //only continue branch if "en"
                    byteRead = pdfSource.read();
                    if(byteRead==D){//up to "end" now
                        byteRead = pdfSource.read();
                        if(byteRead==S){
                            byteRead = pdfSource.read();
                            if(byteRead==T){
                                byteRead = pdfSource.read();
                                if(byteRead==R){
                                    byteRead = pdfSource.read();
                                    if(byteRead==E){
                                        byteRead = pdfSource.read();
                                        if(byteRead==A){
                                            byteRead = pdfSource.read();
                                            if(byteRead==M){
                                                //found the whole marker
                                                pdfSource.unread( ENDSTREAM );
                                                return;
                                            }else{
                                                //"endstrea" was data, keep it
                                                out.write(ENDSTREAM, 0, 8);
                                            }
                                        }else{
                                            out.write(ENDSTREAM, 0, 7);
                                        }
                                    }else{
                                        out.write(ENDSTREAM, 0, 6);
                                    }
                                }else{
                                    out.write(ENDSTREAM, 0, 5);
                                }
                            }else{
                                out.write(ENDSTREAM, 0, 4);
                            }
                        }else if(byteRead==O){
                            byteRead = pdfSource.read();
                            if(byteRead==B){
                                byteRead = pdfSource.read();
                                if(byteRead==J){
                                    //found whole marker
                                    pdfSource.unread( ENDOBJ );
                                    return;
                                }else{
                                    //"endob" was data, keep it
                                    out.write(ENDOBJ, 0, 5);
                                }
                            }else{
                                out.write(ENDOBJ, 0, 4);
                            }
                        }else{
                            out.write(E);
                            out.write(N);
                            out.write(D);
                        }
                    }else{
                        out.write(E);
                        out.write(N);
                    }
                }else{
                    out.write(E);
                }
            }
            //write the byte that ended the partial match (or any ordinary byte)
            if(byteRead!=-1)out.write(byteRead);

        }while(byteRead!=-1);
    }
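
For comparison, the same fail-fast idea can be written without the nested
conditionals. The following is only a sketch, not the attached patch: the
class EndMarkerScanner and the names readUntilEndMarker, restMatches, sourceOf
and PUSHBACK are hypothetical, and it assumes the source is a
java.io.PushbackInputStream with at least nine bytes of pushback capacity. On
a failed partial match it pushes the bytes back instead of writing them out,
so a marker that begins right after a stray "e" (as in "eendstream") should
still be detected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone helper; not part of PDFBox.
final class EndMarkerScanner {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    // Pushback capacity the source needs: the length of the longest marker.
    static final int PUSHBACK = ENDSTREAM.length;

    // Copies bytes from src to out until "endstream" or "endobj" comes next.
    // The marker is left unread in src. Returns false if src ends first.
    static boolean readUntilEndMarker(PushbackInputStream src, OutputStream out)
            throws IOException {
        int b;
        while ((b = src.read()) != -1) {
            // Fail fast: only a byte equal to 'e' can start either marker.
            if (b == 'e' && (restMatches(src, ENDSTREAM) || restMatches(src, ENDOBJ))) {
                src.unread(b);
                return true;
            }
            out.write(b);
        }
        return false;
    }

    // Tests whether marker[1..] comes next in src. Everything read here is
    // pushed back again, whether or not the marker matched.
    private static boolean restMatches(PushbackInputStream src, byte[] marker)
            throws IOException {
        int matched = 1; // marker[0] was already consumed by the caller
        int b = -1;
        while (matched < marker.length && (b = src.read()) == marker[matched]) {
            matched++;
        }
        boolean complete = matched == marker.length;
        if (!complete && b != -1) {
            src.unread(b); // the byte that broke the match
        }
        src.unread(marker, 1, matched - 1); // the bytes that did match
        return complete;
    }

    // Convenience for experiments: wraps text in a suitably sized source.
    static PushbackInputStream sourceOf(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new PushbackInputStream(new ByteArrayInputStream(bytes), PUSHBACK);
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream src = sourceOf("some stream data endstream endobj");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean found = readUntilEndMarker(src, out);
        System.out.println(found + ": ["
                + new String(out.toByteArray(), StandardCharsets.US_ASCII) + "]");
    }
}
```

This version makes extra read and unread calls on partial matches, so it may
give back some of the speed of the hand-unrolled method; whether that matters
would need to be measured against real documents.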

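The two speed-up figures quoted in the description can be related through
Amdahl's law. The sketch below solves that law for the share of parse time
that readUntilEndStream() would have to account for, assuming the whole
overall gain comes from this one method and taking both figures at face
value. The class and method names are hypothetical, and the result is an
inference from the quoted numbers, not a measurement.

```java
// Hypothetical helper for relating the two quoted speed-up figures.
final class SpeedupEstimate {
    // Amdahl's law, overall = 1 / ((1 - p) + p / part), solved for p:
    // the fraction of the original run time spent in the part that got faster.
    static double impliedFraction(double partSpeedup, double overallSpeedup) {
        return (1.0 - 1.0 / overallSpeedup) / (1.0 - 1.0 / partSpeedup);
    }

    public static void main(String[] args) {
        double methodSpeedup = 113.0 / 37.0; // ratio quoted in the description
        double parseSpeedup = 2.7;           // overall factor quoted in the description
        System.out.printf("implied share of parse time: %.3f%n",
                impliedFraction(methodSpeedup, parseSpeedup));
    }
}
```

Under those assumptions the method would have been responsible for roughly
94% of the original parse time, which is consistent with it being singled out
as a culprit.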

  was:
Identical to the description above, except that the byte constants were 
followed by one extra field, which readUntilEndStream() never references:

    private static boolean flag = true;


     Issue Type: Improvement  (was: Bug)

Changed from 'bug' to improvement.

A much needed improvement, though!

> PDFBox performance issue:  BaseParser.readUntilEndStream() rewrite
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-591
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-591
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: BaseParser.java
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
