[jira] Updated: (PDFBOX-591) PDFBox performance issue: BaseParser.readUntilEndStream() rewrite

Mel Martinez (JIRA) Wed, 06 Jan 2010 15:35:16 -0800

     [ 
https://issues.apache.org/jira/browse/PDFBOX-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mel Martinez updated PDFBOX-591:
--------------------------------

    Attachment: BaseParser.java

Tweaked version of BaseParser that improves the performance of the
readUntilEndStream() method.

> PDFBox performance issue:  BaseParser.readUntilEndStream() rewrite
> ------------------------------------------------------------------
>
>                 Key: PDFBOX-591
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-591
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: all
>            Reporter: Mel Martinez
>         Attachments: BaseParser.java
>
>
> Loading documents into PDFBox (PDDocument) is too slow.
> One culprit is the method:
> org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out)
> The current implementation of this method uses a very slow test for
> end-of-stream conditions. A profile of readUntilEndStream() shows that a
> huge chunk of the method's processing time is consumed by the
> cmpCircularBuffer() call, which exists purely to test for the end-of-stream
> marker. In other words, readUntilEndStream() spends twice as much time
> testing for the end-of-stream marker as it does reading bytes from the
> stream.
> A better solution is a simpler, direct, fail-fast conditional structure
> that compares byte primitives. I strongly recommend that the current method
> be removed and replaced with the code below. This speeds up
> readUntilEndStream() by a little over a factor of 3 (a ratio of
> 113/37 = 3.05 if you want to be more precise), which in turn improves the
> overall performance of PDDocument.parse() by about a factor of 2.7.
> Note the addition of some byte constants used to make the code readable.
> -----------------------------------------------------------------
>     private static final int E = 101;
>     private static final int N = 110;
>     private static final int D = 100;
>     
>     private static final int S = 115;
>     private static final int T = 116;
>     private static final int R = 114;
>     private static final int A = 97;
>     private static final int M = 109;
>     
>     private static final int O = 111;
>     private static final int B = 98;
>     private static final int J = 106;
>     
>     private static boolean flag = true;
>     
>     /**
>      * This method will read through the current stream object until
>      * we find the keyword "endstream" meaning we're at the end of this
>      * object. Some pdf files, however, forget to write some endstream tags
>      * and just close off objects with an "endobj" tag so we have to handle
>      * this case as well.
>      * @param out The stream we write out to. 
>      * @throws IOException
>      */
>     private void readUntilEndStream( OutputStream out ) throws IOException{
>       int byteRead;
>       do{ //use a fail fast test for end of stream markers
>         byteRead = pdfSource.read();
>         if(byteRead==E){//only branch if "e"
>           byteRead = pdfSource.read();
>           if(byteRead==N){ //only continue branch if "en"
>             byteRead = pdfSource.read();
>             if(byteRead==D){//up to "end" now
>               byteRead = pdfSource.read();
>               if(byteRead==S){
>                 byteRead = pdfSource.read();
>                 if(byteRead==T){
>                   byteRead = pdfSource.read();
>                   if(byteRead==R){
>                     byteRead = pdfSource.read();
>                     if(byteRead==E){
>                       byteRead = pdfSource.read();
>                       if(byteRead==A){
>                         byteRead = pdfSource.read();
>                         if(byteRead==M){
>                           //found the whole marker
>                           pdfSource.unread( ENDSTREAM );
>                           return;
>                         }else{
>                           //partial match "endstrea": flush it
>                           out.write(ENDSTREAM, 0, 8);
>                         }
>                       }else{
>                         out.write(ENDSTREAM, 0, 7);
>                       }
>                     }else{
>                       out.write(ENDSTREAM, 0, 6);
>                     }
>                   }else{
>                     out.write(ENDSTREAM, 0, 5);
>                   }
>                 }else{
>                   out.write(ENDSTREAM, 0, 4);
>                 }
>               }else if(byteRead==O){
>                 byteRead = pdfSource.read();
>                 if(byteRead==B){
>                   byteRead = pdfSource.read();
>                   if(byteRead==J){
>                     //found whole marker
>                     pdfSource.unread( ENDOBJ );
>                     return;
>                   }else{
>                     //partial match "endob": flush it
>                     out.write(ENDOBJ, 0, 5);
>                   }
>                 }else{
>                   out.write(ENDOBJ, 0, 4);
>                 }
>               }else{
>                 out.write(E);
>                 out.write(N);
>                 out.write(D);
>               }
>             }else{
>               out.write(E);
>               out.write(N);
>             }
>           }else{
>             out.write(E);
>           }
>         }
>         //write the byte that ended the match attempt (or a plain data byte)
>         if(byteRead!=-1)out.write(byteRead);
>       }while(byteRead!=-1);
>     }
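
The fail-fast scan proposed above can also be written as a flat loop. The
following is a minimal, self-contained sketch of that idea, not the attached
BaseParser.java: the class name EndMarkerScan, the method copyUntilEndMarker(),
the helper startsWith() and the boolean return value are all hypothetical names
chosen for illustration. It assumes the input is a java.io.PushbackInputStream
whose pushback buffer holds at least 9 bytes (the length of "endstream"). When
a partial match fails, only the first byte is written and the rest is pushed
back, so a marker that begins inside a failed match is still found.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFBox.
public class EndMarkerScan {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    /**
     * Copies bytes from in to out until "endstream" or "endobj" is found.
     * The marker is pushed back onto in. Returns false if the input ends first.
     * Assumption: in was created with a pushback buffer of at least 9 bytes.
     */
    static boolean copyUntilEndMarker(PushbackInputStream in, OutputStream out)
            throws IOException {
        byte[] look = new byte[ENDSTREAM.length]; // bytes of the current match attempt
        int b;
        while ((b = in.read()) != -1) {
            if (b != 'e') {          // fail fast: most bytes take this branch
                out.write(b);
                continue;
            }
            look[0] = (byte) b;
            int n = 1;
            boolean found = false;
            // Read ahead only while the bytes still match one of the markers.
            while (n < look.length) {
                int next = in.read();
                if (next == -1) {
                    break;
                }
                look[n++] = (byte) next;
                boolean stream = startsWith(ENDSTREAM, look, n);
                boolean obj = startsWith(ENDOBJ, look, n);
                if (!stream && !obj) {
                    break;           // mismatch: stop reading ahead
                }
                if ((stream && n == ENDSTREAM.length) || (obj && n == ENDOBJ.length)) {
                    found = true;    // whole marker read
                    break;
                }
            }
            if (found) {
                in.unread(look, 0, n);   // leave the marker for the caller
                return true;
            }
            // Not a marker: emit the first byte, re-scan everything after it.
            out.write(look[0]);
            in.unread(look, 1, n - 1);
        }
        return false;
    }

    /** True if the first n bytes of buf equal the first n bytes of marker. */
    private static boolean startsWith(byte[] marker, byte[] buf, int n) {
        if (n > marker.length) {
            return false;
        }
        for (int i = 0; i < n; i++) {
            if (buf[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "data eendo endstreendobj tail".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data), 16);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean found = copyUntilEndMarker(in, out);
        String copied = new String(out.toByteArray(), StandardCharsets.US_ASCII);
        System.out.println(found + " [" + copied + "]"); // prints: true [data eendo endstre]
    }
}
```

The sketch trades a little speed for simplicity: pushing bytes back after a
failed match means some bytes are read twice, whereas the nested version reads
each byte once. Whether that matters in practice would need to be measured.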

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
