[ https://issues.apache.org/jira/browse/PDFBOX-591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mel Martinez updated PDFBOX-591:
--------------------------------
Attachment: BaseParser.java
Tweaked version of BaseParser that improves the performance of the
readUntilEndStream() method.
> PDFBox performance issue: BaseParser.readUntilEndStream() rewrite
> ------------------------------------------------------------------
>
> Key: PDFBOX-591
> URL: https://issues.apache.org/jira/browse/PDFBOX-591
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: all
> Reporter: Mel Martinez
> Attachments: BaseParser.java
>
>
> Loading documents into PDFBox (PDDocument) is too slow.
> One culprit is the method:
> org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out)
> The current implementation of this method uses a very slow test for end of
> stream conditions. A profile of the readUntilEndStream() method shows that
> a huge chunk of the method's processing time is being consumed in the
> cmpCircularBuffer() call, which is purely part of the test for the end
> of stream marker. In other words, readUntilEndStream() is spending
> twice as much time testing for the end of stream marker as it is reading
> bytes from the stream.
> A better solution is to use a simpler, direct fail-fast conditional
> structure that tests byte primitives. I strongly recommend that the current
> method be removed and replaced with the code below. This speeds up the
> readUntilEndStream() method by a little over a factor of 3 (a ratio of
> 113/37 = 3.05 if you want to be more precise). This in turn improves the
> overall performance of PDDocument.parse() by about a factor of 2.7.
> Note the addition of some byte constants used to make the code readable.
> -----------------------------------------------------------------
> private static final int E = 101;
> private static final int N = 110;
> private static final int D = 100;
>
> private static final int S = 115;
> private static final int T = 116;
> private static final int R = 114;
> private static final int A = 97;
> private static final int M = 109;
>
> private static final int O = 111;
> private static final int B = 98;
> private static final int J = 106;
>
> private static boolean flag = true;
>
> /**
> * This method will read through the current stream object until
> * we find the keyword "endstream" meaning we're at the end of this
> * object. Some pdf files, however, forget to write some endstream tags
> * and just close off objects with an "endobj" tag so we have to handle
> * this case as well.
> * @param out The stream we write out to.
> * @throws IOException
> */
> private void readUntilEndStream( OutputStream out ) throws IOException{
>   int byteRead;
>   do{ //use a fail fast test for end of stream markers
>     byteRead = pdfSource.read();
>     if(byteRead==E){//only branch if "e"
>       byteRead = pdfSource.read();
>       if(byteRead==N){ //only continue branch if "en"
>         byteRead = pdfSource.read();
>         if(byteRead==D){//up to "end" now
>           byteRead = pdfSource.read();
>           if(byteRead==S){
>             byteRead = pdfSource.read();
>             if(byteRead==T){
>               byteRead = pdfSource.read();
>               if(byteRead==R){
>                 byteRead = pdfSource.read();
>                 if(byteRead==E){
>                   byteRead = pdfSource.read();
>                   if(byteRead==A){
>                     byteRead = pdfSource.read();
>                     if(byteRead==M){
>                       //found the whole marker
>                       pdfSource.unread( ENDSTREAM );
>                       return;
>                     }else{
>                       //"endstrea" was stream data, not a marker
>                       out.write(ENDSTREAM, 0, 8);
>                     }
>                   }else{
>                     out.write(ENDSTREAM, 0, 7);
>                   }
>                 }else{
>                   out.write(ENDSTREAM, 0, 6);
>                 }
>               }else{
>                 out.write(ENDSTREAM, 0, 5);
>               }
>             }else{
>               out.write(ENDSTREAM, 0, 4);
>             }
>           }else if(byteRead==O){
>             byteRead = pdfSource.read();
>             if(byteRead==B){
>               byteRead = pdfSource.read();
>               if(byteRead==J){
>                 //found whole marker
>                 pdfSource.unread( ENDOBJ );
>                 return;
>               }else{
>                 //"endob" was stream data, not a marker
>                 out.write(ENDOBJ, 0, 5);
>               }
>             }else{
>               out.write(ENDOBJ, 0, 4);
>             }
>           }else{
>             out.write(E);
>             out.write(N);
>             out.write(D);
>           }
>         }else{
>           out.write(E);
>           out.write(N);
>         }
>       }else{
>         out.write(E);
>       }
>     }
>     //write the byte that ended the (partial) match, unless at end of input
>     if(byteRead!=-1)out.write(byteRead);
>   }while(byteRead!=-1);
> }
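
For anyone who wants to experiment with the fail-fast idea outside of PDFBox, here is a minimal, self-contained sketch. It is not the attached BaseParser.java and it has not been profiled. The class name EndMarkerScan and the methods copyUntilMarker() and startsWith() are hypothetical names made up for this illustration, and a plain java.io.PushbackInputStream stands in for pdfSource. It keeps the same cheap first test (only a byte equal to 'e' triggers any further comparison) but pushes back a failed partial match, so a marker that starts right after a stray 'e' (for example "eendstream") should still be found.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only; not part of PDFBox.
final class EndMarkerScan {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);
    static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.US_ASCII);

    /**
     * Copies bytes from in to out until "endstream" or "endobj" is next in
     * the input. The marker is left unread in the stream. Returns the marker
     * that was found, or null if the input ended first. The stream needs a
     * pushback capacity of at least ENDSTREAM.length bytes.
     */
    static byte[] copyUntilMarker(PushbackInputStream in, OutputStream out)
            throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b == 'e') { // fail fast: only 'e' can start either marker
                in.unread(b);
                if (startsWith(in, ENDSTREAM)) return ENDSTREAM;
                if (startsWith(in, ENDOBJ)) return ENDOBJ;
                in.read(); // the 'e' was ordinary data; consume it again
            }
            out.write(b);
        }
        return null;
    }

    /** Tests whether the upcoming bytes equal marker; the input position is restored. */
    static boolean startsWith(PushbackInputStream in, byte[] marker)
            throws IOException {
        byte[] seen = new byte[marker.length];
        int n = 0;
        boolean match = true;
        while (n < marker.length) {
            int b = in.read();
            if (b == -1) { match = false; break; }             // input ended
            seen[n++] = (byte) b;
            if (b != marker[n - 1]) { match = false; break; }  // first mismatch
        }
        in.unread(seen, 0, n); // push back everything that was read
        return match;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "abc eendstream xyz".getBytes(StandardCharsets.US_ASCII);
        PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(data), ENDSTREAM.length);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] marker = copyUntilMarker(in, out);
        // prints "abc e|endstream"
        System.out.println(new String(out.toByteArray(), StandardCharsets.US_ASCII)
                + "|" + new String(marker, StandardCharsets.US_ASCII));
    }
}
```

The trade-off compared with the nested version quoted above is a few extra read/unread calls whenever an 'e' turns up in the stream data, in exchange for much shorter code. Whether that costs anything measurable would have to be checked with a profiler on real documents.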
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.