PDFBox performance issue: BaseParser.readUntilEndStream() rewrite
------------------------------------------------------------------
Key: PDFBOX-591
URL: https://issues.apache.org/jira/browse/PDFBOX-591
Project: PDFBox
Issue Type: Bug
Components: Parsing
Affects Versions: 0.8.0-incubator
Environment: all
Reporter: Mel Martinez
Loading documents into PDFBox (PDDocument) is too slow.
One culprit is the method:
org.apache.pdfbox.pdfparser.BaseParser.readUntilEndStream(OutputStream out)
The current implementation of this method uses a very slow test for the
end-of-stream condition. A profile of readUntilEndStream() shows that a large
share of the method's processing time is consumed by the cmpCircularBuffer()
call, which exists purely to test for the end-of-stream marker. In other
words, readUntilEndStream() spends twice as much time testing for the
end-of-stream marker as it does reading bytes from the stream.
A better solution is a simple, direct, fail-fast chain of conditionals that
compares primitive byte values. I strongly recommend replacing the current
method with the code below. This speeds up readUntilEndStream() by a little
over a factor of 3 (a ratio of 113/37 = 3.05, to be more precise), which in
turn improves the overall performance of PDDocument.parse() by about a factor
of 2.7.
Note the added byte constants, which make the code readable.
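Before the replacement code, a quick cross-check of the two figures above. This is only a sketch: it assumes that nothing else in the parse changed between the two measurements and that Amdahl's law applies, and the class and method names are hypothetical, not part of PDFBox. Under those assumptions, a 3.05x speed-up in one method yielding a 2.7x overall speed-up implies that readUntilEndStream() accounted for roughly 94% of the original parse time.

```java
// Hypothetical helper for illustration only; not part of PDFBox.
class SpeedupCheck {

    /**
     * Amdahl's law, overall = 1 / ((1 - p) + p / local), solved for p,
     * the fraction of the original run time spent in the optimized method.
     */
    static double fractionInMethod(double overallSpeedup, double localSpeedup) {
        return (1.0 - 1.0 / overallSpeedup) / (1.0 - 1.0 / localSpeedup);
    }

    public static void main(String[] args) {
        // 113/37 is the measured method-level ratio; 2.7 is the overall figure.
        double p = fractionInMethod(2.7, 113.0 / 37.0);
        System.out.println("estimated fraction of parse time in the method: " + p);
    }
}
```

If the real share were much lower than this estimate, the overall figure would have to come from something other than this one method.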
-----------------------------------------------------------------
private static final int E = 101;
private static final int N = 110;
private static final int D = 100;
private static final int S = 115;
private static final int T = 116;
private static final int R = 114;
private static final int A = 97;
private static final int M = 109;
private static final int O = 111;
private static final int B = 98;
private static final int J = 106;
private static boolean flag = true;
/**
* This method will read through the current stream object until
* we find the keyword "endstream" meaning we're at the end of this
* object. Some pdf files, however, forget to write some endstream tags
* and just close off objects with an "endobj" tag so we have to handle
* this case as well.
* @param out The stream we write out to.
* @throws IOException
*/
private void readUntilEndStream( OutputStream out ) throws IOException{
    int byteRead;
    do{ //use a fail fast test for end of stream markers
        byteRead = pdfSource.read();
        if(byteRead==E){//only branch if "e"
            byteRead = pdfSource.read();
            if(byteRead==N){ //only continue branch if "en"
                byteRead = pdfSource.read();
                if(byteRead==D){//up to "end" now
                    byteRead = pdfSource.read();
                    if(byteRead==S){
                        byteRead = pdfSource.read();
                        if(byteRead==T){
                            byteRead = pdfSource.read();
                            if(byteRead==R){
                                byteRead = pdfSource.read();
                                if(byteRead==E){
                                    byteRead = pdfSource.read();
                                    if(byteRead==A){
                                        byteRead = pdfSource.read();
                                        if(byteRead==M){
                                            //found the whole marker
                                            pdfSource.unread( ENDSTREAM );
                                            return;
                                        }else{
                                            //"endstrea" was stream data
                                            out.write(ENDSTREAM, 0, 8);
                                        }
                                    }else{
                                        out.write(ENDSTREAM, 0, 7);
                                    }
                                }else{
                                    out.write(ENDSTREAM, 0, 6);
                                }
                            }else{
                                out.write(ENDSTREAM, 0, 5);
                            }
                        }else{
                            out.write(ENDSTREAM, 0, 4);
                        }
                    }else if(byteRead==O){
                        byteRead = pdfSource.read();
                        if(byteRead==B){
                            byteRead = pdfSource.read();
                            if(byteRead==J){
                                //found whole marker
                                pdfSource.unread( ENDOBJ );
                                return;
                            }else{
                                //"endob" was stream data
                                out.write(ENDOBJ, 0, 5);
                            }
                        }else{
                            out.write(ENDOBJ, 0, 4);
                        }
                    }else{
                        out.write(E);
                        out.write(N);
                        out.write(D);
                    }
                }else{
                    out.write(E);
                    out.write(N);
                }
            }else{
                out.write(E);
            }
        }
        if(byteRead==E){
            //only reachable with a mismatched "e" after a partial marker;
            //push it back so it can start a new match
            pdfSource.unread(byteRead);
        }else if(byteRead!=-1){
            out.write(byteRead);
        }
    }while(byteRead!=-1);
}
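
For anyone who wants to try the fail-fast idea outside of PDFBox, here is a minimal, self-contained sketch. It is not the method above: it swaps the nested conditionals for a small loop-based prefix check, uses a plain java.io.PushbackInputStream in place of pdfSource, and the class and helper names (FailFastScanDemo, startsWith, copyUntilMarker) are hypothetical. It keeps the same two properties: bytes other than "e" take a single comparison, and a found marker is pushed back for the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of PDFBox.
class FailFastScanDemo {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);
    static final byte[] ENDOBJ = "endobj".getBytes(StandardCharsets.ISO_8859_1);

    /** Copies bytes to out until "endstream" or "endobj"; the marker stays unread. */
    static void readUntilEndStream(PushbackInputStream in, OutputStream out)
            throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b != 'e') {          // fast path: one comparison per data byte
                out.write(b);
                continue;
            }
            in.unread(b);            // let the prefix check see the whole marker
            if (startsWith(in, ENDSTREAM) || startsWith(in, ENDOBJ)) {
                return;              // marker found and left in the stream
            }
            out.write(in.read());    // the "e" was ordinary data
        }
    }

    /** Peeks at the stream, stopping at the first mismatch; restores all bytes read. */
    static boolean startsWith(PushbackInputStream in, byte[] marker) throws IOException {
        byte[] seen = new byte[marker.length];
        int n = 0;
        boolean match = true;
        while (n < marker.length) {
            int b = in.read();
            if (b == -1) { match = false; break; }
            seen[n] = (byte) b;
            n++;
            if (seen[n - 1] != marker[n - 1]) { match = false; break; }
        }
        in.unread(seen, 0, n);
        return match;
    }

    /** Convenience wrapper: returns the data that precedes the first marker. */
    static String copyUntilMarker(String data) {
        // Pushback capacity must be at least the longest marker (9 bytes).
        PushbackInputStream in = new PushbackInputStream(
            new ByteArrayInputStream(data.getBytes(StandardCharsets.ISO_8859_1)), 16);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            readUntilEndStream(in, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(copyUntilMarker("abc endstre data eendstream tail"));
    }
}
```

The sample input in main() contains a partial marker ("endstre") and an "e" immediately before the real marker, the two cases that are easiest to get wrong in a hand-unrolled version.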