[ 
https://issues.apache.org/jira/browse/PDFBOX-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alfred updated PDFBOX-4875:
---------------------------
    Description: 
I am testing text extraction from PDFs and profiling the execution.

I found that the second-biggest time consumer is the static initialization 
code in Standard14Fonts, which loads fonts from the PDFBox jar.

Looking at the code, I realized we don't have to load all fonts statically 
when the class loads.

Not all PDFs need all fonts, so if we lazy-loaded them only when needed, it 
would save some time and some memory.
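
To illustrate the idea, here is a minimal sketch of on-demand loading. It is not the actual Standard14Fonts code: the class name LazyFontCache, the loader function, and the loadedCount helper are all hypothetical, and the loader stands in for whatever reads a font's metrics from the jar.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical lazy cache: nothing is loaded when the class is initialized;
 * each entry is loaded the first time it is requested and reused afterwards.
 */
final class LazyFontCache<M> {
    // Entries that have been loaded so far, keyed by font name.
    private final Map<String, M> loaded = new ConcurrentHashMap<>();
    // Assumed loader, e.g. something that parses one metrics file from the jar.
    private final Function<String, M> loader;

    LazyFontCache(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Returns the entry for fontName, invoking the loader only on first use. */
    M get(String fontName) {
        return loaded.computeIfAbsent(fontName, loader);
    }

    /** Number of entries loaded so far. */
    int loadedCount() {
        return loaded.size();
    }
}
```

With this shape, a document that only uses one of the standard fonts pays for one load instead of fourteen. ConcurrentHashMap.computeIfAbsent keeps the first-use load safe if several threads ask for the same font at once.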

 

  was:
I am testing text extraction from PDFs and profiling the execution.

I found that the second-biggest time consumer is the static initialization 
code in Standard14Fonts, which loads fonts from the PDFBox jar.

The culprit seems to be the direct use of the stream returned by 
getResourceAsStream.
That would be a ZipInputStream when using PDFBox as a jar.

Using a buffered stream around it reduces the load time a lot.
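
The buffering mentioned in this earlier description could look roughly like the sketch below. The class BufferedResource and its methods are hypothetical helpers, not PDFBox code; only Class.getResourceAsStream and BufferedInputStream are standard Java APIs.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper showing a buffered wrapper around a resource stream. */
final class BufferedResource {
    private BufferedResource() {
    }

    /** Wraps a raw stream so that small reads are served from an in-memory buffer. */
    static InputStream buffered(InputStream raw) {
        return new BufferedInputStream(raw);
    }

    /** Opens a classpath resource with buffering; returns null if it does not exist. */
    static InputStream open(Class<?> owner, String resourcePath) {
        InputStream raw = owner.getResourceAsStream(resourcePath);
        return raw == null ? null : buffered(raw);
    }

    /** Reads the stream to the end, closes it, and returns its bytes. */
    static byte[] readAll(InputStream in) {
        try (InputStream stream = in; ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = stream.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The buffer matters most for parsers that read a byte or a few bytes at a time, since each small read would otherwise go straight to the underlying decompressing stream.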

 


> Lazy load standard 14 fonts, only if needed
> -------------------------------------------
>
>                 Key: PDFBOX-4875
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4875
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Text extraction
>    Affects Versions: 2.0.20, 3.0.0 PDFBox
>            Reporter: Alfred
>            Priority: Major
>              Labels: Optimization
>             Fix For: 2.0.21, 3.0.0 PDFBox
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> I am testing text extraction from PDFs and profiling the execution.
> I found that the second-biggest time consumer is the static initialization 
> code in Standard14Fonts, which loads fonts from the PDFBox jar.
> Looking at the code, I realized we don't have to load all fonts statically 
> when the class loads.
> Not all PDFs need all fonts, so if we lazy-loaded them only when needed, it 
> would save some time and some memory.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
