[ https://issues.apache.org/jira/browse/PDFBOX-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781785#comment-16781785 ]
Tilman Hausherr commented on PDFBOX-4477:
-----------------------------------------
I also tried a smaller file (755045.pdf) and the times are about the same.
I'll probably commit that latest change too… reasoning:
Most arrays / dictionaries won't be referenced by several other objects. Those
that are referenced more than once are few (e.g. fonts), and such objects are
also found in the confidential file.
When we do have a complex / huge file, it is the memory footprint that could
"kill" an application, while a possibly slower response time (that I haven't
been able to show) would not.
> Large encrypted file takes days to be parsed
> --------------------------------------------
>
> Key: PDFBOX-4477
> URL: https://issues.apache.org/jira/browse/PDFBOX-4477
> Project: PDFBox
> Issue Type: Bug
> Components: Crypto, Parsing
> Affects Versions: 2.0.14
> Reporter: Tilman Hausherr
> Priority: Major
> Labels: optimization
> Fix For: 2.0.15, 3.0.0 PDFBox
>
>
> As reported by [~slavago] in TIKA-2832. File is confidential but I have it.
> Initial findings:
> - File is AES256 encrypted with empty user password
> - File has about 1000 objects
> - File is a tagged PDF
> - HashMap in SecurityHandler grows to 100000?!
> - Using an IdentityHashMap speeds up the process dramatically (parsed in a
> few seconds), and it may also be a better solution than what was done in
> PDFBOX-4453
> Todo:
> - Read description of IdentityHashMap again
> - Find out why the HashMap grows so much. Could it be that identical objects
> are stored twice? Or does the file have many direct objects?
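
For readers unfamiliar with the difference mentioned above, here is a minimal
sketch in plain Java of how equality-based and identity-based tracking differ.
It does not use PDFBox classes: the List<Integer> keys are only a stand-in for
container objects such as arrays or dictionaries, and the class and method
names are hypothetical. It is not the SecurityHandler code and does not explain
why the map in the report grew so large.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from the PDFBox sources.
public class IdentityTrackingSketch {

    // Tracks "already seen" objects with equals()/hashCode().
    // For a list, hashCode() walks all elements on every lookup.
    static int countTrackedByEquality(List<List<Integer>> objects) {
        Set<List<Integer>> seen = new HashSet<List<Integer>>();
        for (List<Integer> o : objects) {
            seen.add(o);
        }
        return seen.size();
    }

    // Tracks "already seen" objects by reference (==) and
    // System.identityHashCode(), independent of the content.
    static int countTrackedByIdentity(List<List<Integer>> objects) {
        Set<List<Integer>> seen = Collections.newSetFromMap(
                new IdentityHashMap<List<Integer>, Boolean>());
        for (List<Integer> o : objects) {
            seen.add(o);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Integer> a = new ArrayList<Integer>(Arrays.asList(1, 2, 3));
        // Same content as a, but a distinct instance.
        List<Integer> b = new ArrayList<Integer>(Arrays.asList(1, 2, 3));

        List<List<Integer>> objects = new ArrayList<List<Integer>>();
        objects.add(a);
        objects.add(b);
        objects.add(a); // the same instance seen a second time

        System.out.println(countTrackedByEquality(objects)); // prints 1
        System.out.println(countTrackedByIdentity(objects)); // prints 2
    }
}
```

With equality-based tracking, a and b collapse into one entry because their
contents are equal. With identity-based tracking they stay separate, and only
the repeated instance is recognised as already seen.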