[ https://issues.apache.org/jira/browse/PDFBOX-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781741#comment-16781741 ]

Tilman Hausherr commented on PDFBOX-4477:
-----------------------------------------

With the current code, the file takes 2.6 seconds to open (best of several 
attempts). With code that keeps only strings and streams in the sets, it 
takes 2.4 seconds, and the element count goes down to 252869.
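
A rough sketch of what "keeps only strings and streams in the sets" could look like. It rests on the fact that PDF encryption applies to strings and streams, so containers such as dictionaries have nothing of their own to decrypt and need not be remembered. All class and method names here (PdfString, PdfStream, PdfDictionary, shouldDecrypt) are hypothetical stand-ins, not the PDFBox API, and this is not the committed change.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class DecryptedTrackerSketch {
    // Hypothetical stand-ins for parsed PDF object types (not PDFBox classes).
    interface PdfObject {}
    static final class PdfString implements PdfObject {}
    static final class PdfStream implements PdfObject {}
    static final class PdfDictionary implements PdfObject {}

    // Identity-based set: remembers instances, never compares content.
    private final Set<PdfObject> decrypted =
            Collections.newSetFromMap(new IdentityHashMap<>());

    // Returns true only the first time a given string or stream instance is seen.
    // Containers are never stored, which keeps the set small.
    boolean shouldDecrypt(PdfObject obj) {
        if (!(obj instanceof PdfString || obj instanceof PdfStream)) {
            return false;
        }
        return decrypted.add(obj);
    }

    int trackedCount() {
        return decrypted.size();
    }

    public static void main(String[] args) {
        DecryptedTrackerSketch tracker = new DecryptedTrackerSketch();
        PdfString s = new PdfString();
        System.out.println(tracker.shouldDecrypt(s));                   // first visit
        System.out.println(tracker.shouldDecrypt(s));                   // already handled
        System.out.println(tracker.shouldDecrypt(new PdfDictionary())); // container, not tracked
        System.out.println(tracker.trackedCount());
    }
}
```

In this sketch a caller would still walk into dictionaries and arrays to reach their children; only the bookkeeping for them is skipped.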


> Large encrypted file takes days to be parsed
> --------------------------------------------
>
>                 Key: PDFBOX-4477
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4477
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Crypto, Parsing
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: optimization
>             Fix For: 2.0.15, 3.0.0 PDFBox
>
>
> As reported by [~slavago] in TIKA-2832. File is confidential but I have it. 
> Initial findings:
> - File is AES256 encrypted with empty user password
> - File has about 1000 objects
> - File is a tagged PDF
> - HashMap in SecurityHandler grows to 100000?!
> - Using an IdentityHashMap speeds up the process dramatically (parsed in a 
> few seconds), and it may also be a better solution than what was done in 
> PDFBOX-4453
> Todo:
> - Read description of IdentityHashMap again
> - Find out why the HashMap grows so much. Could it be that identical objects 
> are stored twice? Or does the file have many direct objects?
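
For reference, a minimal Java sketch of how a HashSet and an identity-based set differ. Node is a hypothetical stand-in whose equals/hashCode depend on its content; whether the PDFBox objects involved behave this way is exactly the open question in the todo list, so this illustrates the JDK semantics only, not the cause of the slowdown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class IdentitySetSketch {
    // Hypothetical object with content-based equality: hashing walks all children,
    // so each lookup costs time proportional to the size of the subtree.
    static final class Node {
        final List<Object> children = new ArrayList<>();

        @Override
        public boolean equals(Object o) {
            return o instanceof Node && children.equals(((Node) o).children);
        }

        @Override
        public int hashCode() {
            return children.hashCode();
        }
    }

    // Uses equals/hashCode: equal-content instances collapse into one entry.
    static int sizeWithHashSet(List<Node> nodes) {
        return new HashSet<>(nodes).size();
    }

    // Uses reference equality (==) and System.identityHashCode:
    // every distinct instance is its own entry, and content is never inspected.
    static int sizeWithIdentitySet(List<Node> nodes) {
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        seen.addAll(nodes);
        return seen.size();
    }

    public static void main(String[] args) {
        Node a = new Node();
        Node b = new Node(); // same (empty) content as a, different instance
        List<Node> nodes = Arrays.asList(a, b, a);
        System.out.println(sizeWithHashSet(nodes));     // prints 1
        System.out.println(sizeWithIdentitySet(nodes)); // prints 2
    }
}
```

Note that IdentityHashMap deliberately violates the general Map contract by comparing keys with == instead of equals, which is acceptable when the goal is to remember which instances have already been processed.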



