[ 
https://issues.apache.org/jira/browse/PDFBOX-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781696#comment-16781696
 ] 

Tilman Hausherr commented on PDFBOX-4477:
-----------------------------------------

The last commit is just about whether object types are put into the set at all.
The commit before it didn't introduce a new check; it reverted the code to its
state before PDFBOX-4453 (and refactored a bit) but replaced the set type. The
huge set size is always there.
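
For readers unfamiliar with what "replacing the set type" changes, here is a
minimal sketch of the difference between an equals()-based set and an
identity-based one. It is not the PDFBox code: the class and method names are
hypothetical, and plain java.util lists stand in for COS objects. It only
illustrates the general behaviour of HashSet versus a set backed by
IdentityHashMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from PDFBox.
public class IdentitySetSketch {

    // Counts distinct entries using equals()/hashCode(). For container-like
    // objects this means hashing (and possibly comparing) their contents.
    public static int distinctByEquals(List<?> objects) {
        Set<Object> visited = new HashSet<>();
        for (Object o : objects) {
            visited.add(o);
        }
        return visited.size();
    }

    // Counts distinct entries by reference (==). The objects' own
    // equals()/hashCode() are never called; System.identityHashCode is used.
    public static int distinctByIdentity(List<?> objects) {
        Set<Object> visited =
                Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        for (Object o : objects) {
            visited.add(o);
        }
        return visited.size();
    }

    public static void main(String[] args) {
        // Two separate list instances with equal contents (stand-ins for
        // two distinct objects that happen to compare equal).
        List<String> a = new ArrayList<>(Arrays.asList("x"));
        List<String> b = new ArrayList<>(Arrays.asList("x"));
        List<Object> seen = Arrays.<Object>asList(a, b, a);

        System.out.println("equals-based: " + distinctByEquals(seen));     // prints "equals-based: 1"
        System.out.println("identity-based: " + distinctByIdentity(seen)); // prints "identity-based: 2"
    }
}
```

With an identity-based set, the cost of add() and contains() does not depend
on how large or deeply nested the stored object is, which is one plausible
reason such a switch could speed things up. Whether that is the whole story
for this file is still open.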

I did think about putting only COSString and COSStream into the set, and not
COSDictionary or COSArray, but that would mean that these would still be
processed one level.

The real mystery (which I am investigating) is why there are so many objects.

> Large encrypted file takes days to be parsed
> --------------------------------------------
>
>                 Key: PDFBOX-4477
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4477
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Crypto, Parsing
>    Affects Versions: 2.0.14
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: optimization
>             Fix For: 2.0.15, 3.0.0 PDFBox
>
>
> As reported by [~slavago] in TIKA-2832. File is confidential but I have it. 
> Initial findings:
> - File is AES256 encrypted with empty user password
> - File has about 1000 objects
> - File is a tagged PDF
> - HashMap in SecurityHandler grows to 100000?!
> - Using an IdentityHashMap speeds up the process dramatically (parsed in a 
> few seconds), and it may also be a better solution than what was done in
> PDFBOX-4453
> Todo:
> - Read description of IdentityHashMap again
> - Find out why the HashMap grows so much. Could it be that identical objects 
> are stored twice? Or does the file have many direct objects?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

