https://bugs.documentfoundation.org/show_bug.cgi?id=66580
Dave Gilbert <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[email protected]
                   |                            |om

--- Comment #30 from Dave Gilbert <[email protected]> ---
(Copying Caolán in due to the hybrid performance hack)

Hi Tomaz,
  Thanks for replying.
For context, one reason I'm asking about this stuff is that I just fixed the PDF import filter to handle PDF2 encryption, but not with hybrid yet. I was looking at how to fix that, and I know a fairly easy way to do it - but understanding the details of hybrid seems to make sense first!

>> a) Without a trailer, I don't understand how a hybrid loader is supposed
>> to understand that a PDF embedded file is actually a hybrid PDF rather than
>> just a PDF that happens to have an attached; A PDF could have loads of
>> embedded files and none of them actually represent the same contents as the
>> PDF (e.g. maybe include raw data spreadsheets or something with an
>> explanatory PDF).

> Using a filename convention "Original.od*" and making sure it's a compatible
> document.
> Or in PDF 2.0 - the embedded file that is /AFRelationship is /Source.

OK, I don't know the details of PDF 2.0 yet, so I'll take your word for that; but is that something the writer already does?

>> b) caolanm's performance hack - 'detectHasAdditionalStreams' from ~2023 -
>> sniffs the trailer at the end of the file to make a quick detection about
>> whether it might be a hybrid. So if the use of the trailer disappears you
>> lose that performance trick.

> Well, that's unfortunate but reading the xref entry and checking the objects
> for an embedded file is also fast and can spare you from loading the whole PDF.

Before I break it, I'd like to give Caolán (copied in) a chance to comment.
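
For illustration, the kind of tail sniff described in (b) might look roughly like the Python sketch below. It is not the actual 'detectHasAdditionalStreams' code: the function name `might_be_hybrid`, the 4096-byte window, and the `/AdditionalStreams` trailer key (guessed from the function name) are all assumptions made for this example.

```python
import os

# Assumed trailer key; inferred from the name 'detectHasAdditionalStreams',
# not confirmed anywhere in this thread.
MARKER = b"/AdditionalStreams"
TAIL_BYTES = 4096  # arbitrary window size chosen for this sketch


def might_be_hybrid(path, tail_bytes=TAIL_BYTES):
    """Cheap pre-check: read only the end of the file and look for MARKER
    after the last 'trailer' keyword. No PDF parsing is done."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read()

    pos = tail.rfind(b"trailer")
    if pos == -1:
        # No classic trailer in the window (e.g. the file uses an xref
        # stream), so this shortcut cannot say anything useful.
        return False
    return MARKER in tail[pos:]
```

The sketch also shows why the shortcut depends on the trailer: once the marker is no longer written there, the check returns False for every file and the loader has to fall back to reading the xref and the embedded-file objects.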

My feeling is that Caolán's hack is unreasonably fast - it was driven by this ticket:
https://github.com/CollaboraOnline/online/issues/7307

> Also BTW, we don't write those to the trailer anymore, if we export to PDF/A
> variant or PDF/UA is enabled.

Hmm, so we don't - we really should have a test for each of the hybrid types we support to make sure we can read them! We've already got a couple in there, but not any of the new types (which we can't read!)

Also:
  - Do we have a 'spec' of hybrid anywhere?
  - If we change it, what are the rules on compatibility for old docs?
  - My guess for DocumentChecksum had been that it was intended to stop the PDF and the original content getting out of sync - but that would want a checksum of the embedded document, not the whole file, wouldn't it?
  - If we needed to check the embedded file name and/or the AFRelationship you mentioned, is that doable on a file prior to decryption?
  - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the same thing I did for PDF2 encrypted non-hybrid, and use Poppler to do the hybrid extraction; then we can lose our extra PDF parser code.

-- 
You are receiving this mail because:
You are the assignee for the bug.
