https://bugs.documentfoundation.org/show_bug.cgi?id=66580
--- Comment #32 from Tomaz Vajngerl <[email protected]> ---
(In reply to Dave Gilbert from comment #30)
> (Copying Caolán in due to the hybrid performance hack)
> Hi Tomaz,
> Thanks for replying.
> For context, one reason I'm asking about this stuff is that I just fixed the
> PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> was looking at how to fix that, and know a fairly easy way to do it - but
> understanding the details of hybrid seems to make sense first!
>
> >> a) Without a trailer, I don't understand how a hybrid loader is supposed
> >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> >> just a PDF that happens to have an attached; A PDF could have loads of
> >> embedded files and none of them actually represent the same contents as the
> >> PDF (e.g. maybe include raw data spreadsheets or something with an
> >> explanatory PDF).
>
> > Using a filename convention "Original.od*" and making sure it's a
> > compatible document.
> > Or in PDF 2.0 - the embedded file that is /AFRelationship is /Source.
>
> OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> that; but is that something the writer already does?

It does in master when the output format is PDF 2.0.

Another fairly reliable way to detect a hybrid file is to check whether there
is exactly one embedded file and that file is ODF, which is what you get with
hybrid files. If there are more, leave it to the user to open the correct
source in the PDF viewer - if it is indeed a hybrid file.

> Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
> My feeling is that Caolán hack is unreasonably fast - it was driven by this
> ticket:
> https://github.com/CollaboraOnline/online/issues/7307

Well, sure, but parsing only the PDF document structure is not very complex
either. The problem is that we do a full PDF parse (I guess) in
getAdditionalStream by calling the pdfparse::PDFReader::read method.
Instead we could have used a modified PDFDocument::Read just for detection,
one that doesn't do any unnecessary work.

So to detect an embedded file you can read the xref table, then check the
trailer for /Root, which should point you to the object with the /Catalog
dictionary, and from there you need to get the /Names entry and check
/EmbeddedFiles... Those steps should be fast, as you just jump from one
location in the file to another and do minimal reads.

Example of a PDF Catalog dictionary with embedded files:

16 0 obj
<<
  /Type/Catalog
  /Pages 6 0 R
  /Names <<
    /EmbeddedFiles <<
      /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
    >>
  >>
  /OpenAction[3 0 R /XYZ null null 0]
  /Lang(en-GB)
>>
endobj

> Hmm so we don't - we really should have a test for each of the hybrid types
> we support to make sure we can read it! We've already got a couple in
> there, but not any of the new types (which we can't read!)

As I said, you can just go into the PDF viewer and open the embedded document
there. I think this should be the preferred way of doing it anyway, so I don't
think it's such a big problem. The biggest issue is probably making users
aware that they can also access the original document this way.

> Also:
> - Do we have a 'spec' of hybrid anywhere?

No.

> - If we change it what's the rules on compatibility for old docs?

Compatibility must not break, so reading /AdditionalStreams should still be
supported. Writing /AdditionalStreams is probably something we can stop doing
at some point (for plain PDFs).

> - My guess for DocumentChecksum had been that it was intended to stop the
> pdf and the original content getting out of sync - but that would want a
> checksum of the embedded document, not the whole file wouldn't it?

All I can see is that we check that the PDF file is not corrupted by checking
the hashes. I don't think it has any purpose, except maybe to prevent opening
the source document if the PDF was modified (for example, an annotation was
added), which may not actually make sense. I would remove that completely.

> - If we needed to check the embedded file name and/or the AFRelationship
> you mentioned, is that doable on a file prior to decryption?

The PDF document structure is not completely encrypted: you can still see most
of the structure, but not the content of streams, so you can see that there is
an embedded file and the type of the file.

> - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> same thing I did for PDF2 encrypted non-hybrid, and use Poppler to do the
> hybrid extraction; then we can lose our extra PDF parser code.

The only problem is that Poppler is optional, so it might not be available,
which means this functionality would not be available either. But then the
general issue is that this is implemented inside the PDF import code and not
as an independent filter, which could easily be done. Also, we mostly use
PDFium in core nowadays for various tasks, and I think it is much more likely
to become available by default than Poppler, which is only used for PDF
import.
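
To illustrate the last step of the lookup described earlier in this comment
(Catalog -> /Names -> /EmbeddedFiles), here is a minimal Python sketch. It is
not LibreOffice code. It takes the example Catalog object from this comment as
text, decodes the hex-encoded file name, and applies the "exactly one embedded
file, and it is ODF" heuristic. The function names, the regular expressions
and the extension list are assumptions made for the illustration. A real
detector would resolve indirect references, follow /Kids in the name tree,
handle literal (...) strings, and verify the embedded file's content rather
than trusting its extension.

```python
import re

# The Catalog object from the example in this comment, as plain text.
CATALOG = """16 0 obj
<< /Type/Catalog /Pages 6 0 R
   /Names << /EmbeddedFiles << /Names
     [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R] >> >>
   /OpenAction[3 0 R /XYZ null null 0] /Lang(en-GB) >>
endobj"""

# Assumed list of ODF extensions, for illustration only.
ODF_EXTENSIONS = (".odt", ".ods", ".odp", ".odg")


def decode_pdf_hex_string(hex_digits):
    """Decode the digits of a PDF hex string <...> into text."""
    digits = "".join(hex_digits.split())  # whitespace inside <...> is ignored
    if len(digits) % 2:
        digits += "0"  # PDF rule: a missing final digit is treated as 0
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):  # byte order mark: UTF-16BE text string
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")  # rough stand-in for PDFDocEncoding


def embedded_file_names(catalog_text):
    """Return the names in a flat, inline /EmbeddedFiles name tree."""
    tree = re.search(r"/EmbeddedFiles\s*<<\s*/Names\s*\[(.*?)\]",
                     catalog_text, re.DOTALL)
    if tree is None:
        return []
    # Each entry is a hex string key followed by an indirect reference.
    keys = re.findall(r"<([0-9A-Fa-f\s]*)>\s*\d+\s+\d+\s+R", tree.group(1))
    return [decode_pdf_hex_string(key) for key in keys]


def likely_hybrid_source(names):
    """Return the file name if it is the only embedded file and looks like ODF."""
    if len(names) == 1 and names[0].lower().endswith(ODF_EXTENSIONS):
        return names[0]
    return None


if __name__ == "__main__":
    names = embedded_file_names(CATALOG)
    print(names, likely_hybrid_source(names))
```

With more than one embedded file, likely_hybrid_source returns None, which
corresponds to leaving the choice to the user in the PDF viewer.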
