https://bugs.documentfoundation.org/show_bug.cgi?id=66580
--- Comment #34 from Dave Gilbert <[email protected]> ---
(In reply to Tomaz Vajngerl from comment #32)
> (In reply to Dave Gilbert from comment #30)
> > (Copying Caolán in due to the hybrid performance hack)
> > Hi Tomaz,
> >   Thanks for replying.
> > For context, one reason I'm asking about this stuff is that I just fixed
> > the PDF import filter to handle PDF2 encryption, but not with hybrid yet,
> > and I was looking at how to fix that, and know a fairly easy way to do
> > it - but understanding the details of hybrid seems to make sense first!
> >
> > >> a) Without a trailer, I don't understand how a hybrid loader is
> > >> supposed to understand that a PDF embedded file is actually a hybrid
> > >> PDF rather than just a PDF that happens to have an attachment; a PDF
> > >> could have loads of embedded files and none of them actually represent
> > >> the same contents as the PDF (e.g. maybe include raw data spreadsheets
> > >> or something with an explanatory PDF).
> >
> > > Using a filename convention "Original.od*" and making sure it's a
> > > compatible document.
> > > Or in PDF 2.0 - the embedded file that is /AFRelationship is /Source.
> >
> > OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> > that; but is that something the writer already does?
>
> It does in master when the output format is PDF 2.0.
>
> Also a fairly good type of detection is to check if there is just one
> embedded file that is ODF. That's what will be the case in case of hybrid
> files. If there are more, leave it up to the user to open the correct
> source in the PDF viewer - if it indeed is a hybrid file.

OK, that feels fairly simple; does that cover all old Hybrid files?

> > Before I break it, I'd like to give Caolán (copied in) a chance to say
> > stuff.
> > My feeling is that Caolán's hack is unreasonably fast - it was driven by
> > this ticket:
> > https://github.com/CollaboraOnline/online/issues/7307
>
> Well, sure, but parsing the PDF document structure only is not very complex
> either. The problem is that we do a full PDF parse (I guess) in
> getAdditionalStream by calling the pdfparse::PDFReader::read method.
>
> Instead we could've used a modified PDFDocument::Read just for detection
> that doesn't do any unnecessary things. So to detect an embedded file you
> can read the xref table, then check the trailer for /Root which should
> point you to the object with the /Catalog dictionary and from there you
> need to get the /Names entry and check the /EmbeddedFiles... Those things
> should be fast as you just jump around in the file from one location to
> the other and do minimal reads.
>
> Example of PDF Catalog dictionary with embedded files:
> 16 0 obj
> <<
>   /Type/Catalog
>   /Pages 6 0 R
>   /Names
>   <<
>     /EmbeddedFiles
>     <<
>       /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
>     >>
>   >>
>   /OpenAction[3 0 R /XYZ null null 0]
>   /Lang(en-GB)
> >>
> endobj
>

<See below>

> > Hmm so we don't - we really should have a test for each of the hybrid
> > types we support to make sure we can read it! We've already got a couple
> > in there, but not any of the new types (which we can't read!)
>
> As I said you can just go into the PDF viewer and open the embedded
> document there. I think this should be the preferred way of doing it
> anyway, so I don't think it's such a big problem, but we need to make
> users aware that they can access the original document in this way as
> well, and this is probably the biggest issue.
>
> > Also:
> > - Do we have a 'spec' of hybrid anywhere?
>
> No.

Feels like we should.

> > - If we change it what's the rules on compatibility for old docs?
>
> Not breaking the compatibility, so reading /AdditionalStreams should still
> be supported.
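A minimal sketch of the name check on the catalog quoted above, under stated assumptions: the /EmbeddedFiles name is a hex string holding UTF-16BE text with a leading FEFF byte order mark, and a hybrid source is recognised by the "Original.od*" filename convention mentioned earlier in the thread. decodeHexName and looksLikeHybridSource are hypothetical helpers, not existing LibreOffice functions, and only ASCII names are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: decode the body of a PDF hex string (without the
// angle brackets) that holds UTF-16BE text starting with an FEFF BOM.
// Only code units below 0x80 are accepted; anything else yields nullopt.
std::optional<std::string> decodeHexName(const std::string& hex)
{
    if (hex.size() < 4 || hex.size() % 4 != 0)
        return std::nullopt;

    std::string out;
    for (std::size_t i = 0; i < hex.size(); i += 4)
    {
        // Assemble one 16-bit code unit from four hex digits.
        std::uint32_t unit = 0;
        for (std::size_t j = 0; j < 4; ++j)
        {
            const char c = hex[i + j];
            int digit = 0;
            if (c >= '0' && c <= '9')
                digit = c - '0';
            else if (c >= 'A' && c <= 'F')
                digit = c - 'A' + 10;
            else if (c >= 'a' && c <= 'f')
                digit = c - 'a' + 10;
            else
                return std::nullopt;
            unit = unit * 16 + static_cast<std::uint32_t>(digit);
        }

        if (i == 0)
        {
            // The first code unit must be the byte order mark.
            if (unit != 0xFEFF)
                return std::nullopt;
            continue;
        }
        if (unit >= 0x80)
            return std::nullopt; // non-ASCII is out of scope for this sketch
        out.push_back(static_cast<char>(unit));
    }
    return out;
}

// Hypothetical helper: apply the "Original.od*" convention, assumed here to
// mean the prefix "Original.od" followed by at least one more character.
bool looksLikeHybridSource(const std::string& name)
{
    const std::string prefix = "Original.od";
    return name.size() > prefix.size()
           && name.compare(0, prefix.size(), prefix) == 0;
}
```

Fed the hex string from the catalog example, decodeHexName yields "Original.odt", which looksLikeHybridSource accepts. A real check would still need to confirm the MIME type and that the stream is a compatible document.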

Gah OK, that was a bit I hadn't realised - but then we do have
./sw/qa/extras/pdf/data/Hybrid_AdditionalStreamsOnly.pdf as a test for that
case, where there is no PDF EmbeddedFile.

So we're stuck with having to do both of them for a long time:
  a) Check for an AdditionalStreams (required for old old files)
  b) Check for an EmbeddedFile named Original.o* with the correct MIME type
     (required for current PDF/UA, etc) - not currently done

> Writing /AdditionalStreams is probably something we can stop adding at
> some point (for plain PDFs).
>
> > - My guess for DocumentChecksum had been that it was intended to stop
> > the pdf and the original content getting out of sync - but that would
> > want a checksum of the embedded document, not the whole file, wouldn't
> > it?
>
> All I can see is that we check the PDF file is not corrupted by checking
> the hashes - I don't think it has any purpose except maybe to not prevent
> opening the source document if the PDF was modified (for example -
> annotation was added), which may not make sense actually. I would remove
> that completely.

OK, that's not hard.

> > - If we needed to check the embedded file name and/or the AFRelationship
> > you mentioned, is that doable on a file prior to decryption?
>
> PDF document structure is not completely encrypted - you can still see
> most of the structure, but not the content of streams, so you see that
> there is an embedded file and the type of the file.

OK.

> > - My relatively easy way of fixing the PDF2 encrypted hybrids is to do
> > the same thing I did for PDF2 encrypted non-hybrid, and use Poppler to
> > do the hybrid extraction; then we can lose our extra PDF parser code.
>
> The problem is only that Poppler is optional, so it might not be
> available, which means this functionality will also not be available. But
> then the issue in general is that this is implemented inside PDF import
> code and not as an independent filter, which could be done easily.
>
> Also we use PDFium mostly in the core nowadays for various tasks and I
> think that will become available by default much more likely than Poppler,
> which is only used for PDF import.

I'm up for rewriting the hybrid Detect/extract using PDFium rather than the
other inbuilt parser in sdext/source/pdfimport/pdfparse, but the need to keep
parsing the trailer for AdditionalStreams might make that a lot more tricky;
Poppler parses the trailer and it's easy just to ask it for it:

  pDocUnique->getXRef()->getTrailerDict()->dictLookup("AdditionalStreams")

(Which seems to work - add error checking to taste)

I'm not seeing a similar interface in PDFium - the nearest seems to be
'getTrailerEnds' - which I think is just start and end bytes of (each??)
trailer.
Do we have anything that parses a trailer using PDFium?

-- 
You are receiving this mail because:
You are the assignee for the bug.
