https://bugs.documentfoundation.org/show_bug.cgi?id=66580

--- Comment #34 from Dave Gilbert <[email protected]> ---
(In reply to Tomaz Vajngerl from comment #32)
> (In reply to Dave Gilbert from comment #30)
> > (Copying Caolán in due to the hybrid performance hack)
> > Hi Tomaz,
> >   Thanks for replying.
> > For context, one reason I'm asking about this stuff is that I just fixed the
> > PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> > was looking at how to fix that, and know a fairly easy way to do it - but
> > understanding the details of hybrid seems to make sense first!
> > 
> > >>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> > >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> > >> just a PDF that happens to have an attachment; a PDF could have loads of
> > >> embedded files and none of them actually represent the same contents as the
> > >> PDF (e.g. maybe include raw data spreadsheets or something with an
> > >> explanatory PDF).
> > 
> > > Using a filename convention "Original.od*" and making sure it's a 
> > > compatible document. 
> > > Or in PDF 2.0 - the embedded file whose /AFRelationship is /Source.
> > 
> > OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> > that; but is that something the writer already does?
> 
> It does in master when the output format is PDF 2.0. 
> 
> Also a fairly good type of detection is to check if there is just one
> embedded file that is ODF. That's what will be the case in case of hybrid
> files. If there are more leave it up to the user to open the correct source
> in the PDF viewer - if it indeed is a hybrid file.  

OK, that feels fairly simple; does that cover all old hybrid files?
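
The "just one embedded file that is ODF" check suggested above could be
sketched roughly as below. This is only an illustration: EmbeddedFile and
isLikelyHybrid are hypothetical names, not existing LibreOffice code, and it
assumes the embedded file's MIME type has already been read and that every
ODF MIME type starts with "application/vnd.oasis.opendocument.".

```cpp
#include <string>
#include <vector>

// Hypothetical summary of one PDF embedded file (not an existing type).
struct EmbeddedFile
{
    std::string name;     // decoded file name, e.g. "Original.odt"
    std::string mimeType; // MIME type recorded for the embedded stream
};

// Assumption: all ODF MIME types share this prefix.
bool isOdfMimeType(const std::string& mime)
{
    const std::string prefix = "application/vnd.oasis.opendocument.";
    return mime.compare(0, prefix.size(), prefix) == 0;
}

// Treat the PDF as hybrid only when there is exactly one embedded file
// and that file is ODF; with more attachments, leave it to the user.
bool isLikelyHybrid(const std::vector<EmbeddedFile>& files)
{
    return files.size() == 1 && isOdfMimeType(files[0].mimeType);
}
```

With two or more attachments this deliberately returns false, matching the
"leave it up to the user" suggestion in the quoted text.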

> > Before I break it, I'd like to give Caolán (copied in) a chance to say 
> > stuff.
> > My feeling is that Caolán's hack is unreasonably fast - it was driven by this
> > ticket:
> > https://github.com/CollaboraOnline/online/issues/7307
> 
> Well, sure, but parsing the PDF document structure only is not very complex
> either. The problem is that we do a full PDF parse (I guess) in
> getAdditionalStream with calling the pdfparse::PDFReader::read method. 
> 
> Instead we could've used a modified PDFDocument::Read just for detection
> that doesn't do any unnecessary things. So to detect an embedded file you
> can read the xref table, then check the trailer for /Root which should point
> you to the object with the /Catalog dictionary and from there you need to
> get the /Names entry and check the /EmbeddedFiles... Those things should be
> fast as you just jump around in the file from one location to the other and
> do minimal reads.
> 
> Example of PDF Catalog dictionary with embedded files:
> 16 0 obj
> <<
>   /Type/Catalog
>   /Pages 6 0 R
>   /Names 
>   <<
>     /EmbeddedFiles 
>     <<
>       /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
>     >>
>   >>
>   /OpenAction[3 0 R /XYZ null null 0]
>   /Lang(en-GB)
> >>
> endobj
> 

<See below>
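
For reference, the name in the quoted /EmbeddedFiles entry is a hex string in
UTF-16BE with a leading FEFF byte-order mark, and it decodes to
"Original.odt". A minimal sketch of that decoding, assuming the angle
brackets have already been stripped and only ASCII-range characters matter
(decodeAsciiUtf16Hex is a hypothetical helper, not an existing function):

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Value of one hex digit, or -1 if the character is not a hex digit.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes the body of a PDF hex string holding UTF-16BE text with a BOM.
// Sketch only: returns nullopt for malformed input or non-ASCII code units.
std::optional<std::string> decodeAsciiUtf16Hex(const std::string& hex)
{
    // Four hex digits make one UTF-16 code unit.
    if (hex.size() < 4 || hex.size() % 4 != 0)
        return std::nullopt;

    std::string out;
    for (std::size_t i = 0; i < hex.size(); i += 4)
    {
        unsigned unit = 0;
        for (std::size_t j = 0; j < 4; ++j)
        {
            const int v = hexValue(hex[i + j]);
            if (v < 0)
                return std::nullopt;
            unit = unit * 16 + static_cast<unsigned>(v);
        }
        if (i == 0)
        {
            if (unit != 0xFEFF) // first unit must be the byte-order mark
                return std::nullopt;
            continue;
        }
        if (unit >= 0x80) // non-ASCII is out of scope for this sketch
            return std::nullopt;
        out.push_back(static_cast<char>(unit));
    }
    return out;
}
```

A real implementation would also need to handle literal (parenthesised)
strings, PDFDocEncoding and non-ASCII names.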

> > Hmm so we don't - we really should have a test for each of the hybrid types
> > we support to make sure we can read it!  We've already got a couple in
> > there, but not any of the new types (which we can't read!)
> 
> As I said you can just go into the PDF viewer and open the embedded document
> there. I think this should be the preferred way of doing it anyway, so I
> don't think it's such a big problem, but we need to make users aware that
> they can access the original document in this way as well, and this is
> probably the biggest issue.
> 
> > Also:
> >   - Do we have a 'spec' of hybrid anywhere?
> 
> No.

Feels like we should.

> >   - If we change it what's the rules on compatibility for old docs?
> 
> Not breaking the compatibility, so reading /AdditionalStreams should still
> be supported. 

Gah OK, that was a bit I hadn't realised - but then we do have 
    ./sw/qa/extras/pdf/data/Hybrid_AdditionalStreamsOnly.pdf 

as a test for that case, where there is no PDF EmbeddedFile.

So we're stuck with having to do both of them for a long time:
  a) Check for an AdditionalStreams (required for old old files)
  b) Check for an EmbeddedFile named Original.o* with the correct MIME type
(required for current PDF/UA, etc)   - not currently done
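
As a rough sketch of how the two checks might sit together (all names here
are hypothetical; trying a) before b) and treating "correct MIME" as "starts
with application/vnd.oasis.opendocument." are both assumptions on my part):

```cpp
#include <string>
#include <vector>

// Which mechanism, if any, marks the PDF as hybrid.
enum class HybridSource { None, AdditionalStreams, EmbeddedFile };

// Hypothetical summary of what a cheap scan of the PDF found.
struct Attachment
{
    std::string name;     // decoded embedded file name
    std::string mimeType; // MIME type of the embedded stream
};

struct PdfFacts
{
    bool trailerHasAdditionalStreams; // a) old-style trailer entry
    std::vector<Attachment> attachments; // b) PDF embedded files
};

bool startsWith(const std::string& s, const std::string& prefix)
{
    return s.compare(0, prefix.size(), prefix) == 0;
}

HybridSource classifyHybrid(const PdfFacts& pdf)
{
    // a) /AdditionalStreams in the trailer, needed for old files.
    if (pdf.trailerHasAdditionalStreams)
        return HybridSource::AdditionalStreams;

    // b) An embedded file named Original.o* with an ODF MIME type.
    for (const Attachment& a : pdf.attachments)
    {
        if (startsWith(a.name, "Original.o")
            && startsWith(a.mimeType, "application/vnd.oasis.opendocument."))
            return HybridSource::EmbeddedFile;
    }
    return HybridSource::None;
}
```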

> Writing /AdditionalStreams is probably something we can stop adding at some
> point (for plain PDFs).


> 
> >   - My guess for DocumentChecksum had been that it was intended to stop the
> > pdf and the original content getting out of sync - but that would want a
> > checksum of the embedded document, not the whole file wouldn't it?
> 
> All I can see is that we check the PDF file is not corrupted by checking the
> hashes - I don't think it has any purpose except maybe to not prevent
> opening the source document if the PDF was modified (for example -
> annotation was added), which may not make sense actually. I would remove
> that completely. 

OK, that's not hard.

> >   - If we needed to check the embedded file name and/or the  AFRelationship
> > you mentioned, is that doable on a file prior to decryption?
> 
> PDF document structure is not completely encrypted - you can still see the
> most of the structure, but not the content of streams, so you see that there
> is an embedded file and the type of the file. 

OK.

> >   - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> > same thing I did for PDF2 encrypted non-hybrid, and use Poppler to do the
> > hybrid extraction; then we can lose our extra PDF parser code.
> 
> The problem is only that Poppler is optional, so it might not be available,
> which means this functionality will also not be available. But then the
> issue in general is that this is implemented inside PDF import code and not
> as an independent filter, which could be done easily.
> 
> Also we use PDFium mostly in the core nowadays for various tasks and I think
> that will become available by default much more likely than Poppler, which
> is only used for PDF import.

I'm up for rewriting the hybrid detect/extract using PDFium rather than the
other inbuilt parser in sdext/source/pdfimport/pdfparse, but the need to keep
parsing the trailer for AdditionalStreams might make that a lot more tricky;
Poppler parses the trailer and it's easy just to ask it for it:

  pDocUnique->getXRef()->getTrailerDict()->dictLookup("AdditionalStreams")
(Which seems to work - add error checking to taste)
I'm not seeing a similar interface in PDFium - the nearest seems to be
'getTrailerEnds' - which I think is just start and end bytes of (each??)
trailer.
Do we have anything that parses a trailer using PDFium?
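
For what it's worth, the Poppler lookup above with the error checking filled
in might look something like the sketch below. It is written as a template
so that it compiles without the Poppler headers; with Poppler, Doc would be
PDFDoc. getXRef, getTrailerDict and dictLookup are the calls from the line
above; isOk, isDict and isNull are what I believe Poppler's PDFDoc and Object
provide, but treat those as assumptions to check against the headers.
hasAdditionalStreams is a hypothetical name.

```cpp
// Sketch: returns true when the trailer dictionary has an
// /AdditionalStreams entry. Doc is expected to be Poppler's PDFDoc
// (assumption); every step bails out rather than dereferencing null.
template <typename Doc>
bool hasAdditionalStreams(Doc* doc)
{
    if (!doc || !doc->isOk())
        return false; // document failed to open or parse

    auto* xref = doc->getXRef();
    if (!xref)
        return false; // no cross-reference data

    auto* trailer = xref->getTrailerDict();
    if (!trailer || !trailer->isDict())
        return false; // trailer missing or not a dictionary

    // Assumption: a missing key comes back as a null object.
    auto entry = trailer->dictLookup("AdditionalStreams");
    return !entry.isNull();
}
```

Called as hasAdditionalStreams(pDocUnique.get()), assuming pDocUnique is a
std::unique_ptr<PDFDoc>.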
