https://bugs.documentfoundation.org/show_bug.cgi?id=66580

--- Comment #32 from Tomaz Vajngerl <[email protected]> ---
(In reply to Dave Gilbert from comment #30)
> (Copying Caolán in due to the hybrid performance hack)
> Hi Tomaz,
>   Thanks for replying.
> For context, one reason I'm asking about this stuff is that I just fixed the
> PDF import filter to handle PDF2 encryption, but not with hybrid yet, and I
> was looking at how to fix that, and know a fairly easy way to do it - but
> understanding the details of hybrid seems to make sense first!
> 
> >>   a) Without a trailer, I don't understand how a hybrid loader is supposed
> >> to understand that a PDF embedded file is actually a hybrid PDF rather than
> >> just a PDF that happens to have an attachment; A PDF could have loads of
> >> embedded files and none of them actually represent the same contents as the
> >> PDF (e.g. maybe include raw data spreadsheets or something with an
> >> explanatory PDF).
> 
> > Using a filename convention "Original.od*" and making sure it's a 
> > compatible document. 
> > Or in PDF 2.0 - the embedded file whose /AFRelationship is /Source.
> 
> OK, I don't know the details of PDF 2.0 yet, so I'll take your word for
> that; but is that something the writer already does?

It does in master when the output format is PDF 2.0. 

Another fairly good way of detecting it is to check if there is just one
embedded file that is ODF. That will be the case for hybrid files. If there
are more, leave it up to the user to open the correct source in the PDF
viewer - if it is indeed a hybrid file.
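
As a rough illustration of that check (Python for brevity, not the C++ of the
import filter): the sketch assumes the embedded files have already been
extracted into a name-to-bytes mapping, and it reads "just one embedded file
that is ODF" as "exactly one attachment, and that attachment is an ODF
package". The helper names is_odf_package and pick_hybrid_source are made up
for this example. An ODF package is recognised by its "mimetype" zip entry.

```python
import io
import zipfile
from typing import Mapping, Optional

ODF_MIME_PREFIX = b"application/vnd.oasis.opendocument."


def is_odf_package(payload: bytes) -> bool:
    """True if payload is a zip whose 'mimetype' entry names an ODF type."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)) as package:
            mimetype = package.read("mimetype")
    except (zipfile.BadZipFile, KeyError):
        # Not a zip at all, or a zip without a 'mimetype' entry.
        return False
    return mimetype.startswith(ODF_MIME_PREFIX)


def pick_hybrid_source(embedded: Mapping[str, bytes]) -> Optional[str]:
    """Return the attachment name if it is the only one and is ODF, else None."""
    if len(embedded) != 1:
        return None  # zero or several attachments: leave the choice to the user
    ((name, payload),) = embedded.items()
    return name if is_odf_package(payload) else None
```

With several attachments the function deliberately returns None rather than
guessing, which matches leaving the choice to the user.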

> Before I break it, I'd like to give Caolán (copied in) a chance to say stuff.
> My feeling is that Caolán's hack is unreasonably fast - it was driven by this
> ticket:
> https://github.com/CollaboraOnline/online/issues/7307

Well, sure, but parsing only the PDF document structure is not very complex
either. The problem is that we do a full PDF parse (I guess) in
getAdditionalStream by calling the pdfparse::PDFReader::read method.

Instead we could've used a modified PDFDocument::Read just for detection that
doesn't do any unnecessary work. So to detect an embedded file you can read
the xref table, then check the trailer for /Root, which should point you to the
object with the /Catalog dictionary, and from there you need to get the /Names
entry and check for /EmbeddedFiles... Those things should be fast as you just
jump around in the file from one location to the other and do minimal reads.
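
To make the jumps concrete, here is a minimal sketch in Python (not the
PDFDocument code, and all function names are invented for the example). It
assumes a classic "xref" table with a "trailer" keyword, so PDFs that use
cross-reference streams, incremental updates or object streams are not handled,
and it looks for the keys with plain pattern matching instead of a real
tokenizer.

```python
import re


def read_xref_table(data: bytes):
    """Return ({object number: byte offset}, trailer bytes) from a classic xref table."""
    startxref = data.rindex(b"startxref")
    xref_pos = int(data[startxref + len(b"startxref"):].split()[0])
    trailer_pos = data.index(b"trailer", xref_pos)
    offsets = {}
    obj_num = 0
    for line in data[xref_pos:trailer_pos].splitlines():
        parts = line.split()
        if len(parts) == 2:        # subsection header: first object number, count
            obj_num = int(parts[0])
        elif len(parts) == 3:      # entry: offset, generation, 'n' (in use) or 'f' (free)
            if parts[2] == b"n":
                offsets[obj_num] = int(parts[0])
            obj_num += 1
    return offsets, data[trailer_pos:startxref]


def object_bytes(data: bytes, offsets, num: int) -> bytes:
    """Slice one indirect object, from its 'N G obj' header up to 'endobj'."""
    begin = offsets[num]
    return data[begin:data.index(b"endobj", begin)]


def has_embedded_files(data: bytes) -> bool:
    """Follow trailer /Root -> catalog -> /Names and look for /EmbeddedFiles."""
    offsets, trailer = read_xref_table(data)
    root = re.search(rb"/Root\s+(\d+)\s+\d+\s+R", trailer)
    if root is None:
        return False
    catalog = object_bytes(data, offsets, int(root.group(1)))
    names = re.search(rb"/Names\s*(?:(\d+)\s+\d+\s+R|<<)", catalog)
    if names is None:
        return False
    if names.group(1):   # /Names is an indirect reference: one more jump
        name_tree = object_bytes(data, offsets, int(names.group(1)))
    else:                # /Names dictionary is written inline in the catalog
        name_tree = catalog[names.start():]
    return b"/EmbeddedFiles" in name_tree
```

The sketch takes the whole file as bytes for simplicity; a real detector
would seek to each offset and read only a small window there.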

Example of a PDF Catalog dictionary with embedded files:
16 0 obj
<<
  /Type/Catalog
  /Pages 6 0 R
  /Names 
  <<
    /EmbeddedFiles 
    <<
      /Names [<FEFF004F0072006900670069006E0061006C002E006F00640074> 2 0 R]
    >>
  >>
  /OpenAction[3 0 R /XYZ null null 0]
  /Lang(en-GB)
>>
endobj
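
The key in the /Names array above is a hex string with a UTF-16BE byte order
mark (FEFF), and it decodes to the "Original.od*" style of name mentioned
earlier. A small Python sketch of the decoding (the function name is invented
for the example, and the non-BOM branch only approximates PDFDocEncoding with
Latin-1):

```python
from fnmatch import fnmatchcase


def decode_pdf_hex_string(token: str) -> str:
    """Decode a PDF hex string such as '<FEFF004F...>' to text."""
    digits = "".join(token.strip().strip("<>").split())
    if len(digits) % 2:
        digits += "0"          # PDF pads a missing final digit with 0
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Approximation: PDFDocEncoding is not identical to Latin-1.
    return raw.decode("latin-1")


name = decode_pdf_hex_string("<FEFF004F0072006900670069006E0061006C002E006F00640074>")
print(name)                               # Original.odt
print(fnmatchcase(name, "Original.od*"))  # True
```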


> Hmm so we don't - we really should have a test for each of the hybrid types
> we support to make sure we can read it!  We've already got a couple in
> there, but not any of the new types (which we can't read!)

As I said you can just go into the PDF viewer and open the embedded document
there. I think this should be the preferred way of doing it anyway, so I don't
think it's such a big problem, but we need to make users aware that they can
access the original document in this way as well, and this is probably the
biggest issue.

> Also:
>   - Do we have a 'spec' of hybrid anywhere?

No.

>   - If we change it what's the rules on compatibility for old docs?

Not breaking the compatibility, so reading /AdditionalStreams should still be
supported. 

Writing /AdditionalStreams is probably something we can stop doing at some
point (for plain PDFs).

>   - My guess for DocumentChecksum had been that it was intended to stop the
> pdf and the original content getting out of sync - but that would want a
> checksum of the embedded document, not the whole file wouldn't it?

All I can see is that we check that the PDF file is not corrupted by checking
the hashes - I don't think it has any purpose except maybe to prevent opening
the source document if the PDF was modified (for example, an annotation was
added), which may not actually make sense. I would remove that completely.

>   - If we needed to check the embedded file name and/or the  AFRelationship
> you mentioned, is that doable on a file prior to decryption?

The PDF document structure is not completely encrypted - you can still see
most of the structure, but not the content of streams, so you can see that
there is an embedded file and the type of the file.

>   - My relatively easy way of fixing the PDF2 encrypted hybrids is to do the
> same thing I did for PDF2 encrypted non-hybrid, and use Poppler to do the
> hybrid extraction; then we can lose our extra PDF parser code.

The only problem is that Poppler is optional, so it might not be available,
which means this functionality would not be available either. But then the
issue in general is that this is implemented inside the PDF import code and not
as an independent filter, which could be done easily.

Also, we mostly use PDFium in core nowadays for various tasks, and I think it
is much more likely than Poppler, which is only used for PDF import, to become
available by default.
