Re: [basex-talk] creating epub and odf with basex

Jos van den Oever Tue, 08 Sep 2020 05:07:20 -0700

On dinsdag 8 september 2020 13:06:19 CEST Christian Grün wrote:
> > Here is an example that creates a new archive that uses
> > compression-level="0" and algorithm="stored" and still compresses that
> > entry.
> > 
> > Note that the archive level option 'algorithm' is unfortunate because
> > often it is only single entries such as 'mimetype' or images that should
> > not be compressed.
> 
> Thanks for the example. – My observation is that the entry is indeed
> archived uncompressed if you choose compression-level="0"; but I think
> what you are saying is that an uncompressed DEFLATE entry is not the
> same as an uncompressed STORED entry, right, and that ODS and ePub
> files require certain files to be stored with the STORED algorithm, is
> that right?


The thing that counts is that you can read the mimetype entry name and contents 
without decompression starting from byte 30. That way tools such as 'file' can 
report the mimetype.
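
A rough sketch of that check in Python, in case it helps. This is only an 
illustration of the fixed-offset idea, not what 'file' does internally; the 
function name sniff_mimetype is made up, and it assumes the sizes are written 
in the local file header (no data descriptor).

```python
import struct
from typing import Optional

def sniff_mimetype(data: bytes) -> Optional[bytes]:
    """Return the media type if 'data' starts with a stored 'mimetype' entry."""
    # Local file header: signature at 0, method at 8, compressed size at 18,
    # name length at 26, extra length at 28, entry name from byte 30.
    if len(data) < 38 or data[:4] != b"PK\x03\x04":
        return None
    (method,) = struct.unpack_from("<H", data, 8)
    (size,) = struct.unpack_from("<I", data, 18)
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    if method != 0 or extra_len != 0 or data[30:30 + name_len] != b"mimetype":
        return None
    # Stored entry: the contents follow the name directly, at byte 38.
    return data[38:38 + size]
```

Anything that is not method 0 (stored) is rejected, because then the bytes 
after the name are no longer the plain contents.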

The file generated with the attached script in BaseX 9.4.3 beta gives this:

$ file -i test.epub
test.epub: application/octet-stream; charset=binary
$ unzip -vl test.epub
Archive:  test.epub
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      20  Defl:N       25 -25% 09-08-2020 13:54 2cab616f  mimetype
--------          -------  ---                            -------
      20               25 -25%                            1 file
$ hexdump -C test.epub | head -4
00000000  50 4b 03 04 14 00 08 08  08 00 d9 6e 28 51 00 00  |PK.........n(Q..|
00000010  00 00 00 00 00 00 00 00  00 00 08 00 00 00 6d 69  |..............mi|
00000020  6d 65 74 79 70 65 01 14  00 eb ff 61 70 70 6c 69  |metype.....appli|
00000030  63 61 74 69 6f 6e 2f 65  70 75 62 2b 7a 69 70 50  |cation/epub+zipP|

There are 5 bytes between 'mimetype' and 'application/epub+zip'. These are the 
header of a deflate stored block. If the entry is 'stored', there are no bytes 
between the entry name and the contents, so the zip will be recognized by the 
epub and ODF applications, and it takes less space than when it is deflated 
with compression-level 0.
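
To make those 5 bytes concrete, here is a small Python sketch that decodes 
them according to the stored-block layout in RFC 1951. The helper name 
stored_block_header is only for illustration; the byte values are the ones 
from the hexdump above.

```python
import struct
import zlib

MIMETYPE = b"application/epub+zip"

# The 25 bytes that follow the entry name in the hexdump above.
deflated = bytes.fromhex("011400ebff") + MIMETYPE

def stored_block_header(raw: bytes):
    """Decode the 5-byte header of a deflate 'stored' block (RFC 1951)."""
    bfinal = raw[0] & 1          # 1 = last block of the stream
    btype = (raw[0] >> 1) & 3    # 0 = stored (no compression)
    # LEN and its one's complement NLEN, both little-endian 16-bit values.
    length, nlength = struct.unpack_from("<HH", raw, 1)
    return bfinal, btype, length, nlength

bfinal, btype, length, nlength = stored_block_header(deflated)
print(bfinal, btype, length, hex(nlength))  # prints: 1 0 20 0xffeb

# Raw deflate (wbits=-15) recovers the contents, so the entry is valid,
# just not readable at a fixed offset.
assert zlib.decompress(deflated, -15) == MIMETYPE
```

So the data is not compressed at all, but it is still wrapped in a deflate 
block, which accounts for the 25 bytes reported by unzip for a 20 byte file.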

> The Archive Module has a long history, and was initially based on a
> proposal for the Zorba XQuery Processor back in 2012. I don’t actually
> remember why the algorithm option was not adopted for the single
> archive entries; maybe that would have been more reasonable. As we
> seem to be the only implementation left today, we could think about
> changing that. I doubt anyway that people will use different
> compression levels for single archive entries (apart from archiving
> them uncompressed), so it might be a better solution to define one
> global compression level for the whole archive.

From a practical point of view (regardless of what is in the specification) it 
makes sense to store 'mimetype' uncompressed and also to use 'stored' for files 
such as png and jpg that are already compressed. If that can be achieved 
easily: great, but at least it should be possible. I think the simplest 
solution is to write entries with compression-level=0 as stored.
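
For comparison, this is roughly what per-entry selection looks like with 
Python's zipfile module. It is only a sketch of the behaviour I have in mind, 
not a proposal for the BaseX API; build_epub and the suffix list are made up 
for the example.

```python
import io
import zipfile
from typing import Dict

MIMETYPE = b"application/epub+zip"
# Hypothetical list of suffixes for data that is already compressed.
ALREADY_COMPRESSED = (".png", ".jpg", ".jpeg")

def build_epub(entries: Dict[str, bytes]) -> bytes:
    """Write 'mimetype' first and stored; choose the method per entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # First entry, stored: name at byte 30, contents at byte 38.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name, data in entries.items():
            if name.lower().endswith(ALREADY_COMPRESSED):
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            zf.writestr(name, data, compress_type=method)
    return buf.getvalue()

archive = build_epub({
    "META-INF/container.xml": b"<container/>",
    "cover.png": b"\x89PNG placeholder bytes",
})
```

With an archive written like this, the media type sits directly after the 
entry name, and the image is not run through deflate a second time.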

Best regards,
Jos
