[basex-talk] superfluous shape="area" when using html:parse()
Hi all,

When loading a document with html:parse(), an extra attribute is added to every element. becomes

This error is even shown in the example on the wiki:
https://docs.basex.org/wiki/HTML_Module

It turns out this behaviour can be avoided by using the 'nodefaults' option of TagSoup:

    html:doc($uri, map { 'nodefaults': true() })

That's a lot faster than removing these attributes from the loaded document.

Jos
[basex-talk] creating epub and odf with basex
Hello all,

As you might know, epub files and ODF files are zip files with specific contents. BaseX supports the expath zip module and could in theory be used for creating these files if it were not for a missing simple feature.

There is one rule for epub and ODF files that cannot be followed by BaseX at the moment: the first file in the zip container should be named 'mimetype' and is a plain text file that contains the mimetype string. This is meant to allow applications to read the mimetype at a fixed offset in the file and without doing decompression.

In unzip -vl it looks like this:

 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      20  Stored       20   0% 10-14-2018 05:57 2cab616f  mimetype

Here is an XQuery to create a file with just that entry:

```xquery
declare namespace zip = "http://expath.org/ns/zip";

let $zip := {"application/epub+zip"}
return zip:zip-file($zip)
```

BaseX does not support the 'compressed' option. Without that option the file 'mimetype' is stored in compressed form and cannot be used by applications to quickly determine the mimetype of the file.

Modifying the xml in an existing epub or ODF with zip:update-entries is also not possible because the mimetype file is still compressed.

An additional issue: when reading a zip file, the entries in the returned element are not in the same order as they are in the zip file. So when modifying an existing file, the mimetype entry has to be moved to the front of the list explicitly.

In short: to make BaseX support the creation of epub and ODF files it should:
- support the 'compressed' attribute
- retain the order of files in the zip file in the returned element.

Best regards,
Jos
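For checking a result independently of BaseX, the 'mimetype first and uncompressed' rule can be sketched in Python with the standard zipfile module. This is only an illustration, not BaseX code: the helper name `build_epub_skeleton` and the second entry are made up for the example.

```python
import io
import zipfile

MIMETYPE = b"application/epub+zip"


def build_epub_skeleton() -> bytes:
    """Return a ZIP whose first entry is an uncompressed 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # First entry: STORED, so its bytes appear verbatim in the archive.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        # Later entries may be compressed; this one is a hypothetical placeholder.
        zf.writestr("META-INF/container.xml", b"<container/>",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


raw = build_epub_skeleton()
# A ZIP local file header is 30 bytes long. It is followed by the entry name
# and, for a stored entry without extra field, directly by the content.
name = raw[30:38]
content = raw[38:38 + len(MIMETYPE)]
print(name, content)
```

Because the entry is stored, a reader can pick up both the name and the MIME type from fixed offsets without touching a decompressor.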
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 09:57:50 CEST Christian Grün wrote:
> Hi Jos,
>
> While the ZIP Module is still part of our distribution, it’s not
> actively maintained anymore, and we generally recommend our users to
> switch to the Archive Module [1]. Providing custom compression levels
> for each archive entry is one of the features that is provided by this
> newer module.

Oh, a shame that the cross-implementation module is not maintained.

The archive module also compresses the 'mimetype' file with this code:

    let $file := "test.ods"
    let $archive := file:read-binary($file)
    let $content := parse-xml(archive:extract-text($archive, "content.xml"))
    let $content := local:change($content, local:add_number_value_type#1)
    let $updated := archive:update($archive, "content.xml", $content)
    return file:write-binary($file, $updated)

Cheers,
Jos

> Hope this helps,
> Christian
>
> [1] https://docs.basex.org/wiki/Archive_Module
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 10:59:37 CEST Christian Grün wrote:
> > Oh, a shame that the cross-implementation module is not maintained.
>
> The Archive Module was supposed to become the new EXPath standard.
> Unfortunately, different versions of that module were specified one
> after another such that the spec that’s currently publicly available
> doesn’t reflect our implementation anymore [1].
>
> I didn’t know that the ZIP Module is still maintained in other
> implementations of XQuery. Is it still popular e.g. in eXist-db?

I've used it in production to create government epub files (law bundles).

> > The archive module also compresses the 'mimetype' file with this code:
>
> When calling archive:update, you can supply more properties with an
> archive:entry element:
>
>   <archive:entry compression-level='8'
>                  encoding='US-ASCII'>hello.txt</archive:entry>

I assumed that files that are not mentioned in the archive:update call or zip:update-entries call would not be touched.

I'll see if this way works.

Cheers,
Jos

> Best,
> Christian
>
> [1] http://expath.org/spec/archive/20130930
Re: [basex-talk] creating epub and odf with basex
To be complete, here is an example to create a file that is recognized as epub:

    $ echo -n application/epub+zip > mimetype
    $ zip -D -X -0 test.epub mimetype
    $ file -i test.epub
    test.epub: application/epub+zip; charset=binary
    $ hexdump -C test.epub | head -4
    00000000  50 4b 03 04 0a 00 00 00  00 00 3d 2f 4e 4d 6f 61  |PK........=/NMoa|
    00000010  ab 2c 14 00 00 00 14 00  00 00 08 00 00 00 6d 69  |.,............mi|
    00000020  6d 65 74 79 70 65 61 70  70 6c 69 63 61 74 69 6f  |metypeapplicatio|
    00000030  6e 2f 65 70 75 62 2b 7a  69 70 50 4b 01 02 1e 03  |n/epub+zipPK....|
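The fields in that hexdump can be decoded with a few lines of Python. This is a sketch outside BaseX: the bytes are copied from the hexdump of the working test.epub, while the names `HEAD` and `LOCAL_HEADER` are made up here.

```python
import struct
import zlib

# First 58 bytes of the working test.epub, copied from the hexdump.
HEAD = bytes.fromhex(
    "50 4b 03 04 0a 00 00 00 00 00 3d 2f 4e 4d 6f 61"
    "ab 2c 14 00 00 00 14 00 00 00 08 00 00 00 6d 69"
    "6d 65 74 79 70 65 61 70 70 6c 69 63 61 74 69 6f"
    "6e 2f 65 70 75 62 2b 7a 69 70"
)

# Fixed 30-byte ZIP local file header, all fields little-endian.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
(sig, version, flags, method, mtime, mdate,
 crc, csize, usize, nlen, xlen) = LOCAL_HEADER.unpack(HEAD[:30])

name = HEAD[30:30 + nlen]
start = 30 + nlen + xlen
data = HEAD[start:start + csize]

print("method:", method)            # 0 means STORED, 8 means DEFLATE
print("crc-32: %08x" % crc)
print("name:", name, "at byte 30")
print("data:", data, "at byte", start)
print("crc matches data:", zlib.crc32(data) == crc)
```

With method 0 and an empty extra field, the content follows the entry name directly, which is what lets tools read the MIME type without decompression.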
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 11:05:45 CEST Jos van den Oever wrote:
> I assumed that files that are not mentioned in the archive:update call or
> zip:update-entries call would not be touched.
>
> I'll see if this way works.

Calling with compression-level="0" still compresses the file. And because a call with update is done, the entire zip needs to be rewritten while taking care that 'mimetype' is the first entry, even though the archive spec says "The relative order of all the existing and replaced entries within the archive is preserved."

This example demonstrates that compression-level="0" does not do what the api promises:

```xquery
let $file := "test.ods"
let $archive := file:read-binary($file)
let $mimetype := archive:extract-text($archive, "mimetype")
let $content_xml := fn:parse-xml(archive:extract-text($archive, "content.xml"))
let $content_xml := local:change($content_xml, local:add_number_value_type#1)
let $entries := (
  <archive:entry compression-level="0">{"mimetype"}</archive:entry>,
  <archive:entry>{"content.xml"}</archive:entry>
)
let $contents := ($mimetype, fn:serialize($content_xml))
let $updated := archive:update($archive, $entries, $contents)
return file:write-binary($file, $updated)
```

On the archive spec: the example in '3.1 Creating a simple EPUB document' is not valid XQuery and does not match the description of the function.

Best regards,
Jos
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 11:57:16 CEST Christian Grün wrote:
> > This example demonstrates that compression-level="0" does not do what
> > the api promises:
>
> I can have a closer look into that. Could you possibly provide me with
> a little self-contained example that I can run out of the box?

Here is an example that creates a new archive that uses compression-level="0" and algorithm="stored" and still compresses that entry.

Note that the archive level option 'algorithm' is unfortunate because often it is only single entries such as 'mimetype' or images that should not be compressed. The algorithm should be 'stored' for every entry that has compression-level="0".

```xquery
declare namespace file = "http://expath.org/ns/file";
declare namespace archive = "http://basex.org/modules/archive";

(: Create a zip file with one uncompressed file :)
let $file := "test.epub"
let $mimetype := "application/epub+zip"
let $entries := (
  <archive:entry compression-level="0">{"mimetype"}</archive:entry>
)
let $contents := ($mimetype)
let $zip := archive:create($entries, $contents,
  map { "format": "zip", "algorithm": "stored" }
)
return file:write-binary($file, $zip)
```

Best regards,
Jos
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 13:06:19 CEST Christian Grün wrote:
> > Here is an example that creates a new archive that uses
> > compression-level="0" and algorithm="stored" and still compresses that
> > entry.
> >
> > Note that the archive level option 'algorithm' is unfortunate because
> > often it is only single entries such as 'mimetype' or images that should
> > not be compressed.
>
> Thanks for the example. – My observation is that the entry is indeed
> archived uncompressed if you choose compression-level="0"; but I think
> what you are saying is that an uncompressed DEFLATE entry is not the
> same as an uncompressed STORED entry, right, and that ODS and ePub
> files require certain files to be stored with the STORED algorithm, is
> that right?

The thing that counts is that you can read the mimetype entry name and contents without decompression starting from byte 30. That way tools such as 'find' can report the mimetype.

The file generated with the attached script in BaseX 9.4.3 beta gives this:

    $ file -i test.epub
    test.epub: application/octet-stream; charset=binary
    $ unzip -vl test.epub
    Archive:  test.epub
     Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
    --------  ------  ------- ---- ---------- ----- --------  ----
          20  Defl:N       25 -25% 09-08-2020 13:54 2cab616f  mimetype
    --------          -------  ---                            -------
          20               25 -25%                            1 file
    $ hexdump -C test.epub | head -4
    00000000  50 4b 03 04 14 00 08 08  08 00 d9 6e 28 51 00 00  |PK.........n(Q..|
    00000010  00 00 00 00 00 00 00 00  00 00 08 00 00 00 6d 69  |..............mi|
    00000020  6d 65 74 79 70 65 01 14  00 eb ff 61 70 70 6c 69  |metype.....appli|
    00000030  63 61 74 69 6f 6e 2f 65  70 75 62 2b 7a 69 70 50  |cation/epub+zipP|

There are 5 bytes between 'mimetype' and 'application/epub+zip'. These are deflate information. If the entry is 'stored' there are no bytes between the entry name and the contents, and the zip will be recognized by the epub and ODF applications (and use less space than when it is deflated with compression-level 0).

> The Archive Module has a long history, and was initially based on a
> proposal for the Zorba XQuery Processor back in 2012. I don’t actually
> remember why the algorithm option was not adopted for the single
> archive entries; maybe that would have been more reasonable. As we
> seem to be the only implementation left today, we could think about
> changing that. I doubt anyway that people will use different
> compression levels for single archive entries (apart from archiving
> them uncompressed), so it might be a better solution to define one
> global compression level for the whole archive.

From a practical point of view (regardless of what is in the specification) it makes sense to store 'mimetype' uncompressed and also store files such as png and jpg that are already compressed in the 'stored' way. If that can be achieved easily: great, but at least it should be possible. I think the simplest solution is to save compression-level=0 as stored.

Best regards,
Jos
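The five extra bytes are the header of an uncompressed DEFLATE block: one byte with the block flags, then the block length and its one's complement as 16-bit little-endian values. A small Python sketch (using the standard zlib module, not BaseX) shows where the bytes `01 14 00 eb ff` from the hexdump come from; the variable names are only illustrative.

```python
import zlib

data = b"application/epub+zip"

# Raw DEFLATE stream (negative wbits: no zlib wrapper) at compression level 0.
comp = zlib.compressobj(0, zlib.DEFLATED, -15)
deflated = comp.compress(data) + comp.flush()

header, payload = deflated[:5], deflated[5:]
final_and_type = header[0]                          # BFINAL bit plus BTYPE=00
length = int.from_bytes(header[1:3], "little")      # LEN: number of raw bytes
nlength = int.from_bytes(header[3:5], "little")     # NLEN: one's complement of LEN

print("block header:", header.hex())
print("LEN:", length, "NLEN: %04x" % nlength)
print("deflated size:", len(deflated), "stored size:", len(data))
```

So a level-0 DEFLATE entry holds the same text, but shifted by five bytes and marked with method 8, which is enough to defeat tools that look for the MIME type at a fixed offset.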
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 14:27:55 CEST Christian Grün wrote:
> Hi Jos,
>
> > There are 5 bytes between 'mimetype' and 'application/epub+zip'. These are
> > deflate information. If the entry is 'stored' there are no bytes between
> > the entry name […]
>
> Great, so we are talking about the same thing.
>
> > I think the simplest solution is to save compression-level=0 as stored.
>
> That was also my thought. A quick fix caused the following error
> message (similar to what is described here [1])…
>
> > Operation failed: STORED entry missing size, compressed size, or crc-32.
>
> …which means we’ll probably need to set additional values before
> writing the actual byte array. I’ll see what we can do.
>
> I was surprised to learn more about the deficiencies of the Archive
> Module. The module was already used many times in the past to create
> ePub files, so my guess would be that these files could be opened by
> many readers, but were not 100% valid. How do you usually proceed to
> check the validity of ePub files?
>
> [1] https://stackoverflow.com/questions/1206970/how-to-create-uncompressed-zip-archive-in-java

I think many, but not all, tools are forgiving. Especially tools that scan many files, such as file explorers, will rely on 'magic bytes'. Most epub files comply with having a 'stored' mimetype as the first file. From a local collection from various sources 138 out of 159 files comply. For ODF, compliance with this rule is almost universal. EPub actually adopted the practice of a stored mimetype from ODF.

For validation of epub files you can use the w3c validator:
https://github.com/w3c/epubcheck

For ODF you can use the ODF validator:
https://odftoolkit.org/conformance/ODFValidator.html

Best regards,
Jos
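A quick compliance check of the kind used for such a collection count could look like the following Python sketch. It is an assumption about how one might test the rule, not the script that produced the numbers above; `stored_mimetype` and `make_zip` are hypothetical helper names.

```python
import io
import struct
import zipfile
from typing import Optional


def stored_mimetype(raw: bytes) -> Optional[str]:
    """Return the MIME type if 'mimetype' is the first entry and is STORED."""
    if len(raw) < 38 or raw[:4] != b"PK\x03\x04":
        return None
    (method,) = struct.unpack_from("<H", raw, 8)      # compression method
    (csize,) = struct.unpack_from("<I", raw, 18)      # compressed size
    nlen, xlen = struct.unpack_from("<HH", raw, 26)   # name and extra lengths
    if method != 0 or raw[30:30 + nlen] != b"mimetype":
        return None
    start = 30 + nlen + xlen
    return raw[start:start + csize].decode("ascii", errors="replace")


def make_zip(compress_type: int) -> bytes:
    """Build a one-entry archive for trying out the check."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=compress_type)
    return buf.getvalue()


print(stored_mimetype(make_zip(zipfile.ZIP_STORED)))
print(stored_mimetype(make_zip(zipfile.ZIP_DEFLATED)))
```

The check only looks at the first local file header, which mirrors what a 'magic bytes' scanner does. It does not replace epubcheck or the ODF validator.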
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 14:40:34 CEST Christian Grün wrote:
> I’ve updated the code; if the compression level is set to 0, entries
> will be STORED [1]. Feel free to check out the latest snapshot [2].

Creating a new epub or odf file works correctly now but archive:update() does not retain the 'stored' property for the manifest file. (It does retain the order of the entries.)

Here is an example script that takes 'test.epub' as input.

```xquery
declare namespace file = "http://expath.org/ns/file";
declare namespace archive = "http://basex.org/modules/archive";

(: Update a zip file.
   Currently, this will change the 'stored' entries to 'deflate',
   breaking mimetype recognition. :)
let $file := "test.epub"
let $archive := file:read-binary($file)
let $updated := archive:update($archive, (), ())
return file:write-binary($file, $updated)
```

Best regards,
Jos

> [1] https://github.com/BaseXdb/basex/commit/67ad584a85e0848432e19b4f587fbabfc2fc38e5
> [2] https://files.basex.org/releases/latest/
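What an update is expected to preserve can be sketched in Python with the standard zipfile module. This is not how BaseX implements archive:update; `update_entries` and `build_sample` are hypothetical helpers that only illustrate the two properties discussed here: entry order and per-entry compression method.

```python
import io
import zipfile


def update_entries(raw: bytes, changes: dict) -> bytes:
    """Rewrite a ZIP, replacing some entries but keeping order and method."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(raw)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        # infolist() returns entries in the order of the central directory.
        for info in src.infolist():
            if info.filename in changes:
                data = changes[info.filename]
            else:
                data = src.read(info)
            fresh = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            fresh.compress_type = info.compress_type  # STORED stays STORED
            fresh.external_attr = info.external_attr
            dst.writestr(fresh, data)
    return out.getvalue()


def build_sample() -> bytes:
    """Stored 'mimetype' first, followed by a deflated content.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", b"application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", b"<old/>",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


updated = update_entries(build_sample(), {"content.xml": b"<new/>"})
with zipfile.ZipFile(io.BytesIO(updated)) as zf:
    listing = [(i.filename, i.compress_type) for i in zf.infolist()]
    new_content = zf.read("content.xml")
print(listing, new_content)
```

Copying the compression method from the old entry to the new one is the step that keeps 'mimetype' readable at its fixed offset after the rewrite.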
Re: [basex-talk] creating epub and odf with basex
Thank you for making the improvements. This is much cleaner imho than bash + zip + xsltproc. :-)

On Tuesday 8 September 2020 16:59:14 CEST Christian Grün wrote:
> > Creating a new epub or odf file works correctly now but
> > archive:update() does not retain the 'stored' property for the manifest
> > file. (It does retain the order of the entries.)
>
> The stored property will now be retained if an archive is updated.
>
> Up to now, archive:update removed existing update candidates from the
> archive and added new entries at the end. I changed this as well: If
> existing files are updated, the original order will be preserved, and
> new files will be added in the order in which they were supplied by
> the user.
>
> > EPub actually adopted the practice of a stored mimetype from ODF.
>
> I didn’t know that. And thanks for the link to the W3C epub checker
> (which I actually used by myself a long time ago) and the ODF checker.

Since both are java they even fit in the basex environment.

> BaseX 9.4.3 is scheduled to be released later this week.
Re: [basex-talk] creating epub and odf with basex
On Tuesday 8 September 2020 19:31:41 CEST Liam R. E. Quin wrote:
> On Tue, 2020-09-08 at 17:07 +0200, Jos van den Oever wrote:
> > Thank you for making the improvements. This is much cleaner imho than
> > bash + zip + xsltproc. :-)
>
> A minor addition - i've sometimes started with a base zip file with the
> uncompressed "mimetype" entry in it, and just added the rest to that
> file from XQuery or XSLT, without problems.

That's a fine solution, but in this case that did not work because it 'upgrades' the mimetype file to be compressed.
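The 'start from a base zip' approach works when the tool appends to the archive instead of rewriting it. As an illustration outside BaseX, Python's zipfile opened in append mode leaves existing entries untouched; the entry name content.xml and its content are placeholders.

```python
import io
import zipfile

# Base archive that contains only the uncompressed 'mimetype' entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as base:
    base.writestr("mimetype", b"application/epub+zip",
                  compress_type=zipfile.ZIP_STORED)

# Append mode adds new entries after the existing ones and rewrites only
# the central directory, so the first local header stays where it was.
with zipfile.ZipFile(buf, "a") as zf:
    zf.writestr("content.xml", b"<doc/>", compress_type=zipfile.ZIP_DEFLATED)

raw = buf.getvalue()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    names = zf.namelist()
print(names, raw[30:58])
```

A function that rebuilds the whole archive, as an update call may do, has to copy the stored method explicitly to get the same result.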
Re: [basex-talk] vanishing whitespace with fn:doc()
Hi Christian,

Yes, writing 'CHOP=OFF' in .basex stops the vanishing of whitespace.

But where in the XQuery or XDM spec does it say that whitespace handling when parsing is implementation dependent?

Cheers,
Jos

On Tuesday 16 February 2021 22:10:30 CET Christian Grün wrote:
> Hi Jos,
>
> Whitespaces will be preserved if the CHOP option is disabled. You can make
> this a default by adding CHOP=false in your .basex configuration file [1,2].
>
> Hope this helps,
> Christian
>
> [1] https://docs.basex.org/wiki/Full-Text#Mixed_Content
> [2] https://docs.basex.org/wiki/Configuration
[basex-talk] vanishing whitespace with fn:doc()
Dear all,

First off: BaseX is great to work with. I use it for a few statically generated websites.

But I recently found what might be a bug.

Some whitespace vanishes when loading xml files. E.g. this xml file:

```test.xml
a b c d e
```

run like this:

    doc('test.xml')

gives:

    a bcd e

But running this:

```
parse-xml(' a b c d e ')
```

retains the whitespace.

I've tested this with BaseX 7.0, 8.0, 9.0 and 9.4.6.

Running this in saxon-he-10.3.jar retains the whitespace.

I can work around this issue by placing xml:space="preserve" in the document element.

I cannot come up with a scenario in which discarding whitespace during parsing is ok when no DTD or XML Schema is provided.

Best regards,
Jos
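The effect can be imitated outside BaseX with a short Python sketch. It rests on assumptions: the element names in `SAMPLE` are invented because the markup of test.xml is not visible above, and the function only drops whitespace-only text nodes, which is enough to reproduce the reported text but may not match every rule BaseX applies.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content document; the real test.xml markup is unknown.
SAMPLE = "<p>a <b>b</b> <c>c</c> <d>d</d> e</p>"


def drop_whitespace_only(root: ET.Element) -> ET.Element:
    """Remove text nodes that consist of whitespace only."""
    for el in root.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root


kept = ET.fromstring(SAMPLE)
chopped = drop_whitespace_only(ET.fromstring(SAMPLE))

print("".join(kept.itertext()))
print("".join(chopped.itertext()))
print(ET.tostring(chopped, encoding="unicode"))
```

In mixed content the spaces between inline elements carry meaning, so removing those nodes changes the text of the document and not just its layout.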
Re: [basex-talk] vanishing whitespace with fn:doc()
Thanks for the context.

Still, it does not explain the difference in behavior between doc() and parse-xml().

As far as I understand the XDM specification, whitespace may be ignored by the parser if there is a DTD or XML Schema that says that an element is not PCDATA (DTD) or mixed (XML Schema). In the absence of (support for) schemas, all whitespace should be left in. Wendell Piez has written about this in detail.

Whitespace in XML is tricky. E.g. indenting XML cannot be done well without knowing which elements are PCDATA/mixed.

Now that I know about the CHOP option, I can use BaseX predictably. And the legacy reasons for keeping it set are understandable.

Best regards,
Jos

On Tuesday, 16 February 2021 23:10:05 CET Christian Grün wrote:
> There is an old (and still open) issue on GitHub [1] that might give you
> some more insight into the history of whitespace chopping in BaseX.
>
> Hope this helps
> Christian
>
> [1] https://github.com/BaseXdb/basex/issues/913
>
> Jos van den Oever wrote on Tue, 16 Feb 2021, 22:41:
> > Hi Christian,
> >
> > Yes, writing 'CHOP=OFF' in .basex stops the vanishing of whitespace.
> >
> > But where in the XQuery or XDM spec does it say that whitespace handling
> > when parsing is implementation dependent?
> >
> > Cheers,
> > Jos
> >
> > On Tuesday, 16 February 2021 22:10:30 CET Christian Grün wrote:
> > > Hi Jos,
> > >
> > > Whitespaces will be preserved if the CHOP option is disabled. You can
> > > make this a default by adding CHOP=false in your .basex configuration
> > > file [1,2].
> > >
> > > Hope this helps,
> > > Christian
> > >
> > > [1] https://docs.basex.org/wiki/Full-Text#Mixed_Content
> > > [2] https://docs.basex.org/wiki/Configuration
> > >
> > > Jos van den Oever wrote on Tue, 16 Feb 2021, 22:00:
> > > > Dear all,
> > > >
> > > > First off: BaseX is great to work with. I use it for a few statically
> > > > generated websites.
> > > >
> > > > But I recently found what might be a bug.
> > > >
> > > > Some whitespace vanishes when loading xml files. E.g. this xml file:
> > > >
> > > > ```test.xml
> > > > a b c d e
> > > > ```
> > > >
> > > > run like this:
> > > >
> > > > doc('test.xml')
> > > >
> > > > gives:
> > > >
> > > > a bcd e
> > > >
> > > > But running this:
> > > >
> > > > ```
> > > > parse-xml(' a b c d e ')
> > > > ```
> > > >
> > > > retains the whitespace.
> > > >
> > > > I've tested this with BaseX 7.0, 8.0, 9.0 and 9.4.6.
> > > >
> > > > Running this in saxon-he-10.3.jar retains the whitespace.
> > > >
> > > > I can work around this issue by placing xml:space="preserve" in the
> > > > document element.
> > > >
> > > > I cannot come up with a scenario in which discarding whitespace during
> > > > parsing is ok when no DTD or XML Schema is provided.
> > > >
> > > > Best regards,
> > > > Jos
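A minimal sketch of how the behaviour reported above could be checked: count the whitespace-only text nodes that survive parsing. The markup of the original test.xml did not survive in this archive, so the sample string below is hypothetical, and `local:ws-nodes` is only an illustrative helper name. The query uses standard XQuery 3.1 functions only.

```xquery
(: Count the whitespace-only text nodes below a document node. :)
declare function local:ws-nodes($doc as document-node()) as xs:integer {
  count($doc//text()[normalize-space(.) = ''])
};

(: Hypothetical sample document; not the markup from the original report. :)
let $xml := '<a> <b>b</b> <c>c</c> <d>d</d> e</a>'
return local:ws-nodes(parse-xml($xml))
```

Saving the same string to a file and calling `local:ws-nodes(doc('test.xml'))` instead should give the same count if whitespace is preserved; a lower count would point to whitespace chopping.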
Re: [basex-talk] vanishing whitespace with fn:doc()
Then to pass the XQuery test suite you probably use CHOP=OFF. Are there other settings needed to be compliant?

On Wednesday, 17 February 2021 00:04:38 CET Christian Grün wrote:
> Yes, you are certainly right. I think it was around 2007 when we chopped
> whitespaces by default, although we knew it didn't comply with the
> specification. One reason was that we rarely worked with mixed-content data
> at that time, and the whitespace indentations increased the size of
> databases and led to worse rendering results in the built-in visualizations
> (our first users were confused about that).
>
> Maybe we’ll switch the default in a future version of BaseX.
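Besides the .basex entry mentioned in the thread, the option can apparently also be set for a single query in the prolog. This is a sketch based on the BaseX 9.x options documentation as I recall it, not something confirmed in this thread; it assumes a 9.x release where the option is called CHOP and a test.xml file in the working directory.

```xquery
(: Assumption: BaseX 9.x, where whitespace chopping is controlled by CHOP. :)
(: Disable chopping for this query only. :)
declare option db:chop 'false';

(: test.xml is the file name used in the original report. :)
doc('test.xml')
```

Setting it per query keeps the global default untouched, which may be preferable when other databases on the same installation rely on chopped whitespace.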