[basex-talk] superfluous shape="area' when using html:parse()

2020-08-30 Thread Jos van den Oever
Hi all,

When loading a document with html:parse(), an extra attribute is added to 
every  element.

  becomes 

This error is even shown in the example on the wiki:
  https://docs.basex.org/wiki/HTML_Module

It turns out this behaviour can be avoided by using the 'nodefaults' option of 
TagSoup:

 html:doc($uri, map { 'nodefaults': true() })

That's a lot faster than removing these attributes from the loaded document.

⤳Jos




[basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
Hello all,

As you might know, epub files and ODF files are zip files with specific 
contents. BaseX supports the expath zip module and could in theory be used for 
creating these files if it were not for a missing simple feature.

There is one rule for epub and ODF files that cannot be followed by BaseX at 
the moment: the first file in the zip container should be named 'mimetype' and 
is a plain text file that contains the mimetype string. This is meant to allow 
applications to read the mimetype at a fixed offset in the file and without 
doing decompression.
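
To make the "fixed offset" point concrete, here is a minimal sketch in Python (not BaseX) of how a tool might sniff the mimetype without a zip library. The function name sniff_mimetype and the sample archives are made up for illustration; the offsets are those of the zip local file header.

```python
import io
import struct
import zipfile


def sniff_mimetype(data: bytes):
    """Return the mimetype stored at the start of a zip container, or None."""
    if len(data) < 30 or data[:4] != b"PK\x03\x04":
        return None
    # Local file header: method at offset 8, compressed size at 18,
    # name length at 26, extra-field length at 28, name starts at 30.
    (method,) = struct.unpack_from("<H", data, 8)
    (comp_size,) = struct.unpack_from("<I", data, 18)
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    if data[30:30 + name_len] != b"mimetype" or method != 0:
        return None  # not first, or not STORED: cannot be read directly
    start = 30 + name_len + extra_len
    return data[start:start + comp_size].decode("ascii")


def build(compression):
    """Build a one-entry archive in memory (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("mimetype", "application/epub+zip", compress_type=compression)
    return buf.getvalue()


stored = build(zipfile.ZIP_STORED)
deflated = build(zipfile.ZIP_DEFLATED)
print(sniff_mimetype(stored))    # the mimetype string
print(sniff_mimetype(deflated))  # None: the entry is compressed
```

With a deflated entry the sniffer gives up, which is what happens to tools that rely on the fixed offset.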

In unzip -vl it looks like this:

 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      20  Stored       20   0% 10-14-2018 05:57 2cab616f  mimetype

Here is an XQuery to create a file with just that entry:

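```xquery
declare namespace zip = "http://expath.org/ns/zip";

(: The element markup was stripped by the list archive; this is a
   reconstruction following the EXPath ZIP Module. The href value is a
   placeholder. :)
let $zip :=
  <zip:file href="test.epub">
    <zip:entry name="mimetype" compressed="false" method="text">
      {"application/epub+zip"}
    </zip:entry>
  </zip:file>
return zip:zip-file($zip)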
```

BaseX does not support the 'compressed' option. Without that option the file 
'mimetype' is stored in compressed form and cannot be used by applications to 
quickly determine the mimetype of the file.

Modifying the xml in an existing epub or ODF with zip:update-entries is also 
not possible because the mimetype file is still compressed.

An additional issue: when reading a zip file, the entries in <zip:file> are 
not in the same order as they are in the zip file. So when modifying an 
existing file, the mimetype entry has to be moved to the front of the list 
explicitly.

In short: to make BaseX support the creation of epub and ODF files it should:
 - support the 'compressed' attribute
 - retain the order of files in the zip file in the <zip:file> element.

Best regards,
Jos




Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 09:57:50 CEST Christian Grün wrote:
> Hi Jos,
> 
> While the ZIP Module is still part of our distribution, it’s not
> actively maintained anymore, and we generally recommend our users to
> switch to the Archive Module [1]. Providing custom compression levels
> for each archive entry is one of the features that is provided by this
> newer module.

Oh, a shame that the cross-implementation module is not maintained.

The archive module also compresses the 'mimetype' file with this code:

let $file := "test.ods"
let $archive := file:read-binary($file)
let $content := parse-xml(archive:extract-text($archive, "content.xml"))
let $content := local:change($content, local:add_number_value_type#1)
let $updated := archive:update($archive, "content.xml", $content)
return file:write-binary($file, $updated)

Cheers,
Jos

> 
> Hope this helps,
> Christian
> 
> [1] https://docs.basex.org/wiki/Archive_Module
> 





Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 10:59:37 CEST Christian Grün wrote:
> > Oh, a shame that the cross-implementation module is not maintained.
> 
> The Archive Module was supposed to become the new EXPath standard.
> Unfortunately, different versions of that module were specified one
> after another such that the spec that’s currently publicly available
> doesn’t reflect our implementation anymore [1].
> 
> I didn’t know that the ZIP Module is still maintained in other
> implementations of XQuery. Is it still popular e.g. in eXist-db?

I've used it in production to create government epub files (law bundles).

> > The archive module also compresses the 'mimetype' file with this code:
> When calling archive:update, you can supply more properties with an
> archive:entry element:
> 
> <archive:entry compression-level='8'
>                encoding='US-ASCII'>hello.txt</archive:entry>

I assumed that files that are not mentioned in the archive:update call or 
zip:update-entries call would not be touched.

I'll see if this way works.

Cheers,
Jos

> 
> Best,
> Christian
> 
> [1] http://expath.org/spec/archive/20130930





Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
To be complete, here is an example to create a file that is recognized as 
epub:

$ echo -n application/epub+zip > mimetype
$ zip -D -X -0 test.epub mimetype
$ file -i test.epub
test.epub: application/epub+zip; charset=binary
$ hexdump -C test.epub | head -4
00000000  50 4b 03 04 0a 00 00 00  00 00 3d 2f 4e 4d 6f 61  |PK........=/NMoa|
00000010  ab 2c 14 00 00 00 14 00  00 00 08 00 00 00 6d 69  |.,............mi|
00000020  6d 65 74 79 70 65 61 70  70 6c 69 63 61 74 69 6f  |metypeapplicatio|
00000030  6e 2f 65 70 75 62 2b 7a  69 70 50 4b 01 02 1e 03  |n/epub+zipPK....|
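
For comparison, roughly the same container can be written without the zip command line tool. This is only a sketch in Python, not something BaseX does; the helper name write_container and the META-INF/container.xml stub are made up for illustration.

```python
import io
import zipfile


def write_container(mimetype: str, parts: dict) -> bytes:
    """Write a zip whose first entry is an uncompressed 'mimetype' file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as z:
        # 'mimetype' goes first and is STORED (what `zip -0` does). A bare
        # ZipInfo carries no extra field, similar in effect to `zip -X`.
        z.writestr(zipfile.ZipInfo("mimetype"), mimetype,
                   compress_type=zipfile.ZIP_STORED)
        # Everything else uses the archive default (deflate).
        for name, data in parts.items():
            z.writestr(name, data)
    return buf.getvalue()


epub = write_container(
    "application/epub+zip",
    {"META-INF/container.xml": "<container/>"},  # stub content, not a valid epub
)
# Entry name and contents sit back to back, starting at byte 30.
print(epub[30:58])
```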







Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 11:05:45 CEST Jos van den Oever wrote:
> On Tuesday 8 September 2020 10:59:37 CEST Christian Grün wrote:
> > > Oh, a shame that the cross-implementation module is not maintained.
> > 
> > The Archive Module was supposed to become the new EXPath standard.
> > Unfortunately, different versions of that module were specified one
> > after another such that the spec that’s currently publicly available
> > doesn’t reflect our implementation anymore [1].
> > 
> > I didn’t know that the ZIP Module is still maintained in other
> > implementations of XQuery. Is it still popular e.g. in eXist-db?
> 
> I've used it in production to create government epub files (law bundles).
> 
> > > The archive module also compresses the 'mimetype' file with this code:
> > When calling archive:update, you can supply more properties with an
> > archive:entry element:
> > 
> > <archive:entry compression-level='8'
> >                encoding='US-ASCII'>hello.txt</archive:entry>
> 
> I assumed that files that are not mentioned in the archive:update call or
> zip:update-entries call would not be touched.
> 
> I'll see if this way works.

Calling with compression-level="0" still compresses the file. And because the 
call is an update, the entire zip needs to be rewritten while taking care that 
'mimetype' stays the first entry, even though the archive spec says "The 
relative order of all the existing and replaced entries within the archive is 
preserved." This example demonstrates that compression-level="0" does not do 
what the API promises:

```xquery
let $file := "test.ods"
let $archive := file:read-binary($file)
let $mimetype := archive:extract-text($archive, "mimetype")
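let $content_xml := fn:parse-xml(archive:extract-text($archive, "content.xml"))
let $content_xml := local:change($content_xml, local:add_number_value_type#1)
(: The entry markup was stripped by the list archive; the elements below are
   a reconstruction, and the attributes of the second entry are unknown. :)
let $entries := (
   <archive:entry compression-level="0">{"mimetype"}</archive:entry>,
   <archive:entry>{"content.xml"}</archive:entry>
)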
let $contents := ($mimetype, fn:serialize($content_xml))
let $updated := archive:update($archive, $entries, $contents)
return file:write-binary($file, $updated) 
```

On the archive spec: the example in '3.1 Creating a simple EPUB document' is 
not valid XQuery and does not match the description of the function.

Best regards,
Jos


> > [1] http://expath.org/spec/archive/20130930

Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 11:57:16 CEST Christian Grün wrote:
> > This example demonstrates that compression-level="0" does not do what
> > the api promises:
> 
> I can have a closer look into that. Could you possibly provide me with
> a little self-contained example that I can run out of the box?

Here is an example that creates a new archive that uses compression-level="0" 
and algorithm="stored" and still compresses that entry.

Note that the archive level option 'algorithm' is unfortunate because often it 
is only single entries such as 'mimetype' or images that should not be 
compressed. The algorithm should be 'stored' for every entry that has 
compression-level="0".

```xquery
declare namespace file = "http://expath.org/ns/file";
declare namespace archive = "http://basex.org/modules/archive";

(: Create a zip file with one uncompressed file :)
let $file := "test.epub"
let $mimetype := "application/epub+zip"
let $entries := (
   <archive:entry compression-level="0">{"mimetype"}</archive:entry>
)
let $contents := ($mimetype)
let $zip := archive:create($entries, $contents,
  map { "format": "zip", "algorithm": "stored" }
) 
return file:write-binary($file, $zip)
```

Best regards,
Jos




Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 13:06:19 CEST Christian Grün wrote:
> > Here is an example that creates a new archive that uses
> > compression-level="0" and algorithm="stored" and still compresses that
> > entry.
> > 
> > Note that the archive level option 'algorithm' is unfortumate because
> > often it is only single entries such as 'mimetype' or images that should
> > not be compressed.
> 
> Thanks for the example. – My observation is that the entry is indeed
> archived uncompressed if you choose compression-level="0"; but I think
> what you are saying is that an uncompressed DEFLATE entry is not the
> same as an uncompressed STORED entry, right, and that ODS and ePub
> files require certain files to be stored with the STORED algorithm, is
> that right?

The thing that counts is that you can read the mimetype entry name and contents 
without decompression, starting from byte 30. That way tools such as 'file' can 
report the mimetype.

The file generated with the attached script in BaseX 9.4.3 beta gives this:

$ file -i test.epub
test.epub: application/octet-stream; charset=binary
$ unzip -vl test.epub
Archive:  test.epub
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      20  Defl:N       25 -25% 09-08-2020 13:54 2cab616f  mimetype
--------          -------  ---                            -------
      20               25 -25%                            1 file
$ hexdump -C test.epub | head -4
00000000  50 4b 03 04 14 00 08 08  08 00 d9 6e 28 51 00 00  |PK.........n(Q..|
00000010  00 00 00 00 00 00 00 00  00 00 08 00 00 00 6d 69  |..............mi|
00000020  6d 65 74 79 70 65 01 14  00 eb ff 61 70 70 6c 69  |metype.....appli|
00000030  63 61 74 69 6f 6e 2f 65  70 75 62 2b 7a 69 70 50  |cation/epub+zipP|

There are 5 bytes between 'mimetype' and 'application/epub+zip'. These are the 
header of an uncompressed deflate block. If the entry is 'stored', there are no 
bytes between the entry name and the contents; the zip will then be recognized 
by epub and ODF applications, and it takes less space than when it is deflated 
with compression-level 0.
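
The 5 bytes can be decoded by hand. Below is a small Python sketch (the function name parse_stored_block is made up) that reads them as the header of a deflate "stored" block: one byte with the final-block flag and block type, then the length and its one's complement, both little-endian.

```python
import zlib


def parse_stored_block(header: bytes):
    """Decode the 5-byte header of a byte-aligned deflate 'stored' block."""
    bfinal = header[0] & 1          # 1 = last block in the stream
    btype = (header[0] >> 1) & 3    # 0 = stored (no compression)
    length = int.from_bytes(header[1:3], "little")
    nlength = int.from_bytes(header[3:5], "little")
    if btype != 0 or length ^ nlength != 0xFFFF:
        raise ValueError("not a stored block")
    return bool(bfinal), length


gap = bytes.fromhex("011400ebff")   # the 5 bytes from the hexdump above
payload = b"application/epub+zip"

print(parse_stored_block(gap))      # final block, 20 bytes of literal data
# Header plus payload is a complete raw deflate stream of 25 bytes,
# matching the compressed size of 25 that unzip reports.
print(zlib.decompress(gap + payload, wbits=-15) == payload)
```

So the data is uncompressed either way, but with method 'deflate' a reader still has to skip a block header before it reaches the mimetype string.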

> The Archive Module has a long history, and was initially based on a
> proposal for the Zorba XQuery Processor back in 2012. I don’t actually
> remember why the algorithm option was not adopted for the single
> archive entries; maybe that would have been more reasonable. As we
> seem to be the only implementation left today, we could think about
> changing that. I doubt anyway that people will use different
> compression levels for single archive entries (apart from archiving
> them uncompressed), so it might be a better solution to define one
> global compression level for the whole archive.

From a practical point of view (regardless of what is in the specification) it 
makes sense to store 'mimetype' uncompressed and also store files such as png 
and jpg that are already compressed in the 'stored' way. If that can be 
achieved easily: great, but at least it should be possible. I think the 
simplest solution is to save compression-level=0 as stored.

Best regards,
Jos




Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 14:27:55 CEST Christian Grün wrote:
> Hi Jos,
> 
> > There are 5 bytes between 'mimetype' and 'applicatino/epub+zip'. These are
> > deflate information. If the entry is 'stored' there are no bytes between
> > the entry name […]
> 
> Great, so we are talking about the same thing.
> 
> > I think the simplest solution is to save compression-level=0 as stored.
> 
> That was also my thought. A quick fix caused the following error
> message (similar to what is described here [1])…
> 
> > Operation failed: STORED entry missing size, compressed size, or crc-32.
> 
> …which means we’ll probably need to set additional values before
> writing the actual byte array. I’ll see what we can do.
> 
> I was surprised to learn more about the deficiencies of the Archive
> Module. The module was already used many times in the past to create
> ePub files, so my guess would be that these files could be opened by
> many readers, but were not 100% valid. How do you usually proceed to
> check the validity of ePub files?

I think many, but not all, tools are forgiving. Tools that scan many files, 
such as file explorers, in particular rely on 'magic bytes'. Most epub files 
comply with the rule of having a 'stored' mimetype as the first file: in a 
local collection from various sources, 138 out of 159 files comply.

For ODF, compliance with this rule is almost universal. EPub actually adopted 
the practice of a stored mimetype from ODF.

For validation of epub files you can use the W3C validator, EPUBCheck:
  https://github.com/w3c/epubcheck
For ODF you can use the ODF validator:
  https://odftoolkit.org/conformance/ODFValidator.html

Best regards,
Jos




Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 14:40:34 CEST Christian Grün wrote:
> I’ve updated the code; if the compression level is set to 0, entries
> will be STORED [1]. Feel free to check out the latest snapshot [2].

Creating a new epub or odf file works correctly now, but archive:update() 
does not retain the 'stored' property for the manifest file (it does retain 
the order of the entries). Here is an example script that takes 'test.epub' as 
input.

```xquery
declare namespace file = "http://expath.org/ns/file";
declare namespace archive = "http://basex.org/modules/archive";

(: Update a zip file
   Currently, this will change the 'stored' entries to 'deflate' breaking
   mimetype recognition.
:)
let $file := "test.epub"
let $archive := file:read-binary($file)
let $updated := archive:update($archive, (), ())
return file:write-binary($file, $updated)
```
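
To spell out the behaviour I would expect from an update, here is a sketch in Python (not the BaseX implementation; replace_entry is a made-up name). It rewrites the archive entry by entry, keeps the order, and reuses each entry's original compression method, so a stored 'mimetype' stays stored.

```python
import io
import zipfile


def replace_entry(archive: bytes, name: str, new_data: bytes) -> bytes:
    """Rewrite a zip, replacing one entry and keeping order and methods."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():  # central-directory order
            data = new_data if info.filename == name else src.read(info)
            # Reuse the entry's original method: STORED stays STORED.
            dst.writestr(info.filename, data, compress_type=info.compress_type)
    return out.getvalue()


# Small ODF-like sample archive, built in memory for illustration.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as z:
    z.writestr("mimetype", "application/vnd.oasis.opendocument.spreadsheet",
               compress_type=zipfile.ZIP_STORED)
    z.writestr("content.xml", "<old/>", compress_type=zipfile.ZIP_DEFLATED)

updated = replace_entry(original.getvalue(), "content.xml", b"<new/>")
print(updated[30:38])  # the entry name still starts at byte 30
```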

Best regards,
Jos

> 
> [1] https://github.com/BaseXdb/basex/commit/67ad584a85e0848432e19b4f587fbabfc2fc38e5
> [2] https://files.basex.org/releases/latest/
> 
> On Tue, Sep 8, 2020 at 2:27 PM Christian Grün  
wrote:
> > Hi Jos,
> > 
> > > There are 5 bytes between 'mimetype' and 'applicatino/epub+zip'. These
> > > are
> > > deflate information. If the entry is 'stored' there are no bytes between
> > > the entry name […]
> > 
> > Great, so we are talking about the same thing.
> > 
> > > I think the simplest solution is to save compression-level=0 as stored.
> > 
> > That was also my thought. A quick fix caused the following error
> > message (similar to what is described here [1])…
> > 
> > > Operation failed: STORED entry missing size, compressed size, or crc-32.
> > 
> > …which means we’ll probably need to set additional values before
> > writing the actual byte array. I’ll see what we can do.
> > 
> > I was surprised to learn more about the deficiencies of the Archive
> > Module. The module was already used many times in the past to create
> > ePub files, so my guess would be that these files could be opened by
> > many readers, but were not 100% valid. How do you usually proceed to
> > check the validity of ePub files?
> > 
> > Best,
> > Christian
> > 
> > [1] https://stackoverflow.com/questions/1206970/how-to-create-uncompressed-zip-archive-in-java
Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
Thank you for making the improvements. This is much cleaner imho than bash + 
zip + xsltproc. :-)

On Tuesday 8 September 2020 16:59:14 CEST Christian Grün wrote:
> > Creating a new file epub or odf file works correctly now but
> > archive:update() does not retain the 'stored' property for the manifest
> > file. (It does retain the order of the entries).
> 
> The stored property will now be retained if an archive is updated.
> 
> Up to now, archive:update removed existing update candidates from the
> archive and added new entries at the end. I changed this as well: If
> existing files are updated, the original order will be preserved, and
> new files will be added in the order in which they were supplied by
> the user.
> 
> > EPub actually adopted the practice of a stored mimetype from ODF.
> 
> I didn’t know that. And thanks for the link to the W3C epub checker
> (which I actually used by myself a long time ago) and the ODF checker.

Since both are java they even fit in the basex environment.

> BaseX 9.4.3 is scheduled to be released later this week.




Re: [basex-talk] creating epub and odf with basex

2020-09-08 Thread Jos van den Oever
On Tuesday 8 September 2020 19:31:41 CEST Liam R. E. Quin wrote:
> On Tue, 2020-09-08 at 17:07 +0200, Jos van den Oever wrote:
> > Thank you for making the improvements. This is much cleaner imho than
> > bash +
> > zip + xsltproc. :-)
> 
> A minor addition - i've sometimes started with a base zip file with the
> uncompressed "mimetype" entry in it, and just added the rest to that
> file from XQuery or XSLT, without problems.

That's a fine solution, but in this case it did not work because the update 
'upgraded' the mimetype file to a compressed entry.





Re: [basex-talk] vanishing whitespace with fn:doc()

2021-02-16 Thread Jos van den Oever
Hi Christian,

Yes, writing 'CHOP=OFF' in .basex stops the vanishing of whitespace.

But where in the XQuery or XDM spec does it say that whitespace handling when 
parsing is implementation dependent?

Cheers,
Jos


On Tuesday 16 February 2021 22:10:30 CET Christian Grün wrote:
> Hi Jos,
> 
> Whitespaces will be preserved if the CHOP option is disabled. You can make
> this a default by adding CHOP=false in your .basex configuration file [1,2].
> 
> Hope this helps,
> Christian
> 
> [1] https://docs.basex.org/wiki/Full-Text#Mixed_Content
> [2] https://docs.basex.org/wiki/Configuration





[basex-talk] vanishing whitespace with fn:doc()

2021-02-16 Thread Jos van den Oever
Dear all,

First off: BaseX is great to work with. I use it for a few statically 
generated websites.

But I recently found what might be a bug.

Some whitespace vanishes when loading XML files. E.g. this XML file:

```test.xml
 a b  c  d e 
```

run like this:

doc('test.xml')

gives:

a bcd e

But running this:

```
parse-xml(' a b  c  d e ')
```

retains the whitespace.

I've tested this with BaseX 7.0, 8.0, 9.0 and 9.4.6.

Running this in saxon-he-10.3.jar retains the whitespace.

I can work around this issue by placing xml:space="preserve" in the document 
element.

I cannot come up with a scenario in which discarding whitespace during 
parsing is OK when no DTD or XML Schema is provided.
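
A small check for the difference, as a sketch: it assumes test.xml resolves relative to the query's base URI and uses only standard XQuery 3.0 functions. It parses the same file once through doc() and once through parse-xml() and compares the two trees; a result of false would point to the two code paths treating whitespace differently.

```xquery
(: Assumption: 'test.xml' resolves relative to the static base URI. :)
let $from-doc    := doc('test.xml')
let $from-string := parse-xml(unparsed-text('test.xml'))
(: deep-equal compares text nodes too, so dropped whitespace shows up here. :)
return deep-equal($from-doc, $from-string)
```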

Best regards,
Jos




Re: [basex-talk] vanishing whitespace with fn:doc()

2021-02-16 Thread Jos van den Oever
Thanks for the context.

Still, it does not explain the difference in behavior between doc() and 
parse-xml().

As far as I understand the XDM specification, whitespace may be ignored by the 
parser if there is a DTD or XML Schema that says that an element is not PCDATA 
(DTD) or mixed (XML Schema). In the absence of (support for) schemas, all 
whitespace should be left in. Wendell Piez has written about this in detail.

Whitespace in XML is tricky. E.g. indenting XML cannot be done well without 
knowing which elements are PCDATA/mixed.

Now that I know about the CHOP option, I can use BaseX predictably. And the 
legacy reasons for keeping it set are understandable.

Best regards,
Jos

On Tuesday, 16 February 2021 23:10:05 CET Christian Grün wrote:
> There is an old (and still open) issue on GitHub [1] that might give you
> some more insight into the history of whitespace chopping in BaseX.
> 
> Hope this helps
> Christian
> 
> [1] https://github.com/BaseXdb/basex/issues/913





Re: [basex-talk] vanishing whitespace with fn:doc()

2021-02-16 Thread Jos van den Oever
Then to pass the XQuery test suite you probably use CHOP=OFF.
Are there other settings needed to be compliant?

On Wednesday, 17 February 2021 00:04:38 CET Christian Grün wrote:
> Yes, you are certainly right. I think it was around 2007 when we chopped
> whitespaces by default, although we knew it didn't comply with the
> specification. One reason was that we rarely worked with mixed-content data
> at that time, and the whitespace indentations increased the size of
> databases and led to worse rendering results in the built-in visualizations
> (our first users were confused about that).
> 
> Maybe we’ll switch the default in a future version of BaseX.


