[basex-talk] Whitespace

2024-02-14 Thread Owen Ambur
The inability to handle whitespace (and punctuation) appropriately 
results in false positives in our StratML-enabled query service at 
https://search.aboutthem.info/
I look forward to learning whether anything can be done about it.
Owen Ambur
https://www.linkedin.com/in/owenambur/
 

On Wednesday, February 14, 2024 at 05:38:41 AM EST, Imsieke, Gerrit, le-tex 
 wrote:  
 
 Whitespace is probably only a minor factor here. It can’t explain the loading 
times that grow non-linearly with document count.

Dietmar, have you looked at the memory consumption? My experience is that if 
memory gets scarce, garbage collection will kick in frequently, slowing down 
the import process. Increasing -Xmx in the startup script might improve the 
import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for 
example, and see whether there is an improvement. You can see the memory 
consumption in the GUI, so try to create the DB from the GUI.

Gerrit

On 14.02.2024 10:48, Christian Grün wrote:
> Thanks for the addition, Liam; I should have mentioned that.
> 
> If your input has mixed content, and if the relevant sections have 
> xml:space='preserve' attributes…
> 
> The very tc34q.
> 
> …whitespace stripping will be safe.
> 
> Similarly, it may be helpful to know that the whitespace gets lost if XML 
> strings…
> 
> The very tc34q.
> 
> …are evaluated as XQuery. To prevent that, you can add a statement to the 
> prolog of the query:
> 
> declare boundary-space preserve;
> The very tc34q.
> 
> Whitespace handling is generally a tricky issue in XML.
> 
> Best,
> Christian
> 
> 
> On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin wrote:
> 
>    On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
>>
>>    If your XML input has been properly indented to improve readability, you 
>>can reduce the size of your database by dropping superfluous whitespace 
>>during the import:
>>
>>    SET STRIPWS ON; CREATE DB ...
>>    db:create('db', '/path/to/documents', (), map { 'stripws': true() })
> 
>    Beware that this is not schema-based, and can remove whitespace nodes in 
>mixed content -
>    The very tc34q.
>    may become (as i understand it)
>          The verytc34q.
>    (i have seen this, with different software, cause potentially catastrophic 
>problems in aircraft manuals!)
> 
>    liam
> 
  

Re: [basex-talk] Help with loading of 9 million documents

2024-02-14 Thread Imsieke, Gerrit, le-tex

Whitespace is probably only a minor factor here. It can’t explain the loading 
times that grow non-linearly with document count.

Dietmar, have you looked at the memory consumption? My experience is that if 
memory gets scarce, garbage collection will kick in frequently, slowing down 
the import process. Increasing -Xmx in the startup script might improve the 
import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for 
example, and see whether there is an improvement. You can see the memory 
consumption in the GUI, so try to create the DB from the GUI.

Gerrit

On 14.02.2024 10:48, Christian Grün wrote:

Thanks for the addition, Liam; I should have mentioned that.

If your input has mixed content, and if the relevant sections have 
xml:space='preserve' attributes…

The very tc34q.

…whitespace stripping will be safe.

Similarly, it may be helpful to know that the whitespace gets lost if XML 
strings…

The very tc34q.

…are evaluated as XQuery. To prevent that, you can add a statement to the 
prolog of the query:

declare boundary-space preserve;
The very tc34q.

Whitespace handling is generally a tricky issue in XML.

Best,
Christian


On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin <l...@fromoldbooks.org> wrote:

On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:


If your XML input has been properly indented to improve readability, you 
can reduce the size of your database by dropping superfluous whitespace during 
the import:

SET STRIPWS ON; CREATE DB ...
db:create('db', '/path/to/documents', (), map { 'stripws': true() })


Beware that this is not schema-based, and can remove whitespace nodes in 
mixed content -
The very tc34q.
may become (as i understand it)
     The verytc34q.
(i have seen this, with different software, cause potentially catastrophic 
problems in aircraft manuals!)

liam



Re: [basex-talk] Help with loading of 9 million documents

2024-02-14 Thread Christian Grün
Thanks for the addition, Liam; I should have mentioned that.

If your input has mixed content, and if the relevant sections have
xml:space='preserve' attributes…

The very tc34q.

…whitespace stripping will be safe.
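
To illustrate how an inherited xml:space='preserve' attribute can protect mixed content, here is a minimal Python sketch. It is not BaseX's implementation; the function names (strip_ws, text_of) and the sample document are hypothetical and only model the general idea of skipping whitespace-only text nodes while preservation is in scope.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML

def strip_ws(element, inherited=False):
    """Drop whitespace-only text nodes unless xml:space='preserve' is in scope."""
    value = element.getAttribute("xml:space")
    # xml:space is inherited by descendants; 'default' switches it off again.
    preserve = inherited if value == "" else value == "preserve"
    for child in list(element.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not preserve and not child.data.strip(XML_WS):
                element.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_ws(child, preserve)

def text_of(node):
    """Concatenate all text content below a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)

# Hypothetical sample: two identical paragraphs, only the first is protected.
doc = minidom.parseString(
    '<doc>\n'
    '  <p xml:space="preserve">The <b>very</b> <i>important</i> step.</p>\n'
    '  <p>The <b>very</b> <i>important</i> step.</p>\n'
    '</doc>'
)
strip_ws(doc.documentElement)
texts = [text_of(p) for p in doc.getElementsByTagName("p")]
print(texts[0])  # prints: The very important step.
print(texts[1])  # prints: The veryimportant step.
```

Only the unprotected paragraph loses the space between its inline elements, while the indentation between the paragraphs is removed in both cases.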

Similarly, it may be helpful to know that the whitespace gets lost if XML
strings…

The very tc34q.

…are evaluated as XQuery. To prevent that, you can add a statement to the
prolog of the query:

declare boundary-space preserve;
The very tc34q.

Whitespace handling is generally a tricky issue in XML.

Best,
Christian


On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin wrote:

> On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
>
>
> If your XML input has been properly indented to improve readability, you
> can reduce the size of your database by dropping superfluous whitespace
> during the import:
>
> SET STRIPWS ON; CREATE DB ...
> db:create('db', '/path/to/documents', (), map { 'stripws': true() })
>
>
> Beware that this is not schema-based, and can remove whitespace nodes in
> mixed content -
> The very tc34q.
> may become (as i understand it)
> The verytc34q.
> (i have seen this, with different software, cause potentially catastrophic
> problems in aircraft manuals!)
>
> liam
>
> --
>
> Liam Quin, https://www.delightfulcomputing.com/
> Available for XML/Document/Information Architecture/XSLT/
> XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
> Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org
>


Re: [basex-talk] Help with loading of 9 million documents

2024-02-14 Thread Liam R. E. Quin
On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
> 
> If your XML input has been properly indented to improve readability,
> you can reduce the size of your database by dropping superfluous
> whitespace during the import:
> 
> SET STRIPWS ON; CREATE DB ...
> db:create('db', '/path/to/documents', (), map { 'stripws': true() })

Beware that this is not schema-based, and can remove whitespace nodes
in mixed content -
    The very tc34q.
may become (as i understand it)
    The verytc34q.
(i have seen this, with different software, cause potentially
catastrophic problems in aircraft manuals!)
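
The effect can be reproduced outside of any particular database with a small Python sketch. It assumes a naive, schema-unaware stripper that deletes every whitespace-only text node; the helper names (strip_whitespace_nodes, text_of) and the sample paragraph are hypothetical, and nothing here claims to mirror how STRIPWS is implemented.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML

def strip_whitespace_nodes(node):
    """Remove every whitespace-only text node (naive, not schema-aware)."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip(XML_WS):
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)

def text_of(node):
    """Concatenate all text content below a node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(c) for c in node.childNodes)

# Hypothetical mixed content: the single space between </b> and <i> is
# a whitespace-only text node, yet it separates two words.
doc = minidom.parseString("<p>The <b>very</b> <i>important</i> step.</p>")
before = text_of(doc.documentElement)
strip_whitespace_nodes(doc.documentElement)
after = text_of(doc.documentElement)
print(before)  # prints: The very important step.
print(after)   # prints: The veryimportant step.
```

Text nodes that contain any non-whitespace character, such as "The " at the start, keep their spaces; only the node that is whitespace from end to end disappears, which is why the damage is easy to miss in testing.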

liam

-- 
Liam Quin, https://www.delightfulcomputing.com/
Available for XML/Document/Information Architecture/XSLT/
XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
Barefoot Web-slave, antique illustrations:  http://www.fromoldbooks.org