The inability to handle whitespace (and punctuation) appropriately 
results in false positives in our StratML-enabled query service at 
https://search.aboutthem.info/
I look forward to learning whether anything can be done about it.
Owen Ambur
https://www.linkedin.com/in/owenambur/
 

    On Wednesday, February 14, 2024 at 05:38:41 AM EST, Imsieke, Gerrit, le-tex 
<gerrit.imsi...@le-tex.de> wrote:  
 
 Whitespace is probably only a minor factor here. It can’t explain the loading 
times that grow non-linearly with document count.

Dietmar, have you looked at the memory consumption? My experience is that if 
memory gets scarce, garbage collection will kick in frequently, slowing down 
the import process. Increasing -Xmx in the startup script might improve the 
import speed. If your computer has 16 GB of RAM, try setting -Xmx12g, for 
example, and see whether there is an improvement. You can see the memory 
consumption in the GUI, so try to create the DB from the GUI.

Gerrit

On 14.02.2024 10:48, Christian Grün wrote:
> Thanks for the addition, Liam; I should have mentioned that.
> 
> If your input has mixed content, and if the relevant sections have 
> xml:space='preserve' attributes…
> 
> <p xml:space='preserve'>The <em>very</em> <id>tc34q</id>.</p>
> 
> …whitespace stripping will be safe.
> 
> Similarly, it may be helpful to know that the whitespace gets lost if XML 
> strings…
> 
> <p>The <em>very</em> <id>tc34q</id>.</p>
> 
> …are evaluated as XQuery. To prevent that, you can add a statement to the 
> prolog of the query:
> 
> declare boundary-space preserve;
> <p>The <em>very</em> <id>tc34q</id>.</p>
> 
> Whitespace handling is generally a tricky issue in XML.
> 
> Best,
> Christian
> 
> 
> On Wed, Feb 14, 2024 at 10:38 AM Liam R. E. Quin <l...@fromoldbooks.org 
> <mailto:l...@fromoldbooks.org>> wrote:
> 
>    On Tue, 2024-02-13 at 20:29 +0100, Christian Grün wrote:
>>
>>    If your XML input has been properly indented to improve readability, you 
>>    can reduce the size of your database by dropping superfluous whitespace 
>>    during the import:
>>
>>    SET STRIPWS ON; CREATE DB ...
>>    db:create('db', '/path/to/documents', (), map { 'stripws': true() })
> 
>    Beware that this is not schema-based, and can remove whitespace nodes in 
>    mixed content -
>    <p>The <em>very</em> <id>tc34q</id>.</p>
>    may become (as i understand it)
>          <p>The <em>very</em><id>tc34q</id>.</p>
>    (i have seen this, with different software, cause potentially catastrophic 
>    problems in aircraft manuals!)
> 
>    liam
> 
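
The mixed-content caveat quoted above, and the xml:space='preserve' safeguard, can be illustrated outside BaseX. The following is a minimal Python sketch, not BaseX's implementation: it assumes that stripping means removing whitespace-only text nodes, as Liam describes, and the helper names strip_ws and stripped are made up for this illustration.

```python
from xml.dom import minidom

# Namespace bound to the reserved "xml" prefix (used by xml:space).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def strip_ws(node, preserve=False):
    """Remove whitespace-only text nodes below `node`.

    Hypothetical helper: models schema-unaware whitespace stripping and
    skips subtrees where xml:space='preserve' is in scope.
    """
    if node.nodeType == node.ELEMENT_NODE:
        space = node.getAttributeNS(XML_NS, "space")
        if space == "preserve":
            preserve = True
        elif space == "default":
            preserve = False
    # Iterate over a copy, because children are removed while looping.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not preserve and child.data.strip() == "":
                node.removeChild(child)
        else:
            strip_ws(child, preserve)


def stripped(xml_text):
    """Parse `xml_text`, strip it, and return the serialized root element."""
    doc = minidom.parseString(xml_text)
    strip_ws(doc)
    return doc.documentElement.toxml()


# Mixed content: the single space between </em> and <id> is a
# whitespace-only text node, so it is dropped and the words run together.
print(stripped("<p>The <em>very</em> <id>tc34q</id>.</p>"))
# prints: <p>The <em>very</em><id>tc34q</id>.</p>

# With xml:space='preserve' on the paragraph, the space survives.
print(stripped("<p xml:space='preserve'>The <em>very</em> <id>tc34q</id>.</p>"))
```

Indentation between elements is removed in the same way, which is where the size reduction comes from. The sketch cannot tell indentation apart from a significant space in mixed content, which is the risk described in the thread.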
  
