Hello Gioele,

I seem to recall that the use of namespaces slowed down (or maybe 
invalidated) the structure index.
Someone @BaseX will certainly correct me if I am wrong,
but if your data uses a single namespace, what about reloading it with the 
"strip namespaces" option (STRIPNS) enabled and testing whether performance 
improves?
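
As a rough sketch of what I mean (the database name and file path are 
placeholders, and the option name is written as I remember it from the 
BaseX documentation, so please check it against your version):

```xquery
(: Assumption: 'dict-nons' and the path are hypothetical placeholders.
   The 'stripns' option asks the parser to drop namespaces on import. :)
db:create(
  'dict-nons',
  '/path/to/dictionary.xml',
  (),
  map { 'stripns': true() }
)
```

The query would then have to be written without the `tei:` prefixes when 
run against this copy of the data.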

Another solution could be to create an index collection, where the keys would 
be your search terms and the values the node-pre or node-id of your 
(sub-)documents.
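
A minimal sketch of such an index, assuming the TEI data sits in a database 
called 'dict' and using 'dict-index' as a made-up name for the lookup 
database (function names are the BaseX 8.x ones; untested on your data):

```xquery
declare namespace tei = 'http://www.tei-c.org/ns/1.0';

(: Assumption: 'dict' and 'dict-index' are hypothetical database names.
   One <term> element is stored per entry and distinct orth value. :)
let $index :=
  <index>{
    for $e in db:open('dict')//*[self::tei:entry or self::tei:re]
    for $orth in distinct-values($e/tei:form/tei:orth)
    return <term key="{ $orth }" id="{ db:node-id($e) }"/>
  }</index>
return db:create('dict-index', $index, 'index.xml')
```

The lookup has to be a separate query, since the new database only exists 
once the first one has finished:

```xquery
(: Fetch the original entries whose orth value is "arci". :)
for $t in db:open('dict-index')//term[@key = 'arci']
return db:open-id('dict', xs:integer($t/@id))
```

The `xml:lang` condition is not covered by this index and would still need 
to be checked on the returned nodes. The index also has to be rebuilt 
whenever 'dict' is updated.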

Best regards,
Fabrice


-----Original Message-----
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Gioele 
Barabucci
Sent: Friday, 12 June 2015 10:42
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Optimization of a slow query with `//`

Hello,

I am working on an application that retrieves its data from a TEI XML file via 
BaseX. The following query lies at the core of this application but is too slow 
to be used in production: on a modern PC it requires about 600 ms to run over a 
4MB file (1/10 of the complete dataset). Any suggestion on how to improve its 
performance (without changing the underlying TEI files) would be much 
appreciated.

Here is the query:

     declare namespace tei='http://www.tei-c.org/ns/1.0';

     /tei:TEI/tei:text/tei:body//
       *[self::tei:entry or self::tei:re]
       [./tei:form/tei:orth[. = "arci"]
         [ancestor-or-self::*
           [@xml:lang][1]
           [(starts-with(@xml:lang, "san"))]
         ]
       ]

In human terms, it should return all the `tei:entry` or `tei:re` elements that

* have the word "arci" in their `/tei:form/tei:orth` element, and
* have a nearest `xml:lang` attribute that starts with 'san'.

I made some tests and it turned out that the main culprit is the use of `//` in 
the first line. (_Main_ culprit, not the only one...)

I use the `//` axis because I do not know the structure of the 
underlying TEI file. I expect BaseX to keep track of all the `tei:entry` and 
`tei:re` elements and their parents, so selecting the correct ones should be 
quite fast anyway. But the measurements disagree with my assumptions...

What could I do to improve the performance of this query?


Now, some remarks based on some small tests I have done:

1. Removing the

     [ancestor-or-self::*[....]]

predicate slashes the run time in half, but the query is still way too slow.

2. Changing

     ./tei:form/tei:orth[. = "arci"]

to

     ./tei:form[1]/tei:orth[1][. = "arci"]

makes the query even slower.

3. Changing `starts-with(@xml:lang, "san")` to `@xml:lang = 'san-xxx'` has a 
negligible effect.

4. Dropping the `[1]` from

     [@xml:lang][1]

makes the whole query twice as fast.

Regards,

--
Gioele Barabucci <gio...@svario.it>
