Re: [basex-talk] Basex Inner Workings

Kirsten, Dirk Fri, 15 Sep 2017 08:56:34 -0700

Hello Athanasios,

I think you should really check the actual query plan that is executed. If you 
see such a huge jump in run time, the processor is surely executing the query 
differently. I don't think looking into the file access patterns BaseX uses 
internally is very useful for an end user. You should let BaseX handle that 
(but of course, if you find better or more efficient ways, I am sure Christian 
gladly accepts pull requests). The pattern you describe sounds very much 
expected: reads when you open databases seem logical, and short write 
operations are also expected when just reading a database, because e.g. BaseX 
has to lock the databases.


So I think it would be more useful to look into the query plan. Of course you 
are more than welcome to ask on this list about what is going on there. I would 
expect that, because of your rewrite, some indexes are no longer applied (or, 
if the rewritten result is simply very big, that most of the time is spent 
serializing the data).

Cheers
Dirk


-----Original Message-----
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: Friday, 15 September 2017 17:35
To: 'Anastasiou A.' <a.anastas...@swansea.ac.uk>; 
basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

You can find the time spent in each step in the query info bar graph.

If you are looking for the schema and the facets of your dataset, you should 
have a look at the index module, and certainly at index:facets().

Best regards,
Fabrice

-----Original Message-----
From: Anastasiou A. [mailto:a.anastas...@swansea.ac.uk]
Sent: Friday, 15 September 2017 17:23
To: Fabrice ETANCHAUD; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

Thank you Fabrice. I understand.

I have not tried querying from the command prompt or sending the output 
directly to a file, which I could also work with. But my understanding is that 
the time quoted by the GUI is the DB time, not taking into account the time it 
takes for the list to be pushed into whatever data structures the list boxes 
might be using (?).

I am trying to get a better understanding of the dataset at the moment, and I 
have short and long queries which, depending on the results I get from this 
step, could be optimised further.

All the best

-----Original Message-----
From: Fabrice ETANCHAUD [mailto:fetanch...@pch.cerfrance.fr]
Sent: 15 September 2017 16:17
To: Anastasiou A.; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

If I understand correctly, you are reformatting a lot of data, aren't you?
I have only a little advice, because this is not my use case.

From what I know, the resulting document will be materialized entirely in 
memory before presentation or export.
You should export your results to disk, in order not to lose time in BaseXGUI 
rendering.

To reformat very large amounts of data, you might have a look at the Saxon 
streaming features (not in the free version).

But usually, big results are not requested frequently.

Best regards,
Fabrice

-----Original Message-----
From: Anastasiou A. [mailto:a.anastas...@swansea.ac.uk]
Sent: Friday, 15 September 2017 16:39
To: Fabrice ETANCHAUD; basex-talk@mailman.uni-konstanz.de
Subject: RE: Basex Inner Workings

Hello Fabrice

Yes, I have a query whose run time jumped from ~1500 ms to about a minute 
after a tiny change...

The DB is about 2GB and it is my test set before putting the query to work on 
the full dataset.

The change was to go from simply returning the nodes themselves with a 
`return thisNode | thatNode | theOtherNode` to a "formatted" document that has 
an outer <collection> with a number of 
`return <item>{thisNode | thatNode | theOtherNode}</item>` inside it.

I understand that the new query might be creating some new entities but 
compared to the element content, these few extra characters are not THAT many 
more.

The query jumps from ~1500 ms when using plain XML, to ~55000 ms with the 
addition of the collection and item nodes, to ~57000 ms with the addition of 
CSV exporting via the CSV module. These are "informal average" values: I have 
not run the same query several times and computed the average, but that is the 
vicinity of the numbers I have seen when running the queries so far.
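
To turn such "informal averages" into repeatable figures, a minimal sketch in 
Python follows. It only measures client-side wall-clock time around a callable. 
`run_query` is a hypothetical placeholder for whatever actually submits the 
query to BaseX (it is not a BaseX API), and `time_runs` and `slowdown` are 
illustrative names, so the numbers would include client and transfer overhead 
in addition to the DB time.

```python
import statistics
import time
from typing import Callable, Dict


def time_runs(run_query: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Call run_query `repeats` times and summarise wall-clock time in ms.

    run_query is a hypothetical stand-in for submitting the query to BaseX.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": float(repeats),
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


def slowdown(before_ms: float, after_ms: float) -> float:
    """Return how many times slower the second measurement is than the first."""
    return after_ms / before_ms


if __name__ == "__main__":
    # Placeholder workload so the sketch runs on its own; replace with a real call.
    print(time_runs(lambda: sum(range(100_000)), repeats=5))
    # Ratio of the two figures quoted in this message (~1500 ms vs ~55000 ms).
    print(round(slowdown(1500, 55000), 1))
```

The median is reported next to the mean because a single cold-cache run can 
skew the mean. With the figures quoted here, `slowdown(1500, 55000)` comes out 
at roughly 37 times.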

The database itself is "static": there are no update/insert transactions at 
the moment. The only thing that I am trying to do is extract some data from it 
in a different format.

I have Text, Attribute and Token indexes on that database (optimised right 
after importing) but no further options enabled. I also have not experimented 
with the SPLITSIZE (?). I have 32GB of memory and it should be enough to handle 
this 2GB test dataset (?). I will have a go with DEBUG on.

Did you have to enable any additional options for indexes to work faster?

All the best





-----Original Message-----
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Fabrice 
ETANCHAUD
Sent: 15 September 2017 13:27
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Basex Inner Workings

Hi Athanasios,

Did you experience slow queries?
Are you sure you are using all the index features?
Are these queries operational ones (direct access on a key value) or analytics?

I have never experienced slow queries, even on huge XML corpora (patent 
registrations), but this comes at the cost of longer indexing times on updates.

Best regards,


-----Original Message-----
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Anastasiou A.
Sent: Friday, 15 September 2017 14:01
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Basex Inner Workings

Hello everyone

Quick question: Is there any document / URL where I could find out more about 
how BaseX accesses the disk during its operation?

For example, are there any reads to be expected while executing a query?

Through iotop, I can see 3-4 processes reading during startup, then another 2, 
very briefly firing when opening the database and then during querying there 
are periodic writes (?) but of very brief duration.

I was wondering if there is anything that could be done from the point of view 
of the hardware to speed up queries (?) (except a more powerful machine at the 
moment)

All  the best
Athanasios Anastasiou
