Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-18 Thread Christian Grün
Hi Lucian,

Thanks for taking the time to rerun the tests and do some profiling.

> When inserting a 160 KB XML structure 100.000 times, the persist operation
> duration starts at ~45 ms and reaches ~2000 ms after 68.000 persist
> invocations and 16 hours of run time (!)

Indeed this differs quite a lot from the tests I have made so far, and
from the patterns I am used to.

It was helpful to have a look into the Java profiling files: A plain
FileOutputStream.open call takes most of the time, while it’s hardly
measurable in my own tests. Do you work with a local file system?
Maybe the file listing of your database directory could shed some more
light here.

I would additionally assume that you were closing and
opening your database after each addition, right? Obviously this makes
sense if no bulk operations take place; it's just different from what
I did in my tests.

> There are actually two AUTOFLUSH-related issues: […]

You are obviously right: AUTOFLUSH should only be disabled for bulk
operations, and left enabled if persistence of the data is critical. And the
addition of documents will always be faster if the database is small
(but it should definitely not take more than a second to add a single
small document).

Best,
Christian
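
[Editorial sketch] The cost being weighed in this exchange is the per-operation disk flush. The batching idea can be illustrated with plain java.nio file I/O rather than the BaseX API; the record format and batch size below are arbitrary assumptions, and `force()` plays the role of an explicit FLUSH:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BatchedFlush {
    /** Writes n small records, forcing data to disk only every batchSize
     *  writes. Returns the number of force() calls (the analogue of FLUSH). */
    static int writeBatched(Path file, int n, int batchSize) throws IOException {
        int flushes = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            for (int i = 0; i < n; i++) {
                ch.write(ByteBuffer.wrap(("doc" + i + "\n").getBytes()));
                if ((i + 1) % batchSize == 0) {
                    ch.force(false);   // flush once per batch, not per write
                    flushes++;
                }
            }
            ch.force(false);           // final flush so nothing is lost
            flushes++;
        }
        return flushes;
    }
}
```

With autoflush-per-write the loop above would call `force()` n times; batching reduces that to n/batchSize + 1, which is the trade-off between throughput and durability discussed in the thread.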


Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-14 Thread Bram Vanroy | KU Leuven
Possibly related, but I'm not sure:

When creating millions of databases in a loop in the same session, I found that 
after some thousands I'd get an OOM error from BaseX. This seemed odd to me, 
because the database creation query was closed after each iteration (and I'd 
expect GC to run at such a time?). To bypass this I simply closed the session 
and opened a new one every few thousand iterations of the loop.

Maybe there is a (small) memory leak somewhere in BaseX that only becomes 
noticeable (and annoying) after hundreds of thousands or even millions of 
queries? 
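
[Editorial sketch] The workaround Bram describes can be sketched roughly as follows. The `Session` interface here is only a stand-in for a real BaseX client session, not the actual API; the recycle interval is an arbitrary assumption:

```java
import java.util.function.Supplier;

public class SessionRecycle {
    static final int RECYCLE_EVERY = 5_000;

    /** Stand-in for a BaseX client session (assumption, not the real API). */
    interface Session extends AutoCloseable {
        void execute(String command) throws Exception;
        @Override void close();               // narrowed: no checked exception
    }

    /** Creates one database per iteration; returns how often the session
     *  was closed and reopened, so per-session state cannot accumulate. */
    static int run(int totalDatabases, Supplier<Session> factory) throws Exception {
        int recycles = 0;
        Session session = factory.get();
        try {
            for (int i = 0; i < totalDatabases; i++) {
                session.execute("CREATE DB db" + i);
                if ((i + 1) % RECYCLE_EVERY == 0) {  // drop the session, start fresh
                    session.close();
                    session = factory.get();
                    recycles++;
                }
            }
        } finally {
            session.close();
        }
        return recycles;
    }
}
```

If the OOM really is caused by per-session state, this pattern bounds the accumulation without changing the work being done.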

-----Original Message-----
From: basex-talk-boun...@mailman.uni-konstanz.de 
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On behalf of Christian Grün
Sent: Saturday, 14 January 2017 12:09
To: Bularca, Lucian <lucian.bula...@mueller.de>
CC: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting 
more than 5000 XML data structures of 160 KB each.

Hi Lucian,

I have a hard time reproducing the reported behavior. The attached, revised 
Java example (without AUTOFLUSH) required around 30 ms for the first documents 
and 120 ms for the last documents, which is still pretty far from what you’ve 
been encountering:

> would go from ~ 10 ms at the beginning to ~ 2500 ms

But obviously something weird has been going on in your setup. Let’s see what 
alternatives we have…

• Could you possibly try to update my example code such that it shows the 
reported behavior? Ideally with small input, in order to speed up the process. 
Maybe the runtime increase can also be demonstrated after
1.000 or 10.000 documents...
• You could also send me a list of the files of your test_database directory; 
maybe the file sizes indicate some unusual patterns.
• You could start BaseXServer with the JVM flag -Xrunhprof:cpu=samples (to be 
inserted in the basexserver script), start the server, run your script, stop 
the server directly afterwards, and send me the result file, which will be 
stored in the directory from where you started BaseX (java.hprof.txt).

Best,
Christian


On Wed, Jan 11, 2017 at 4:57 PM, Christian Grün <christian.gr...@gmail.com> 
wrote:
> Hi Lucian,
>
> Thanks for your analysis. Indeed I'm wondering about the monotonic 
> delay caused by auto-flushing the data; this hasn't always been the 
> case. I'm wondering even more why no one else has noticed this 
> recently… Maybe it was introduced not too long ago. It may 
> take some time to find the culprit, but I'll keep you updated.
>
> All the best,
> Christian
>
>
> On Wed, Jan 11, 2017 at 2:46 PM, Bularca, Lucian 
> <lucian.bula...@mueller.de> wrote:
>> Hi Christian,
>>
>> I've made a comparison of the persistence time series running your example 
>> code and mine, in all possible combinations of the following scenarios:
>> - with and without "set intparse on"
>> - using my prepared test data and your test data
>> - closing and opening the DB connection every n-th insertion 
>> operation (where n in {5, 100, 500, 1000})
>> - with and without "set autoflush on".
>>
>> I finally found out that the only relevant variable influencing the 
>> insert operation duration is the value of the AUTOFLUSH option.
>>
>> If AUTOFLUSH = OFF when opening a database, the persistence durations 
>> remain relatively constant (on my machine, about 43 ms) during the entire 
>> insert operation sequence (50.000 or 100.000 times), for all 
>> combinations named above.
>>
>> If AUTOFLUSH = ON when opening a database, the persistence durations 
>> increase monotonically, for all combinations named above.
>>
>> With AUTOFLUSH = ON, the persistence duration is directly proportional to 
>> the number of DB clients executing these insert operations, and likewise 
>> to the length of the insert operation sequence executed by a DB client.
>>
>> In my opinion, this behaviour is an issue in BaseX: AUTOFLUSH is 
>> implicitly set to ON (see the BaseX documentation at 
>> http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly 
>> set AUTOFLUSH = OFF in order to keep the insert operation durations 
>> relatively constant over time. At the same time, not flushing data 
>> explicitly increases the risk of data loss (see the same documentation 
>> page), but clients who repeatedly execute the FLUSH command increase the 
>> durations of the subsequent insert operations.
>>
>> Regards,
>> Lucian
>>
>> ____________
>> From: Christian Grün [christian.gr...@gmail.com]
>> Sent: Tuesday, 10 January 2017 17:33
>> To: Bularca, Lucian

Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-14 Thread Christian Grün
Hi Lucian,

I have a hard time reproducing the reported behavior. The attached,
revised Java example (without AUTOFLUSH) required around 30 ms for the
first documents and 120 ms for the last documents, which is still
pretty far from what you’ve been encountering:

> would go from ~ 10 ms at the beginning to ~ 2500 ms

But obviously something weird has been going on in your setup. Let’s
see what alternatives we have…

• Could you possibly try to update my example code such that it shows
the reported behavior? Ideally with small input, in order to speed up
the process. Maybe the runtime increase can also be demonstrated after
1.000 or 10.000 documents...
• You could also send me a list of the files of your test_database
directory; maybe the file sizes indicate some unusual patterns.
• You could start BaseXServer with the JVM flag -Xrunhprof:cpu=samples
(to be inserted in the basexserver script), start the server, run your
script, stop the server directly afterwards, and send me the result
file, which will be stored in the directory from where you started
BaseX (java.hprof.txt).

Best,
Christian


On Wed, Jan 11, 2017 at 4:57 PM, Christian Grün
<christian.gr...@gmail.com> wrote:
> Hi Lucian,
>
> Thanks for your analysis. Indeed I'm wondering about the monotonic
> delay caused by auto-flushing the data; this hasn't always been the
> case. I'm wondering even more why no one else has noticed this
> recently… Maybe it was introduced not too long ago. It may
> take some time to find the culprit, but I'll keep you updated.
>
> All the best,
> Christian
>
>
> On Wed, Jan 11, 2017 at 2:46 PM, Bularca, Lucian
> <lucian.bula...@mueller.de> wrote:
>> Hi Christian,
>>
>> I've made a comparison of the persistence time series running your example 
>> code and mine, in all possible combinations of the following scenarios:
>> - with and without "set intparse on"
>> - using my prepared test data and your test data
>> - closing and opening the DB connection every n-th insertion operation 
>> (where n in {5, 100, 500, 1000})
>> - with and without "set autoflush on".
>>
>> I finally found out that the only relevant variable influencing the 
>> insert operation duration is the value of the AUTOFLUSH option.
>>
>> If AUTOFLUSH = OFF when opening a database, the persistence durations 
>> remain relatively constant (on my machine, about 43 ms) during the entire 
>> insert operation sequence (50.000 or 100.000 times), for all 
>> combinations named above.
>>
>> If AUTOFLUSH = ON when opening a database, the persistence durations 
>> increase monotonically, for all combinations named above.
>>
>> With AUTOFLUSH = ON, the persistence duration is directly proportional to 
>> the number of DB clients executing these insert operations, and likewise 
>> to the length of the insert operation sequence executed by a DB client.
>>
>> In my opinion, this behaviour is an issue in BaseX: AUTOFLUSH is 
>> implicitly set to ON (see the BaseX documentation at 
>> http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly 
>> set AUTOFLUSH = OFF in order to keep the insert operation durations 
>> relatively constant over time. At the same time, not flushing data 
>> explicitly increases the risk of data loss (see the same documentation 
>> page), but clients who repeatedly execute the FLUSH command increase the 
>> durations of the subsequent insert operations.
>>
>> Regards,
>> Lucian
>>
>> ________________
>> From: Christian Grün [christian.gr...@gmail.com]
>> Sent: Tuesday, 10 January 2017 17:33
>> To: Bularca, Lucian
>> Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de
>> Subject: Re: [basex-talk] Severe performance degradation when persisting 
>> more than 5000 XML data structures of 160 KB each.
>>
>> Hi Lucian,
>>
>> I couldn’t run your code example out of the box. 24 hours sounds
>> pretty alarming, though, so I have written my own example (attached).
>> It creates 50.000 XML documents, each sized around 160 KB. It's not as
>> fast as I had expected, but the total runtime is around 13 minutes,
>> and it only slows down a little when adding more documents...
>>
>> 1: 125279.45 ms
>> 2: 128244.23 ms
>> 3: 130499.9 ms
>> 4: 132286.05 ms
>> 5: 134814.82 ms
>>
>> Maybe you could compare the code with yours, and we can find out what
>> causes the delay?
>>
>> Best,
>> Christian
>>
>>
>> On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian <lucian.bula...@mueller.de> wrote:

Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-11 Thread Christian Grün
Hi Lucian,

Thanks for your analysis. Indeed I'm wondering about the monotonic
delay caused by auto-flushing the data; this hasn't always been the
case. I'm wondering even more why no one else has noticed this
recently… Maybe it was introduced not too long ago. It may
take some time to find the culprit, but I'll keep you updated.

All the best,
Christian


On Wed, Jan 11, 2017 at 2:46 PM, Bularca, Lucian
<lucian.bula...@mueller.de> wrote:
> Hi Christian,
>
> I've made a comparison of the persistence time series running your example 
> code and mine, in all possible combinations of the following scenarios:
> - with and without "set intparse on"
> - using my prepared test data and your test data
> - closing and opening the DB connection every n-th insertion operation 
> (where n in {5, 100, 500, 1000})
> - with and without "set autoflush on".
>
> I finally found out that the only relevant variable influencing the 
> insert operation duration is the value of the AUTOFLUSH option.
>
> If AUTOFLUSH = OFF when opening a database, the persistence durations 
> remain relatively constant (on my machine, about 43 ms) during the entire 
> insert operation sequence (50.000 or 100.000 times), for all 
> combinations named above.
>
> If AUTOFLUSH = ON when opening a database, the persistence durations 
> increase monotonically, for all combinations named above.
>
> With AUTOFLUSH = ON, the persistence duration is directly proportional to 
> the number of DB clients executing these insert operations, and likewise 
> to the length of the insert operation sequence executed by a DB client.
>
> In my opinion, this behaviour is an issue in BaseX: AUTOFLUSH is 
> implicitly set to ON (see the BaseX documentation at 
> http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly 
> set AUTOFLUSH = OFF in order to keep the insert operation durations 
> relatively constant over time. At the same time, not flushing data 
> explicitly increases the risk of data loss (see the same documentation 
> page), but clients who repeatedly execute the FLUSH command increase the 
> durations of the subsequent insert operations.
>
> Regards,
> Lucian
>
> 
> From: Christian Grün [christian.gr...@gmail.com]
> Sent: Tuesday, 10 January 2017 17:33
> To: Bularca, Lucian
> Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de
> Subject: Re: [basex-talk] Severe performance degradation when persisting 
> more than 5000 XML data structures of 160 KB each.
>
> Hi Lucian,
>
> I couldn’t run your code example out of the box. 24 hours sounds
> pretty alarming, though, so I have written my own example (attached).
> It creates 50.000 XML documents, each sized around 160 KB. It's not as
> fast as I had expected, but the total runtime is around 13 minutes,
> and it only slows down a little when adding more documents...
>
> 1: 125279.45 ms
> 2: 128244.23 ms
> 3: 130499.9 ms
> 4: 132286.05 ms
> 5: 134814.82 ms
>
> Maybe you could compare the code with yours, and we can find out what
> causes the delay?
>
> Best,
> Christian
>
>
> On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian
> <lucian.bula...@mueller.de> wrote:
>> Hi Dirk,
>>
>>  of course, querying millions of data entries in a single database raises
>> problems. This is equally problematic for all databases, not only for
>> BaseX, and certain storage strategies will be mandatory in
>> production.
>>
>> The actual problem is that adding 50.000 XML structures of 160 KB each
>> took 24 hours, because of that inexplicable monotonic increase of the
>> insert operation durations.
>>
>> I'd really appreciate it if someone could explain this behaviour, or if a
>> counterexample could demonstrate that the cause lies in the test
>> case and not in the database itself.
>>
>> Regards,
>> Lucian


Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-11 Thread Bularca, Lucian
Hi Christian,

I've made a comparison of the persistence time series running your example 
code and mine, in all possible combinations of the following scenarios: 
- with and without "set intparse on"
- using my prepared test data and your test data
- closing and opening the DB connection every n-th insertion operation (where 
n in {5, 100, 500, 1000})
- with and without "set autoflush on".

I finally found out that the only relevant variable influencing the insert 
operation duration is the value of the AUTOFLUSH option. 

If AUTOFLUSH = OFF when opening a database, the persistence durations 
remain relatively constant (on my machine, about 43 ms) during the entire 
insert operation sequence (50.000 or 100.000 times), for all combinations 
named above.

If AUTOFLUSH = ON when opening a database, the persistence durations 
increase monotonically, for all combinations named above. 

With AUTOFLUSH = ON, the persistence duration is directly proportional to the 
number of DB clients executing these insert operations, and likewise to the 
length of the insert operation sequence executed by a DB client.

In my opinion, this behaviour is an issue in BaseX: AUTOFLUSH is 
implicitly set to ON (see the BaseX documentation at 
http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly 
set AUTOFLUSH = OFF in order to keep the insert operation durations relatively 
constant over time. At the same time, not flushing data explicitly increases 
the risk of data loss (see the same documentation page), but clients who 
repeatedly execute the FLUSH command increase the durations of the subsequent 
insert operations.

Regards,
Lucian
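
[Editorial sketch] Lucian's findings point to the usual bulk-load pattern: switch AUTOFLUSH off for the bulk insert and flush once at the end. `SET AUTOFLUSH false`, `ADD`, and `FLUSH` are BaseX commands; the `Session` interface below is a minimal stand-in for a real client session, so everything else here is an assumption:

```java
import java.util.ArrayList;
import java.util.List;

public class BulkLoad {
    /** Stand-in for a BaseX client session; execute() would send one command. */
    interface Session { void execute(String command); }

    /** Adds all documents with AUTOFLUSH disabled and a single explicit FLUSH.
     *  Returns the command sequence that was issued. */
    static List<String> bulkAdd(Session s, List<String> docPaths) {
        List<String> issued = new ArrayList<>();
        Session rec = cmd -> { s.execute(cmd); issued.add(cmd); };
        rec.execute("SET AUTOFLUSH false");   // no implicit flush after every ADD
        for (String p : docPaths) {
            rec.execute("ADD " + p);          // bulk insert without per-doc flushes
        }
        rec.execute("FLUSH");                 // one explicit flush for durability
        return issued;
    }
}
```

This keeps the per-insert cost flat, at the price Lucian notes: anything added since the last FLUSH is at risk until the explicit flush runs.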


From: Christian Grün [christian.gr...@gmail.com]
Sent: Tuesday, 10 January 2017 17:33
To: Bularca, Lucian
Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting 
more than 5000 XML data structures of 160 KB each.

Hi Lucian,

I couldn’t run your code example out of the box. 24 hours sounds
pretty alarming, though, so I have written my own example (attached).
It creates 50.000 XML documents, each sized around 160 KB. It's not as
fast as I had expected, but the total runtime is around 13 minutes,
and it only slows down a little when adding more documents...

1: 125279.45 ms
2: 128244.23 ms
3: 130499.9 ms
4: 132286.05 ms
5: 134814.82 ms

Maybe you could compare the code with yours, and we can find out what
causes the delay?

Best,
Christian


On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian
<lucian.bula...@mueller.de> wrote:
> Hi Dirk,
>
>  of course, querying millions of data entries in a single database raises
> problems. This is equally problematic for all databases, not only for
> BaseX, and certain storage strategies will be mandatory in
> production.
>
> The actual problem is that adding 50.000 XML structures of 160 KB each
> took 24 hours, because of that inexplicable monotonic increase of the
> insert operation durations.
>
> I'd really appreciate it if someone could explain this behaviour, or if a
> counterexample could demonstrate that the cause lies in the test
> case and not in the database itself.
>
> Regards,
> Lucian


Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-10 Thread Christian Grün
Hi Lucian,

I couldn’t run your code example out of the box. 24 hours sounds
pretty alarming, though, so I have written my own example (attached).
It creates 50.000 XML documents, each sized around 160 KB. It's not as
fast as I had expected, but the total runtime is around 13 minutes,
and it only slows down a little when adding more documents...

1: 125279.45 ms
2: 128244.23 ms
3: 130499.9 ms
4: 132286.05 ms
5: 134814.82 ms

Maybe you could compare the code with yours, and we can find out what
causes the delay?

Best,
Christian


On Tue, Jan 10, 2017 at 4:44 PM, Bularca, Lucian
<lucian.bula...@mueller.de> wrote:
> Hi Dirk,
>
>  of course, querying millions of data entries in a single database raises
> problems. This is equally problematic for all databases, not only for
> BaseX, and certain storage strategies will be mandatory in
> production.
>
> The actual problem is that adding 50.000 XML structures of 160 KB each
> took 24 hours, because of that inexplicable monotonic increase of the
> insert operation durations.
>
> I'd really appreciate it if someone could explain this behaviour, or if a
> counterexample could demonstrate that the cause lies in the test
> case and not in the database itself.
>
> Regards,
> Lucian


AddDocs.java
Description: Binary data


Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-10 Thread Bularca, Lucian
Hi Dirk,

 of course, querying millions of data entries in a single database raises 
problems. This is equally problematic for all databases, not only for BaseX, 
and certain storage strategies will be mandatory in production.

The actual problem is that adding 50.000 XML structures of 160 KB each took 
24 hours, because of that inexplicable monotonic increase of the insert 
operation durations.

I'd really appreciate it if someone could explain this behaviour, or if a 
counterexample could demonstrate that the cause lies in the test case and 
not in the database itself.


Regards,
Lucian

From: Dirk Kirsten [d...@basex.org]
Sent: Tuesday, 10 January 2017 14:37
To: Bularca, Lucian; basex-talk@mailman.uni-konstanz.de
Subject: Re: AW: [basex-talk] Severe performance degradation when persisting 
more than 5000 XML data structures of 160 KB each.


Hi Lucian,


sorry, I obviously forgot a very important word: "not". I would NOT expect 
adding 100,000 documents to be much of a problem. Sorry for the confusion.


The log file is interesting, it certainly looks like the performance is 
degrading. I can't say much about it, but I am sure Christian (our head 
architect) will give you some pointers when he has time to answer.


However, given the description of your problem, I would advise you to rethink 
your architecture in general. So many continuous updates do not seem very 
performant when done on only a single database. So maybe you want to split up 
your data; e.g., you could put all documents of a certain day into a separate 
database.

Or you could have one "up-to-date" database, which you always update and 
transfer the entries within this database into another database during 
low-performance times. The other database could have proper indexes and 
whatever you need.

Because otherwise you will run into problems when querying your data. I guess 
you don't want to just store your data; you want to do something with it, don't 
you? Just storing data without using it seems a bit useless... And for this you 
probably want to use some indexes, and keeping an index up to date under 
constant updates is quite costly.

To sum it up: I think you want to split up your data in some way into several 
databases.


However, I understand that you will still have something like 100,000 documents 
in a database (which should be fine), so your current performance issue will 
still exist. My comment is more towards your general architecture.


Cheers

Dirk


On 01/10/2017 08:24 PM, Bularca, Lucian wrote:

Hi Dirk,

 thanks for your fast reply :)

Regarding the performance measurement, I forgot to mention that I based my 
statements on the protocol entries in the BaseX log file (see the attached 
basex.log). The intention of the System.out call made in each iteration is just 
to log the sequence number of the added XML structure, not the duration of a 
persist operation. This System.out call does indeed have an impact on the 
overall performance, but it cannot explain the monotonic increase of the insert 
operation durations (see the attached basex.log file). After 24 hours of 
inserting XML test structures, only half of the 100.000 XML test structures had 
been added to the database, at a rate of at most 1 structure per 2 seconds.

All these tests were made against version 8.5.3 of the BaseX database.

In production, we expect peaks of 2,7 * 10^5 XML structures to persist per 24 
hours (~ 31 XML structures per second). Do you mean by "However, I would 
expect 100,000 documents added to be much of a problem." that persisting 
100.000 XML structures in the BaseX database is problematic?


Regards,
Lucian

From: basex-talk-boun...@mailman.uni-konstanz.de 
[basex-talk-boun...@mailman.uni-konstanz.de] on behalf of Dirk Kirsten 
[d...@basex.org]
Sent: Tuesday, 10 January 2017 12:52
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting 
more than 5000 XML data structures of 160 KB each.


Hello Lucian,


please be aware that this is an English-speaking mailing list, as we have many 
users from all over the world and the mailing list is intended to help 
everyone. But as most of our team members are German (well, and Bavarians...) 
we of course understand it. Hence, I answer in English (for all others: Lucian 
seems to have performance issues when adding many documents).


First of all, are you sure your tests sufficiently test the add performance? 
Looking at your file TestBaseXClient.java, it seems not to record the runtimes 
of the individual insertions, but just the overall runtime of, in this case, 
10 insertions.

Also, at least in the example you provided, you do some other stuff 
(especially printing to sysout), which obviously also has a performance impact.

Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-10 Thread Dirk Kirsten
Hi Lucian,


sorry, I obviously forgot a very important word: "not". I would NOT
expect adding 100,000 documents to be much of a problem. Sorry for
the confusion.


The log file is interesting, it certainly looks like the performance is
degrading. I can't say much about it, but I am sure Christian (our head
architect) will give you some pointers when he has time to answer.


However, given the description of your problem, I would advise you to
rethink your architecture in general. So many continuous updates do not
seem very performant when done on only a single database. So maybe you
want to split up your data; e.g., you could put all documents of a
certain day into a separate database.

Or you could have one "up-to-date" database, which you always update and
transfer the entries within this database into another database during
low-performance times. The other database could have proper indexes and
whatever you need.

Because otherwise you will run into problems when querying your data. I
guess you don't want to just store your data; you want to do something
with it, don't you? Just storing data without using it seems a bit
useless... And for this you probably want to use some indexes, and
keeping an index up to date under constant updates is quite costly.

To sum it up: I think you want to split up your data in some way into
several databases.


However, I understand that you will still have something like 100,000
documents in a database (which should be fine), so your current
performance issue will still exist. My comment is more towards your
general architecture.


Cheers

Dirk
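
[Editorial sketch] The per-day split Dirk suggests only needs a deterministic mapping from a document's day to a database name; the naming scheme below is an invented example, not anything prescribed by BaseX:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class DailyRouter {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Database name for a document received on the given day,
     *  e.g. docs_20170110 for 10 January 2017. */
    static String dbFor(LocalDate day) {
        return "docs_" + day.format(FMT);
    }
}
```

Since a BaseX query can open several databases at once, routing writes this way bounds the size of each database without restricting later queries.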


On 01/10/2017 08:24 PM, Bularca, Lucian wrote:

> Hi Dirk,
>
>  thanks for your fast reply :)
>
> Regarding the performance measurement, I forgot to mention that I based
> my statements on the protocol entries in the BaseX log file
> (see the attached basex.log). The intention of the System.out call made
> in each iteration is just to log the sequence number of the added XML
> structure, not the duration of a persist operation. This System.out call
> does indeed have an impact on the overall performance, but it cannot
> explain the monotonic increase of the insert operation durations (see the
> attached basex.log file). After 24 hours of inserting XML
> test structures, only half of the 100.000 XML test structures had
> been added to the database, at a rate of at most 1 structure per 2 seconds.
>
> All these tests were made against version 8.5.3 of the BaseX
> database.
>
> In production, we expect peaks of 2,7 * 10^5 XML structures to
> persist per 24 hours (~ 31 XML structures per second). Do you mean by
> "However, I would expect 100,000 documents added to be much of a
> problem." that persisting 100.000 XML structures in the BaseX
> database is problematic?
>
>
> Regards,
> Lucian
> 
> From: basex-talk-boun...@mailman.uni-konstanz.de
> [basex-talk-boun...@mailman.uni-konstanz.de] on behalf of Dirk
> Kirsten [d...@basex.org]
> Sent: Tuesday, 10 January 2017 12:52
> To: basex-talk@mailman.uni-konstanz.de
> Subject: Re: [basex-talk] Severe performance degradation when
> persisting more than 5000 XML data structures of 160 KB each.
>
> Hello Lucian,
>
>
> please be aware that this is an English-speaking mailing list, as we
> have many users from all over the world and the mailing list is
> intended to help everyone. But as most of our team members are German
> (well, and Bavarians...) we of course understand it. Hence, I answer
> in English (for all others: Lucian seems to have performance issues
> when adding many documents).
>
>
> First of all, are you sure your tests sufficiently test the add
> performance? Looking at your file TestBaseXClient.java, it seems not to
> record the runtimes of the individual insertions, but just the overall
> runtime of, in this case, 10 insertions.
>
> Also, at least in the example you provided, you do some other
> stuff (especially printing to sysout), which obviously also has a
> performance impact.
>
>
> Optimizing or creating indexes in between a mass update should not
> increase the speed, as it builds the indexes, which will be
> invalidated after the next update, so I would not expect any speedup here.
>
>
> What version of BaseX did you use?
>
>
> Did you set AUTOFLUSH (see
> http://docs.basex.org/wiki/Options#AUTOFLUSH) to false? This should
> benefit performance.
>
>
> In general it is also a good architectural approach to split up
> documents into many databases instead of having one large database.
> Given that you can access as many databases as you want in one query,
> you will not lose any query capabilities, and at some point you might
> encounter certain limits.

Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each.

2017-01-10 Thread Bularca, Lucian
Hi Dirk,

 thanks for your fast reply :)

Regarding the performance measurement, I forgot to mention that I based my 
statements on the protocol entries in the BaseX log file (see the attached 
basex.log). The intention of the System.out call made in each iteration is just 
to log the sequence number of the added XML structure, not the duration of a 
persist operation. This System.out call does indeed have an impact on the 
overall performance, but it cannot explain the monotonic increase of the insert 
operation durations (see the attached basex.log file). After 24 hours of 
inserting XML test structures, only half of the 100.000 XML test structures had 
been added to the database, at a rate of at most 1 structure per 2 seconds.

All these tests were made against version 8.5.3 of the BaseX database.

In production, we expect peaks of 2,7 * 10^5 XML structures to persist per 24 
hours (~ 31 XML structures per second). Do you mean by "However, I would 
expect 100,000 documents added to be much of a problem." that persisting 
100.000 XML structures in the BaseX database is problematic?


Regards,
Lucian
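
[Editorial sketch] Dirk's point about recording individual insertion runtimes, rather than only the overall runtime, can be sketched as a small harness; the insert operation itself is left as a placeholder, and nothing is printed inside the timed loop:

```java
import java.util.ArrayList;
import java.util.List;

public class InsertTimer {
    /** Runs op n times and returns each duration in milliseconds.
     *  Durations are only recorded, never printed, inside the loop,
     *  so measurement overhead stays out of the numbers. */
    static List<Double> timeEach(int n, Runnable op) {
        List<Double> durations = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            long start = System.nanoTime();
            op.run();   // placeholder for one document insert, e.g. session.add(...)
            durations.add((System.nanoTime() - start) / 1_000_000.0);
        }
        return durations;
    }
}
```

Comparing the first and last entries of the returned list (or plotting the whole series) is what makes a monotonic increase like the one reported here visible, instead of it being hidden inside a single total.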

From: basex-talk-boun...@mailman.uni-konstanz.de 
[basex-talk-boun...@mailman.uni-konstanz.de] on behalf of Dirk Kirsten 
[d...@basex.org]
Sent: Tuesday, 10 January 2017 12:52
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting 
more than 5000 XML data structures of 160 KB each.


Hello Lucian,


please be aware that this is an English-speaking mailing list, as we have many 
users from all over the world and the mailing list is intended to help 
everyone. But as most of our team members are German (well, and Bavarians...) 
we of course understand it. Hence, I answer in English (for all others: Lucian 
seems to have performance issues when adding many documents).


First of all, are you sure your tests sufficiently test the add performance? 
Looking at your file TestBaseXClient.java, it seems not to record the runtimes 
of the individual insertions, but just the overall runtime of, in this case, 
10 insertions.

Also, at least in the example you provided, you do some other stuff 
(especially printing to sysout), which obviously also has a performance impact.


Optimizing or creating indexes in between a mass update should not increase 
the speed, as it builds the indexes, which will be invalidated after the next 
update, so I would not expect any speedup here.


What version of BaseX did you use?


Did you set AUTOFLUSH (see http://docs.basex.org/wiki/Options#AUTOFLUSH) to 
false? This should benefit performance.
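With AUTOFLUSH set to false, added documents are only written to disk on an explicit FLUSH command or when the database is closed, which saves one disk write per addition; the trade-off is that unflushed data is lost if the server goes down before the FLUSH. A minimal command sequence (database and document names are placeholders):

```
SET AUTOFLUSH false
OPEN mydb
ADD doc1.xml
ADD doc2.xml
FLUSH
CLOSE
```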


In general it is also a good architectural approach to split documents up 
across many databases instead of keeping one large database. Given that you can 
access as many databases as you want in one query, you will not lose any query 
capabilities, whereas with a single large database you might at some point run 
into certain limits. However, I would expect 100,000 documents added to be much 
of a problem.


As a side note, since it seems you are evaluating BaseX, and I guess you are 
doing this for a reason, it might be faster/easier to talk to our BaseX team 
members, who can of course help you evaluate your problem and identify whether 
BaseX is the right choice for it. Take a look at http://basexgmbh.de/ for our 
commercial offerings.


Cheers

Dirk

On 01/10/2017 05:44 PM, Bularca, Lucian wrote:
Good day,

As part of a performance evaluation of persisting XML data structures in a 
BaseX database, we have observed steadily declining persistence rates, 
inversely proportional to the size of the database.

This behavior would be explainable, and even acceptable, if the duration of 
persisting a single ~160 KB XML data structure did not climb from ~10 ms at the 
start to ~2,500 ms after ~50,000 persist operations.

In our test, we store 100,000 distinct XML data structures of roughly 160 KB 
each in a BaseX database via the Java API and measure both the total duration 
and the duration of each individual persist operation. The BaseX database was 
started in HTTP mode (basexhttp) with -Xmx 4048m.

The measurements above stayed the same regardless of whether all XML data 
structures were stored within a single session or whether the socket (database 
connection) was closed and reopened after every 500 persist operations. 
Indexing the database in between (via "Optimize All" or "Create Text Index" in 
the GUI) did not affect or improve the persistence rates.

An example of the test classes we used (illustrative only, not compilable!) is 
attached to this e-mail as BaseXClient.java.zip.

Are persistence rates of more than 160 KB / 2,500 ms to be expected in general 
once there are more than 30,000 entries in the BaseX database, or can we 
improve these persistence times drastically (and if so, how)?

Best regards,
Lucian Bularca

-- 
Dirk Kirsten, BaseX GmbH, http://basexgmbh.de
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 91 68 276, Fax: 0049 7531 20 05 22