[basex-talk] Search algorithm used by BaseX

2016-05-27 Thread Bram Vanroy | KU Leuven
Good afternoon all,

It's me again

 

While writing my paper, I was wondering how BaseX's (and/or XQuery's) search
algorithm actually works. I imagine each XML structure is searched through one
by one, but what technique is used in this search? I'm looking for some
terminology such as A*, IDA*, D*, depth-first, breadth-first, top-down,
bottom-up, etc., but I cannot find anything in your documentation. Can you
enlighten me on the subject? And is the algorithm specified by BaseX, or is it
implied by XQuery?

 

 

Kind regards

 

Bram Vanroy



[basex-talk] Benchmarking and caching in BaseX

2016-02-15 Thread Bram Vanroy | KU Leuven
Dear all

My name is Bram Vanroy, and I am an intern at the Centre for Computational
Linguistics (CCL; http://www.arts.kuleuven.be/ling/ccl [Dutch]) at the
University of Leuven. My supervisor, Vincent Vandeghinste, has had contact
with this mailing list some time ago, more specifically with Dirk Kirsten.
My internship is titled "Fine-tuning the GrETEL Treebank Query Engine".
GrETEL stands for Greedy Extraction of Trees for Empirical Linguistics;
available at http://gretel.ccl.kuleuven.be/gretel-2.0/. Its goal is to
provide users with a fast, user-friendly on-line tool to search through text
corpora backed by treebanks. Accessibility is an important point for us:
users do not need to be proficient in any programming language, strict
formalism, or treebank-specific annotation scheme; every query can be executed
via an intuitive graphical interface. More advanced users can use XPath to
write the representation of the syntactic structure that they are looking
for. BaseX is our tool of choice as a database for our corpora in XML
format.

Initially, GrETEL provided access to smaller corpora such as CGN (9 million
words) and Lassy Small (1 million words). We would like to expand the
searchable corpora by also making the full SoNaR corpus available (500
million words). This is already partially possible in GrETEL 2.0, but for
efficiency reasons its capabilities are restricted: users can only search in
one component at a time, and the largest component of the corpus is not
available due to its size (15 million sentences). We applied these
restrictions because the search time for the whole corpus was too long, which
would drastically decrease the user-friendliness of the tool.

Steps have already been taken to improve search times in larger corpora.
(See "Making a Large Treebank Searchable Online. The SoNaR Case." by Vincent
Vandeghinste and Liesbeth Augustinus;
http://nederbooms.ccl.kuleuven.be/documentation/LREC2014-GrETELSoNaR.pdf.)
To spare you the effort of going through the whole article, I quote the
passage most relevant to this email:

 

The general idea behind our approach is to restrict the search space by
splitting up the data in many small databases, allowing for faster retrieval
of syntactic structures. We organise the data in databases that contain all
bottom-up subtrees for which the two top levels (i.e. the root and its
children) adhere to the same syntactic pattern. When querying the database
for certain syntactic constructions, we know on which databases we have to
apply the XPath query which would otherwise have to be applied on the whole
data set. We have called this method GrETEL Indexing (GrInd). (p. 17)
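
To make the routing idea concrete, here is a minimal sketch in Java (using the
BaseXClient example class from the BaseX documentation; the pattern list and
the "sonar_<pattern>" database naming scheme are hypothetical stand-ins for the
actual GrInd index):

import java.io.IOException;
import java.util.List;

public final class GrindSketch {
  public static void main(String... args) throws IOException {
    // Patterns derived (elsewhere) from the query's root node and its
    // children; the naming scheme below is made up for this sketch.
    List<String> patterns = List.of("smain_np-vp", "smain_np-np-vp");
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    try {
      for(String pattern : patterns) {
        // Only the small databases matching the pattern are opened and
        // queried, instead of applying the XPath query to the whole data set.
        session.execute("OPEN sonar_" + pattern);
        BaseXClient.Query query =
          session.query("//node[@cat='smain']/node[@cat='np']");
        while(query.more()) System.out.println(query.next());
        query.close();
      }
    } finally {
      session.close();
    }
  }
}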

 

So to optimise searching, the data has been pulled apart - in a sense - which
makes the search space smaller and, subsequently, the search time shorter. In
the future we would like to apply this technique to parallel corpora as well.
We have not yet tested what influence this change has on query time, which is
what I am going to find out during my internship. I have already analysed the
XPath queries that users have issued since GrETEL saw its first user, and
found that the queries are at most ten embedded levels deep, though most are
between one and five. The number of nodes per query varies between one and
24, but most searches are for structures that contain between one and eight
nodes. Based on this information, I am writing example XPath queries that I
will run through BaseX as a benchmark. I can then compare the query speeds
between the split-up corpus and the regular one. The problem that I have
encountered is that BaseX seems to cache very efficiently. Obviously this is
not a problem on production websites, but for benchmarking it is not ideal.
My first question to you, then, is: is it possible to disable caching when
testing queries locally? And how exactly does BaseX handle caching? More
specifically, if I enter a query: what is cached, and for how long? This
information may be useful for analysing our logs.
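
For the benchmark harness itself, I have something like the following in mind
(a sketch in Java with the BaseXClient example class from the BaseX
documentation; the database name and the query are placeholders). Running the
same query several times should at least make any caching effect visible:

import java.io.IOException;

public final class BenchmarkSketch {
  public static void main(String... args) throws IOException {
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    try {
      session.execute("OPEN lassy_small");  // placeholder database name
      String xpath = "//node[@cat='smain']/node[@rel='su' and @cat='np']";
      // The first (cold) run includes disk access; later (warm) runs
      // profit from whatever is cached.
      for(int run = 1; run <= 5; run++) {
        long start = System.nanoTime();
        BaseXClient.Query query = session.query("count(" + xpath + ")");
        String hits = query.execute();
        query.close();
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("run " + run + ": " + hits + " hits, " + ms + " ms");
      }
    } finally {
      session.close();
    }
  }
}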

 

If you have any feedback on GrETEL or the new approach of GrInding, or if
you have any ideas to improve search time for large corpora, I would love
to hear from you; you can contact me via this email address or on LinkedIn.
I reply to every email as extensively as possible.

 

 

Thank you in advance,

Kind regards

 

Bram Vanroy
https://be.linkedin.com/in/bramvanroy



Re: [basex-talk] Benchmarking and caching in BaseX

2016-02-21 Thread Bram Vanroy | KU Leuven
I was going to write a more extensive email, but then I saw that many of you 
are also active on StackOverflow. Therefore I have moved my question there. I 
hope to see you there!

http://stackoverflow.com/questions/35536286/benchmarking-in-basex


Kind regards

-Oorspronkelijk bericht-
Van: Christian Grün [mailto:christian.gr...@gmail.com] 
Verzonden: dinsdag 16 februari 2016 11:28
Aan: Bram Vanroy 
CC: BaseX 
Onderwerp: Re: [basex-talk] Benchmarking and caching in BaseX

Hi Bram,

> I did read on your website that it is possible to communicate with BaseX from 
> Java. Is there any documentation or guidelines on this?

We have put quite some time into our documentation, so I hope that the
existing articles will give you some initial help (see e.g.
[1,2]).

> I am knowledgeable in Java, so I assume I should be able to conjure up a
> benchmark script in Java. The only thing that I don't know is how to contact 
> the database and insert a query.

As you will see in the examples ("QueryExample.java" and others), there is no 
need to insert queries in a database. Instead, you can directly send your query 
strings to the BaseX server and retrieve the results.
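
The core of such a benchmark client is small. A sketch along the lines of
QueryExample.java, assuming the BaseXClient example class from [1] is on the
classpath and a server is listening on the default port 1984:

import java.io.IOException;

public final class QuerySketch {
  public static void main(String... args) throws IOException {
    // Connect to a running BaseX server.
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    try {
      // The query string is sent directly to the server; nothing needs to
      // be inserted into a database first.
      BaseXClient.Query query = session.query("for $i in 1 to 3 return $i * 2");
      while(query.more()) System.out.println(query.next());
      query.close();
    } finally {
      session.close();
    }
  }
}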

Cheers,
Christian

[1] http://docs.basex.org/wiki/Java_Examples
[2] http://docs.basex.org/wiki/Clients



Re: [basex-talk] Querying BaseX from a web interface

2016-05-22 Thread Bram Vanroy | KU Leuven
Hi there Christian

Thank you for the extensive answer! 

In the meantime, I have solved the issue that caused simultaneous queries not
to fire asynchronously. The problem was a locked PHP $_SESSION variable, not
related to BaseX. My bad!

I am going to test the caching difference again soon, on a larger subcorpus.
I'm curious to find out the results! I will incorporate the results in my
thesis. If you're interested, I can definitely share them with you once I have
finished writing them down!

Finally, I'd like to thank you for providing these answers. I never expected
such good feedback and responsiveness. It really means a lot! Thank you!


Kind regards

Bram

-Original Message-
From: Christian Grün [mailto:christian.gr...@gmail.com]
Sent: Sunday 22 May 2016 13:29
To: Bram Vanroy | KU Leuven <bram.vanr...@student.kuleuven.be>
CC: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Querying BaseX from a web interface

Hi Bram,

Thanks for your reply. It’s long indeed, so sorry in advance if I didn’t 
capture all relevant info…

> The approach explained above also implies that we had to create a lot of 
> BaseX databases. A lot. Around 10 million of them.

Impressive :)


> •   Would a query that returns a single result really be faster
> than one that returns 10 results?
> Yes. In a search space of 500 million tokens, you can imagine that a rare 
> pattern may take a lot of time to query – even in the GrInded version.

I see. So I assume there won't be many chances to speed up this scenario by
working on index structures, as most time is spent sequentially browsing
all the databases, right?

> •   Do you sort your search results? If yes, returning 100 
> results instead of 10 should not be much slower.
> As I am not entirely sure what you mean by that, I don’t think we do. By 
> sorting, do you mean the XQuery order by function?

Exactly. I also assume it shouldn’t play a role in your scenario.


> wouldn’t that mean that BaseX’ cache is cleared more often? I could imagine 
> that the garbage collector passes by after a query, or at least a session, is 
> closed? Have you any idea how this is possible?

Phew, a difficult one… I would need to spend some real time with your framework 
to give a solid answer.

> My two questions are: is count() actually faster than getting all results?

Yes, it will always be faster; but "faster" can mean 1% or 1000%… It will be
much faster if the database statistics can be utilized to answer your query
(which is probably not the case in your scenario), or if retrieving the data
and/or returning it via the network consumes most of the time. If you only
count nodes, there is no need to retrieve from disk all the database contents
(node properties, textual data) that would otherwise be returned in the XML
representation.

> Or does count() get all the hits any way, and should I count and get all 
> results in one step?

As you already indicated that the last result may occur much later than the 
first result in your database(s), I assume you won’t win that much. But for 
testing, you can wrap your query with count() to see what would be the minimum 
time to find all hits.
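
A sketch of both variants (again assuming the BaseXClient example class;
database name and XPath are placeholders):

import java.io.IOException;

public final class CountSketch {
  public static void main(String... args) throws IOException {
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    try {
      session.execute("OPEN mydb");         // placeholder database
      String xpath = "//node[@cat='np']";   // placeholder query

      // Counting only: no node contents are serialized or transferred.
      BaseXClient.Query count = session.query("count(" + xpath + ")");
      System.out.println("hits: " + count.execute());
      count.close();

      // Full retrieval: every hit is serialized and sent over the network.
      BaseXClient.Query full = session.query(xpath);
      while(full.more()) full.next();
      full.close();
    } finally {
      session.close();
    }
  }
}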

> Secondly, it seems that when the last step is initialised, the other 
> processes hang – leaving the user without any feedback. The processes 
> literally seem to stop running. My question then is: does this happen because 
> BaseX does not handle different sessions asynchronously, and new queries 
> block others?

By default, 8 queries can be run in parallel [1]. If your other queries are
delayed a lot, it may be that the random disk access patterns caused by
parallel queries outweigh the advantage of allowing parallel requests. But I
would also assume that it's worth checking your PHP environment first.

> Finally, I simply want to ask what the best flow is for opening and closing 
> BaseX sessions, and when one should open a new session.

With the light-weight PHP client, it’s usually best to open a new session for 
each request, and close it directly after your command or query has been 
evaluated. As usual, you should ensure that every session will be closed, even 
if an error occurs.
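
In Java, the same pattern looks like this (a sketch; the PHP client's Session
API is analogous, and BaseXClient is again the example class from the
documentation):

import java.io.IOException;

public final class PerRequestSketch {
  // One session per request: open, evaluate, and always close.
  static String run(String queryString) throws IOException {
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    try {
      BaseXClient.Query query = session.query(queryString);
      String result = query.execute();
      query.close();
      return result;
    } finally {
      session.close();  // closed even if the query raises an error
    }
  }
}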

Hope this helps,
Christian

[1] http://docs.basex.org/wiki/Options#PARALLEL



Re: [basex-talk] Creating db in restxq interface

2016-05-21 Thread Bram Vanroy | KU Leuven
Just some general comments on the HTML that you are trying to generate:

- You are using XHTML, and XHTML is quite strict.
- You forgot a head element.
- The time element is HTML5 and won't "work" as such under any other
(X)HTML declaration. Additionally, it is forbidden to have such content
outside the body tag.
- You also have a list item (li) outside a ul. This is not valid either.
- Note that you wrote factbook instead of facebook, if that was what you
were trying to do.
 

Edited code below. Hope it helps!

 

declare
  %rest:path("/start")
  %updating
  %output:method("xhtml")
  %output:omit-xml-declaration("no")
  %output:doctype-public("-//W3C//DTD XHTML 1.0 Transitional//EN")
  %output:doctype-system("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")
function page:hello()
  as element(Q{http://www.w3.org/1999/xhtml}html)
{
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Good day Sir!</title>
    </head>
    <body>
      <p>The current time is: { current-time() }</p>
      <ul>
        <li>Home</li>
        <li>Link 1</li>
        <li>Link 2</li>
        <li>Link 3</li>
        {
          for $result in db:open('factbook')//continent/@name
          return <li>{ data($result) }</li>
        }
      </ul>
      { db:create("test") }
    </body>
  </html>
};

 

 

From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Henning Phan
Sent: Saturday 21 May 2016 19:03
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Creating db in restxq interface

 

Hi,

 

My question is: how do you create a new database in RESTXQ?

When trying to create a db I get the error message:

HTTP Error 400
[XUST0001] element constructor: no updating expression allowed.

After some digging around I learned that I might need the "updating"
annotation, but even after that change I get the same error.
My file looks like this now:

=== File start 

 

declare
  %rest:path("/start")
  %updating
  %output:method("xhtml")
  %output:omit-xml-declaration("no")
  %output:doctype-public("-//W3C//DTD XHTML 1.0 Transitional//EN")
  %output:doctype-system("http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd")
function page:hello()
  as element(Q{http://www.w3.org/1999/xhtml}html)
{
  <html xmlns="http://www.w3.org/1999/xhtml">
    <title>Good day Sir!</title>
    <p>The current time is: <time>{ current-time() }</time></p>
    <ul>
      <li>Home</li>
      <li>Link 1</li>
      <li>Link 2</li>
      <li>Link 3</li>
    </ul>
    {
      for $result in db:open('factbook')//continent/@name
      return <li>{ data($result) }</li>
    }
    {
      db:create("test")
    }
  </html>
};
 
=== File End  


[basex-talk] Off topic: on leave

2016-07-13 Thread Bram Vanroy | KU Leuven
Hello everyone

 

I will be leaving on vacation soon, from the 13th until the 25th of July. I
have set an automated reply for my email account, but I'm not sure if it is
smart enough to detect mailing lists. So if the programme is not smart enough
and you keep getting automated replies, feel free to manually unsubscribe me. I
haven't done that myself, because I'd like to read the list when I get back.

 

 

Good bye!



Re: [basex-talk] Timeout 30 seconds exceeded

2016-07-19 Thread Bram Vanroy | KU Leuven
Reading through the list from the hot shores of Italy!

This *is* a PHP error. PHP has an execution time limit, most probably defined
in an ini file. An "easy way out" is setting the limit for the current script
separately. However, I advise you to do this only for testing purposes: it is
not the best workaround in production environments, because (possibly
malicious) scripts are then given all the time they need.

At the top of your PHP script, put this line:

set_time_limit(0);


-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Maximilian Gärber
Sent: Tuesday 19 July 2016 8:45
To: Mohamed kharrat
CC: ba...@inf.uni-konstanz.de; BaseX Talk
Subject: Re: [basex-talk] Timeout 30 seconds exceeded

Hi,

if you are connecting as the admin user, there is no timeout in BaseX;
nor is there one if you are sending an updating query.

Maybe this is a network/php timeout?

Regards,
Max

2016-07-19 0:57 GMT+02:00 Mohamed kharrat :
> Hi,
> I have got the error
>
> Fatal error: Maximum execution time of 30 seconds exceeded in
> E:\EasyPHP-DevServer-14.1VC9\data\...\BaseXClient.php on line 99
>
> I'm working on Windows.
> I have changed the 30 to 300 in the .basex configuration file but I
> still get the same error of 30 seconds.
> Is there any other option I have to set?
>
> Thank you



Re: [basex-talk] Starting multiple sessions in BaseX with Perl: compilation failed at Carp.pm

2016-06-30 Thread Bram Vanroy | KU Leuven
So I ran strace and this is the result

$ strace perl gendervariatie-multiple.pl 2>&1 | grep -i carp
stat("/usr/local/lib64/perl5/Carp.pmc", 0x7fff387d8190) = -1 ENOENT (No such 
file or directory)
stat("/usr/local/lib64/perl5/Carp.pm", 0x7fff387d80e0) = -1 ENOENT (No such 
file or directory)
stat("/usr/local/share/perl5/Carp.pmc", 0x7fff387d8190) = -1 ENOENT (No such 
file or directory)
stat("/usr/local/share/perl5/Carp.pm", 0x7fff387d80e0) = -1 ENOENT (No such 
file or directory)
stat("/usr/lib64/perl5/vendor_perl/Carp.pmc", 0x7fff387d8190) = -1 ENOENT (No 
such file or directory)
stat("/usr/lib64/perl5/vendor_perl/Carp.pm", 0x7fff387d80e0) = -1 ENOENT (No 
such file or directory)
stat("/usr/share/perl5/vendor_perl/Carp.pmc", 0x7fff387d8190) = -1 ENOENT (No 
such file or directory)
stat("/usr/share/perl5/vendor_perl/Carp.pm", 0x7fff387d80e0) = -1 ENOENT (No 
such file or directory)
stat("/usr/lib64/perl5/Carp.pmc", 0x7fff387d8190) = -1 ENOENT (No such file or 
directory)
stat("/usr/lib64/perl5/Carp.pm", 0x7fff387d80e0) = -1 ENOENT (No such file or 
directory)
stat("/usr/share/perl5/Carp.pmc", 0x7fff387d8190) = -1 ENOENT (No such file or 
directory)
stat("/usr/share/perl5/Carp.pm", {st_mode=S_IFREG|0644, st_size=7611, ...}) = 0
open("/usr/share/perl5/Carp.pm", O_RDONLY) = 4
read(4, "package Carp;\n\nour $VERSION = '1"..., 4096) = 4096

A lot of 'no such file or directory' entries, but I assume that is normal and
that Perl just keeps looking until it finds the file? Maybe you can see
something that's wrong here?

Thanks again!

Bram

-Original Message-
From: Liam R. E. Quin [mailto:l...@w3.org]
Sent: Wednesday 29 June 2016 23:26
To: Bram Vanroy | KU Leuven <bram.vanr...@student.kuleuven.be>; 'BaseX'
<basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Starting multiple sessions in BaseX with Perl:
compilation failed at Carp.pm

On Wed, 2016-06-29 at 21:47 +0200, Bram Vanroy | KU Leuven wrote:
> When I try to get the version manually from the command line, this 
> works fine.
> 
> perl -le 'use Carp; print $Carp::VERSION;'
> # returns 1.11

OK

> I'm at a loss. I have no idea at all why the program would crash on such
> a line!
Is there by any chance more than one version of Perl on your system?
You could try using strace to see which files were accessed - strace 
./your-script 2>&1 | grep -i carp might be useful.

> You say you didn't manage to start multiple BaseX sessions either.
> Can I ask what the cause of this was? Did BaseX's Perl API throw an
> error, and if so which one?

I didn't get an error but my script fails (or throws an exception) the second 
time I try to use a BaseX session; I can open several but only the first one 
used with $s->query() actually works. It might be that
$query->more() thinks it's got past the end of the results or something.

I didn't try for long though.

Liam

--
Liam R. E. Quin <l...@w3.org>
The World Wide Web Consortium (W3C)



[basex-talk] Starting multiple sessions in BaseX with Perl: compilation failed at Carp.pm

2016-06-29 Thread Bram Vanroy | KU Leuven
Hi there BaseX people!

 

Original post on StackOverflow:
http://stackoverflow.com/questions/38086358/starting-multiple-sessions-in-basex-with-perl-compilation-failed-at-carp-pm

 

I wanted to see if I could start multiple BaseX sessions (on a single
server) to try some things out, but I can't even get the sessions to launch
- heck, I can't even get the Perl script to compile! This is odd to me,
because the same code works perfectly without multiple sessions.

 

Let's say I have a subroutine that creates a given number of sessions. E.g.
the following code would create 3 sessions (variables for dbhost, user and
password left out):

 

use BaseXClient;  # BaseX Perl client; defines the Session package

CreateSessions(3);

# For testing
my $session = $sessions[0];

sub CreateSessions {
  my ($amountofsessions) = @_;
  our @sessions = ();
  for (my $i = 0; $i < $amountofsessions; $i++) {
    my $session = Session->new($dbhost, 1950, $usr, $pw);
    push(@sessions, $session);
  }
}

 

I also tried other approaches, e.g. creating each session individually and
assigning it to a variable, and then building an array with these variables.
The same problem occurs:

 

Compilation failed in require at /usr/share/perl5/Carp.pm line 33.

 

When I look up that file and that line, it states:

 

eval { require Carp::Heavy };

 

 

Which, I guess, means that I don't have the Carp module installed? But I do
not understand why I would need it, or why the Perl script works when I
create only one session.

 

Is it possible to launch multiple sessions from Perl that can work
simultaneously, and if so, how do you launch and access them? It's important
to note that in PHP, for another project on the same database, having
multiple sessions does work. In other words, there I can have multiple
sessions that run in parallel.

 

We're on BaseX 7.9 and Perl 5.10.1. (Old, I know, but I can't change it.)

 

 

Thank you for your time!

 

Bram



[basex-talk] Creating more than a million databases per session: Out Of Memory

2016-10-15 Thread Bram Vanroy | KU Leuven
Hi all

 

I've talked before about how we restructured our data to drastically improve
search times on a 500 million token corpus. [1] Now, after some minor
improvements, I am trying to import the generated XML files into BaseX. The
result would be 100,000s to millions of BaseX databases, as we expect. When
doing the import, though, I am running into OOM errors. We set our memory
limit to 512 MB. This seems incredibly odd to me: because we are creating so
many different databases, which are consequently all really small, I would not
expect BaseX to need to store much in memory. After each database is created,
the garbage collector can come along and remove everything that was needed for
the previously generated database.

 

A solution, I suppose, would be to close and open the BaseX session on each
creation, but I'm afraid that (on such a huge scale) the impact on speed
would be too large. How it is set up now, in pseudo code:

 




 

$session = Session->new($host, $port, $user, $pw);

# @allFiles is at least 100,000 items
for my $file (@allFiles) {
  $database_name = $file . "name";
  $session->execute("CREATE DB $database_name $file");
  $session->execute("CLOSE");
}

$session->close();

 




 

So all databases are created on the same session, which I believe causes the
issue. But why? What is still required in memory after ->execute("CLOSE")?
Are the indices for the generated databases stored in memory? If so, can we
force them to be written to disk?

 

ANY thoughts on this are appreciated. Enlightenment on what exactly is stored
in a Session's memory would be useful as well. Increasing the memory should be
a last resort.

 

 

Thank you in advance!

 

Bram

 

 

[1]:
http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf#page=20

 



Re: [basex-talk] Creating more than a million databases per session: Out Of Memory

2016-10-17 Thread Bram Vanroy | KU Leuven
Hi all

I have now implemented this idea, closing and opening a new session every
1000 imports. We'll see how it goes. But my question remains: what information
is kept in memory after a database connection is closed?
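
Sketched in Java for concreteness (our import script actually uses the Perl
client; BaseXClient is the example class from the BaseX documentation, and
the names are made up):

import java.io.IOException;
import java.util.List;

public final class RecycleSketch {
  public static void main(String... args) throws IOException {
    List<String> files = List.of();  // at least 100,000 input paths in practice
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    int created = 0;
    for(String file : files) {
      session.execute("CREATE DB db" + created + " " + file);
      session.execute("CLOSE");
      // Recycle the session every 1000 databases to release memory.
      if(++created % 1000 == 0) {
        session.close();
        session = new BaseXClient("localhost", 1984, "admin", "admin");
      }
    }
    session.close();
  }
}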

Also, the memory limit that has been set for a server only applies to *that*
BaseX server, right, and not to all BaseX servers running on a single machine?
If I am running 6 servers on different ports on a single machine, does a set
memory limit of, say, 512 MB mean that each instance is allocated 512 MB, or
that 512 MB is distributed among all BaseX instances?


Kind regards

Bram

-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Christian Grün
Sent: Sunday 16 October 2016 10:22
To: Marco Lettere <m.lett...@gmail.com>
CC: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Creating more than a million databases per session:
Out Of Memory

Hi Bram,

I second Marco's advice to find a good compromise between single databases
and single documents.

Regarding the OOM, the stack trace could possibly be helpful for judging what 
might go wrong in your setup.

Cheers
Christian


On Sat, Oct 15, 2016 at 4:19 PM, Marco Lettere <m.lett...@gmail.com> wrote:
> Hi Bram,
> not being much into the issue of creating databases at this scale, I'm
> not sure whether the OOM problems you are facing are actually related
> to BaseX or to the JVM.
> Anyway, something rather simple you could try is to behave "in between":
> instead of opening a single session for all the create statements
> altogether, or one session for each and every create, you could split
> your create statements into chunks of 100/1000 or the like and
> distribute them over subsequent (or maybe even parallel?) sessions.
> I'm not sure whether this is applicable to your use case though.
> Regards,
> Marco.
>
>
> On 15/10/2016 10:48, Bram Vanroy | KU Leuven wrote:
> [...]



Re: [basex-talk] Severe performance degradation when persisting more than 5000 XML data structures of 160 KB each

2017-01-14 Thread Bram Vanroy | KU Leuven
Possibly related, but I'm not sure:

When creating millions of databases in a loop in the same session, I found that
after some thousands I'd get an OOM error from BaseX. This seemed odd to me,
because after each iteration the database creation query was closed (and I'd
expect GC to run at such a time?). To bypass this, I simply closed the session
and opened a new one every couple of thousand iterations of the loop.

Maybe there is a (small) memory leak somewhere in BaseX that only becomes
noticeable (and annoying) after hundreds of thousands or even millions of
queries?

-Original Message-
From: basex-talk-boun...@mailman.uni-konstanz.de
[mailto:basex-talk-boun...@mailman.uni-konstanz.de] On Behalf Of Christian Grün
Sent: Saturday 14 January 2017 12:09
To: Bularca, Lucian
CC: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Severe performance degradation when persisting more
than 5000 XML data structures of 160 KB each

Hi Lucian,

I have a hard time reproducing the reported behavior. The attached, revised
Java example (without AUTOFLUSH) required around 30 ms for the first documents
and 120 ms for the last documents, which is still pretty far from what you've
been encountering:

> would go from ~ 10 ms at the beginning up to ~ 2500 ms

But obviously something weird has been going on in your setup. Let’s see what 
alternatives we have…

• Could you possibly try to update my example code such that it shows the
reported behavior? Ideally with small input, in order to speed up the process.
Maybe the runtime increase can also be demonstrated after 1,000 or 10,000
documents...
• You could also send me a list of the files of your test_database directory; 
maybe the file sizes indicate some unusual patterns.
• You could start BaseXServer with the JVM flag -Xrunhprof:cpu=samples (to be 
inserted in the basexserver script), start the server, run your script, stop 
the server directly afterwards, and send me the result file, which will be 
stored in the directory from where you started BaseX (java.hprof.txt).
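
For reference, the workaround Lucian describes in his mail below (AUTOFLUSH
off during the bulk insert, one explicit FLUSH at the end) boils down to the
following pattern. A sketch with the BaseXClient example class; database name
and content are made up:

import java.io.IOException;

public final class FlushSketch {
  public static void main(String... args) throws IOException {
    BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
    try {
      // Disable automatic flushing before the bulk insert...
      session.execute("SET AUTOFLUSH false");
      session.execute("CREATE DB test_database");
      for(int i = 0; i < 50000; i++) {
        session.execute("ADD TO doc" + i + ".xml <doc>example content</doc>");
      }
      // ...and write the database buffers to disk once at the end.
      session.execute("FLUSH");
    } finally {
      session.close();
    }
  }
}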

Best,
Christian


On Wed, Jan 11, 2017 at 4:57 PM, Christian Grün  
wrote:
> Hi Lucian,
>
> Thanks for your analysis. Indeed I’m wondering about the monotonic 
> delay caused by auto flushing the data; this hasn’t always been the 
> case. I’m wondering even more why no one else noticed this in recent 
> time.. Maybe it’s not too long ago that this was introduced. It may 
> take some time to find the culprit, but I’ll keep you updated.
>
> All the best,
> Christian
>
>
> On Wed, Jan 11, 2017 at 2:46 PM, Bularca, Lucian 
>  wrote:
>> Hi Christian,
>>
>> I've made a comparison of the persistence time series running your example
>> code and mine, in all possible combinations of the following scenarios:
>> - with and without "set intparse on"
>> - using my prepared test data and your test data
>> - closing and opening the DB connection on each "n"-th insertion
>> operation (where n in {5, 100, 500, 1000})
>> - with and without "set autoflush on".
>>
>> I finally found out that the only relevant variable that influences the
>> insert operation duration is the value of the AUTOFLUSH option.
>>
>> If AUTOFLUSH = OFF when opening a database, the persistence durations
>> remain relatively constant (on my machine about 43 ms) during the entire
>> sequence of insert operations (50,000 or 100,000 times), for all possible
>> combinations named above.
>>
>> If AUTOFLUSH = ON when opening a database, the persistence durations
>> increase monotonically, for all possible combinations named above.
>>
>> The persistence duration, if AUTOFLUSH = ON, is directly proportional to the
>> number of DB clients executing these insert operations, and to the length of
>> the sequence of insert operations executed by a DB client.
>>
>> In my opinion, this behaviour is an issue in BaseX, because AUTOFLUSH is
>> implicitly set to ON (see the BaseX documentation,
>> http://docs.basex.org/wiki/Options#AUTOFLUSH), so DB clients must explicitly
>> set AUTOFLUSH = OFF in order to keep the insert operation durations
>> relatively constant over time. Additionally, not explicitly flushing data
>> increases the risk of data loss (see the BaseX documentation,
>> http://docs.basex.org/wiki/Options#AUTOFLUSH), but clients who repeatedly
>> execute the FLUSH command increase the durations of the subsequent insert
>> operations.
>>
>> Regards,
>> Lucian
>>
>> 
>> From: Christian Grün [christian.gr...@gmail.com]
>> Sent: Tuesday, 10 January 2017 17:33
>> To: Bularca, Lucian
>> Cc: Dirk Kirsten; basex-talk@mailman.uni-konstanz.de
>> Subject: Re: [basex-talk] Severe performance degradation when persisting
>> more than 5000 XML data structures of 160 KB each
>>
>> Hi Lucian,
>>
>> I couldn’t run your code example out of the box. 24 hours sounds 
>> pretty alarming, though, so I have written