Re: [basex-talk] Creating more than a million databases per session: Out Of Memory

2016-10-17 Thread Christian Grün
> I currently implemented this idea, which closes and opens a new session every 
> 1,000 imports. We'll see how it goes. But my question remains: what 
> information is kept in memory after a database connection is closed?

Usually nothing ;) I just created a few thousand databases in a loop,
and memory consumption was constant. But I guess we’re doing slightly
different things?



>
> Also, the memory limit that is set for different servers only applies to 
> *that* BaseX server, right, and not to all BaseX servers running on a single 
> machine? If I am running 6 servers on different ports on a single machine, 
> does a memory limit of, say, 512MB mean that each instance is allocated 
> 512MB, or that 512MB is distributed among all BaseX instances?
>
>
> Kind regards
>
> Bram
>


Re: [basex-talk] Creating more than a million databases per session: Out Of Memory

2016-10-17 Thread Bram Vanroy | KU Leuven
Hi all

I currently implemented this idea, which closes and opens a new session every 
1,000 imports. We'll see how it goes. But my question remains: what information 
is kept in memory after a database connection is closed?
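
For reference, here is a minimal sketch of that chunked approach in Perl, written 
against the same Session API as the pseudocode quoted further down in the thread 
(Session->new / execute / close from the BaseX Perl client, BaseXClient.pm). The 
connection details and the use of $file as the CREATE DB input are placeholders, 
and the chunk size of 1,000 is simply the value mentioned above:

use BaseXClient;   # assumption: the official BaseX Perl client (Session package)

my $chunk_size = 1000;   # reopen the session after this many imports
my $session;
my $count = 0;

for my $file (@allFiles) {
    # close the previous session and start a fresh one at every chunk boundary
    if ($count % $chunk_size == 0) {
        $session->close() if defined $session;
        $session = Session->new($host, $port, $user, $pw);
    }
    my $database_name = $file . "name";
    $session->execute("CREATE DB $database_name $file");
    $session->execute("CLOSE");
    $count++;
}

$session->close() if defined $session;

The per-file CLOSE mirrors the original pseudocode; only the periodic reconnect is 
new, so whatever state a session accumulates is dropped every 1,000 databases.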

Also, the memory limit that is set for different servers only applies to 
*that* BaseX server, right, and not to all BaseX servers running on a single 
machine? If I am running 6 servers on different ports on a single machine, does 
a memory limit of, say, 512MB mean that each instance is allocated 512MB, 
or that 512MB is distributed among all BaseX instances?


Kind regards

Bram



Re: [basex-talk] Creating more than a million databases per session: Out Of Memory

2016-10-16 Thread Christian Grün
Hi Bram,

I second Marco's advice to find a good compromise between single
databases and single documents.

Regarding the OOM, the stack trace could possibly be helpful for
judging what might go wrong in your setup.

Cheers
Christian


On Sat, Oct 15, 2016 at 4:19 PM, Marco Lettere <m.lett...@gmail.com> wrote:
> Hi Bram,
> Not being much into the issue of creating databases at this scale, I'm not
> sure whether the OOM problems you are facing are related to BaseX or to the
> JVM, actually.
> Anyway, something rather simple you could try is to behave "in between":
> instead of opening a single session for all the create statements altogether,
> or one session for each and every create, you could split your create
> statements into chunks of 100/1000 or the like and distribute them over
> subsequent (or maybe even parallel?) sessions.
> I'm not sure whether this is applicable to your use case, though.
> Regards,
> Marco.
>
>
> On 15/10/2016 10:48, Bram Vanroy | KU Leuven wrote:
>
> Hi all
>
>
>
> I’ve talked before about how we restructured our data to drastically improve
> search times on a 500-million-token corpus. [1] Now, after some minor
> improvements, I am trying to import the generated XML files into BaseX. The
> result would be hundreds of thousands to millions of BaseX databases – as we
> expect. When doing the import, though, I am running into OOM errors. We set
> our memory limit to 512MB. The thing is that this seems incredibly odd to me:
> because we are creating so many different databases, which are all really
> small as a consequence, I would not expect BaseX to need to store much in
> memory. After each database is created, the garbage collector can come along
> and remove everything that was needed for the previously generated database.
>
>
>
> A solution, I suppose, would be to close and reopen the BaseX session on each
> creation, but I’m afraid that (on such a huge scale) the impact on speed
> would be too large. How it is set up now, in pseudocode:
>
>
>
> --------------------------------------------------
>
> use BaseXClient;   # assumes the BaseX Perl client (Session package in BaseXClient.pm)
>
> # $host, $port, $user and $pw hold the connection details
> my $session = Session->new($host, $port, $user, $pw);
>
> # @allFiles is at least 100,000 items
> for my $file (@allFiles) {
>     my $database_name = $file . "name";
>     # create a small database from $file, then close it right away
>     $session->execute("CREATE DB $database_name $file");
>     $session->execute("CLOSE");
> }
>
> $session->close();
>
> --------------------------------------------------
>
>
>
> So all databases are created on the same session, which I believe causes the
> issue. But why? What is still required in memory after ->execute("CLOSE")?
> Are the indices for the generated databases stored in memory? If so, can we
> force them to be written to disk?
>
>
>
> Any thoughts on this are appreciated. Enlightenment on what is stored in a
> Session's memory would be useful as well. Increasing the memory should be a
> last resort.
>
>
>
>
>
> Thank you in advance!
>
>
>
> Bram
>
>
>
>
>
> [1]:
> http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf#page=20
>
>
>
>