Hi Bram,
not knowing much about creating databases at this scale, I'm not sure
whether the OOM problems you are facing are actually related to BaseX
or to the JVM.
Anyway, something rather simple you could try is a middle ground:
instead of opening a single session for all the create statements
together, or one session for each and every create, you could split
your create statements into chunks of 100 or 1,000 (or the like) and
distribute them over subsequent (or maybe even parallel?) sessions, as
in the sketch below.
I'm not sure whether this is applicable to your use case, though.
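A minimal sketch of what I mean, assuming the Session class from the
BaseX Perl client (BaseXClient.pm); the chunk size and the connection
details are placeholders to tune:

--------------------------------------------------------------------------------
my $chunk_size = 1000;    # e.g. 100 or 1,000 creates per session

my $session;
for my $i (0 .. $#allFiles) {
    # start a fresh session at every chunk boundary, so whatever the
    # previous session held in memory can be reclaimed
    if ($i % $chunk_size == 0) {
        $session->close() if defined $session;
        $session = Session->new("localhost", 1984, "admin", "admin");
    }
    my $file          = $allFiles[$i];
    my $database_name = $file . "name";    # naming scheme from your pseudo code
    $session->execute("CREATE DB $database_name $file");
    $session->execute("CLOSE");
}
$session->close() if defined $session;
--------------------------------------------------------------------------------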
Regards,
Marco.
On 15/10/2016 10:48, Bram Vanroy | KU Leuven wrote:
Hi all
I've talked before about how we restructured our data to drastically
improve search times on a 500-million-token corpus. [1] Now, after
some minor improvements, I am trying to import the generated XML files
into BaseX. The result would be 100,000s to millions of BaseX
databases, as we expect. When doing the import, though, I am running
into OOM errors. We set our memory limit to 512 MB. The thing is that
this seems
incredibly odd to me: because we are creating so many different
databases, which are all really small as a consequence, I would not
expect BaseX to need to store much in memory. After each database is
created, the garbage collector can come along and remove everything
that was needed for the previously generated database.
A solution, I suppose, would be to close and reopen the BaseX session
for each creation (a sketch of that variant follows the pseudo code
below), but I'm afraid that (at such a huge scale) the impact on speed
would be too large. Here is how it is set up now, in pseudo code:
--------------------------------------------------------------------------------
my $session = Session->new($host, $port, $user, $pw);

# @allFiles contains at least 100,000 items
for my $file (@allFiles) {
    # derive a database name from the file name
    my $database_name = $file . "name";
    $session->execute("CREATE DB $database_name $file");
    $session->execute("CLOSE");
}
$session->close();
--------------------------------------------------------------------------------
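For reference, the per-creation variant I am hesitant about would look
roughly like this (a sketch only, assuming the Session class from the
BaseX Perl client; the extra connect/authenticate round trip per file
is what worries me speed-wise):

--------------------------------------------------------------------------------
for my $file (@allFiles) {
    # a fresh connection per database: everything the session holds is
    # released once it is closed, at the cost of one TCP round trip per file
    my $session = Session->new($host, $port, $user, $pw);
    my $database_name = $file . "name";
    $session->execute("CREATE DB $database_name $file");
    $session->execute("CLOSE");
    $session->close();
}
--------------------------------------------------------------------------------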
So all databases are created on the same session, which I believe
causes the issue. But why? What is still required in memory after
->execute("CLOSE")? Are the indices for the generated databases stored
in memory? *If so, can we force them to be written to disk?*
Any thoughts on this are appreciated. An explanation of what is stored
in a Session's memory would be useful as well. Increasing the memory
limit should be a last resort.
Thank you in advance!
Bram
[1]:
http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf#page=20