Hi all,

I've written before about how we restructured our data to drastically improve search times on a 500-million-token corpus [1]. Now, after some minor improvements, I am trying to import the generated XML files into BaseX. As expected, the result would be hundreds of thousands to millions of BaseX databases. During the import, though, I am running into OOM errors. We have set our memory limit to 512 MB.

This seems incredibly odd to me: because we are creating so many different databases, which are all really small as a consequence, I would not expect BaseX to need to keep much in memory. After each database is created, the garbage collector should be able to come along and remove everything that was needed for the previously generated database.

A solution, I suppose, would be to close and reopen the BaseX session for each creation, but I'm afraid that (on such a huge scale) the impact on speed would be too large. This is how it is set up now, in pseudo code:

--------------------------------------------------------------------------------
my $session = Session->new(host, port, user, pw);

# @allFiles contains at least 100,000 items
for my $file (@allFiles) {
    my $database_name = $file . "name";
    $session->execute("CREATE DB $database_name file");
    $session->execute("CLOSE");
}

$session->close();
--------------------------------------------------------------------------------

So all databases are created in the same session, which I believe causes the issue. But why? What is still required in memory after ->execute("CLOSE")? Are the indices for the generated databases kept in memory? If so, can we force them to be written to disk?

Any thoughts on this are appreciated. Insight into what is stored in a session's memory would be useful as well. Increasing the memory should be a last resort.

Thank you in advance!

Bram

[1]: http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf#page=20