To follow up:

I got use of a system with 512GB of RAM, which is not nearly as helpful as 
I'd like.  The JVM is still perfectly happy to eat all of it and go commit 
seppuku.

The first thing I had to do was split the Lucene index into chunks.  Trying 
to parse 4700 files of 10,000 documents each resulted in Lucene 
continuously merging, which slowly consumed all available memory until 
there was no longer enough to open a new connection to the database.  By 
instead having Lucene parse each file into a separate directory, I could 
defer merging until after the parsing was complete.  That allowed me to, at 
least, complete the initial step of parsing the files and building the 
temporary tables.

Now I'm trying to copy the temporary tables into permanent tables via the 
CREATE TABLE ... AS SELECT DISTINCT command, but some of the tables are 
like 400+ GB in size, so I'm still having some trouble with running out of 
memory.  Thus far my strategy has been to split the database correction 
steps into smaller and smaller chunks, which allows me to shut down the 
application and free all the memory that, for whatever reason, the JVM 
can't seem to garbage collect.  I'm now up to 15 distinct steps, each 
taking a full day to run, so it takes quite a while to fix any problems I 
run into.

I did try the idea of using a MySQL database and connecting to it instead 
of using H2.  The results I got suggest that using H2 in MySQL mode is 
faster than running MySQL as a service on localhost.  Memory usage is about 
the same.  I shifted back to using H2 and got some multi-threading support 
by writing my temporary tables into separate H2 databases, so that the 
threads aren't blocked by file writes.

If I get to the point that I can't subdivide my database activities any 
further and I still can't get it done with half a terabyte of RAM, I'm 
really not sure where to go next.  I think my best option is probably to 
use the SCRIPT command to write the table out to a file, then try to do my 
own file parsing to copy only the distinct rows.

-- 
You received this message because you are subscribed to the Google Groups "H2 
Database" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to h2-database+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/h2-database/8b291c44-39e7-4d20-a9ea-63071fe8fb08%40googlegroups.com.

Reply via email to