It might be easier to use the unix uniq or sort -u commands (or something similar), followed by a plain CREATE TABLE ... AS SELECT.
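Roughly along these lines (just a sketch; the table, column, and file names are invented, the header row CSVWRITE emits has to be kept out of the sort, and CSVREAD returns everything as VARCHAR unless you cast):

    -- export the temporary table to a flat file
    CALL CSVWRITE('/tmp/tmp_docs.csv', 'SELECT * FROM TMP_DOCS');

    -- outside the database, de-duplicate the rows, e.g.:
    --   tail -n +2 /tmp/tmp_docs.csv | sort -u > /tmp/tmp_docs_dedup.csv

    -- re-import; column names are listed explicitly because the sorted
    -- file no longer carries a header row
    CREATE TABLE DOCS AS
        SELECT * FROM CSVREAD('/tmp/tmp_docs_dedup.csv', 'ID,TEXT');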
DISTINCT is problematic because it requires building a huge set before performing the insert. Alternatively, use the MERGE command to avoid the DISTINCT step altogether.

You can also turn off the undo log with SET UNDO_LOG 0 and the transaction log with SET LOG 0, at the expense of having a dead DB if anything goes wrong, but these commands can be quite useful during initial population. Don't forget to re-enable them when you are done. (A quick sketch of these commands follows the quoted message below.)

On Mon, 18 Nov 2019 at 16:38, Tim Fielder <[email protected]> wrote:

> To follow up:
>
> I got use of a system with 512GB of RAM, which is not nearly as helpful
> as I'd like. The JVM is still perfectly happy to eat all of it and go
> commit seppuku.
>
> The first thing I had to do was split the Lucene index into chunks.
> Trying to parse 4700 files of 10,000 documents each resulted in Lucene
> continuously merging, which slowly consumed all available memory until
> there was no longer enough to open a new connection to the database. By
> instead having Lucene parse each file into a separate directory, I could
> defer merging until after the parsing was complete. That allowed me to,
> at least, complete the initial step of parsing the files and building the
> temporary tables.
>
> Now I'm trying to copy the temporary tables into permanent tables via the
> CREATE TABLE ... AS SELECT DISTINCT command, but some of the tables are
> like 400+ GB in size, so I'm still having some trouble with running out
> of memory. Thus far my strategy has been to split the database correction
> steps into smaller and smaller chunks, which allows me to shut down the
> application and free all the memory that, for whatever reason, the JVM
> can't seem to garbage collect. I'm now up to 15 distinct steps, each
> taking a full day to run, so it takes quite a while to fix any problems I
> run into.
>
> I did try the idea of using a MySQL database and connecting to it instead
> of using H2. The results I got suggest that using H2 in MySQL mode is
> faster than running MySQL as a service on localhost. Memory usage is
> about the same. I shifted back to using H2 and got some multi-threading
> support by writing my temporary tables into separate H2 databases, so
> that the threads aren't blocked by file writes.
>
> If I get to the point that I can't subdivide my database activities any
> further and I still can't get it done with half a terabyte of RAM, I'm
> really not sure where to go next. I think my best option is probably to
> use the SCRIPT command to write the table out to a file, then try to do
> my own file parsing to copy only the distinct rows.
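For reference, the MERGE and logging suggestions above would look something like this (again only a sketch with made-up table and column names; it assumes the target table already exists with a primary key on ID, and the values used to re-enable the logs are worth checking against the docs for your H2 version):

    SET UNDO_LOG 0;   -- no rollback possible while this is off
    SET LOG 0;        -- no transaction log: a crash leaves a dead DB

    -- MERGE updates rows that already exist for the given key, so
    -- duplicate keys collapse instead of needing a DISTINCT pass
    MERGE INTO DOCS (ID, TEXT) KEY (ID)
        SELECT ID, TEXT FROM TMP_DOCS;

    SET LOG 2;        -- re-enable the transaction log
    SET UNDO_LOG 1;   -- re-enable the undo log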
