It might be easier to use the unix uniq or sort -u commands, or something
similar, followed by
   CREATE TABLE ... AS SELECT

DISTINCT is problematic because it requires building a huge set before
performing the insert.
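
A rough sketch of the external-dedup route (file and table names are only
placeholders, and this assumes the raw data is already one row per line with
no header; CSVREAD is H2's function for reading a CSV file as a table):

   sort -u raw_rows.csv > distinct_rows.csv

and then, from the H2 console or over JDBC:

   CREATE TABLE my_table AS
       SELECT * FROM CSVREAD('distinct_rows.csv');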

Or use the MERGE command to avoid the DISTINCT step.
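
Something along these lines (table and key names are purely illustrative):
MERGE updates rows whose key already exists rather than inserting them again,
so duplicates coming out of the temp table collapse on the key without a
separate DISTINCT pass.

   MERGE INTO target_table KEY(id)
       SELECT id, col1, col2 FROM temp_table;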

Also turn off the undo log with
  SET UNDO_LOG 0
and the transaction log with
   SET LOG 0
at the expense of having a dead DB if anything goes wrong, but these
commands can be quite useful during initial population.
Don't forget to re-enable them when you are done.
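
A minimal sketch of that sequence (the values used when re-enabling are the
usual 1.x defaults for the PageStore engine, where these settings apply;
check the documentation for your version):

   SET UNDO_LOG 0;
   SET LOG 0;
   -- bulk population goes here (inserts, CREATE TABLE ... AS SELECT, etc.)
   SET LOG 2;
   SET UNDO_LOG 1;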



On Mon, 18 Nov 2019 at 16:38, Tim Fielder <[email protected]> wrote:

> To follow up:
>
> I got use of a system with 512GB of RAM, which is not nearly as helpful as
> I'd like.  The JVM is still perfectly happy to eat all of it and go commit
> seppuku.
>
> The first thing I had to do was split the Lucene index into chunks.
> Trying to parse 4700 files of 10,000 documents each resulted in Lucene
> continuously merging, which slowly consumed all available memory until
> there was no longer enough to open a new connection to the database.  By
> instead having Lucene parse each file into a separate directory, I could
> defer merging until after the parsing was complete.  That allowed me to, at
> least, complete the initial step of parsing the files and building the
> temporary tables.
>
> Now I'm trying to copy the temporary tables into permanent tables via the
> CREATE TABLE ... AS SELECT DISTINCT command, but some of the tables are
> like 400+ GB in size, so I'm still having some trouble with running out of
> memory.  Thus far my strategy has been to split the database correction
> steps into smaller and smaller chunks, which allows me to shut down the
> application and free all the memory that, for whatever reason, the JVM
> can't seem to garbage collect.  I'm now up to 15 distinct steps, each
> taking a full day to run, so it takes quite a while to fix any problems I
> run into.
>
> I did try the idea of using a MySQL database and connecting to it instead
> of using H2.  The results I got suggest that using H2 in MySQL mode is
> faster than running MySQL as a service on localhost.  Memory usage is about
> the same.  I shifted back to using H2 and got some multi-threading support
> by writing my temporary tables into separate H2 databases, so that the
> threads aren't blocked by file writes.
>
> If I get to the point that I can't subdivide my database activities any
> further and I still can't get it done with half a terabyte of RAM, I'm
> really not sure where to go next.  I think my best option is probably to
> use the SCRIPT command to write the table out to a file, then try to do my
> own file parsing to copy only the distinct rows.
>
