Sorry, I should have been more clear... :)

The shards are irrelevant to this discussion, I suppose, but just for
a bit of brief background:

I have 64 H2 instances over 8 servers.  These servers run Hadoop, and
the data the Hadoop jobs extract is distributed evenly across the H2
instances.  We tried many other options for achieving this, but doing
it this way was just too screaming fast to pass up.

Memory has thus far not been an issue on any of these databases; each
gets 3G and hasn't even come close to using more than 1G while we're
pounding data into them.  The servers have TBs of storage, so that's
not an issue either.

What tends to happen is that this setup is insanely fast when the
databases and tables are all empty (naturally), but once the tables
get beyond 500,000 rows per shard, performance becomes brutally slow.
The test I've been using to check this is pretty dead simple: I read
in one batch of our data, then use CSVWRITE to dump the table to a
CSV file.  After that I just keep using INSERT INTO with CSVREAD to
add more data.  The first batch insert takes less than a second, but
by the time I'm at 5 million records (adding 1 million at a clip) the
inserts take close to 30 minutes.
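
Roughly, the test looks like this (the table and file names here are
just placeholders, not our real schema):

  -- dump the first batch out of H2
  CALL CSVWRITE('/tmp/batch.csv', 'SELECT * FROM EVENTS');

  -- then repeat this for each additional 1-million-row batch
  INSERT INTO EVENTS SELECT * FROM CSVREAD('/tmp/batch.csv');

It's that second statement that goes from under a second to close to
30 minutes as the table grows.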

Now the solution I've been working on to get around this is to keep
the tables small (already somewhat done via the sharding), dump them
out to CSVs every hour or so, and push those into a MySQL (or
Postgres) instance outside the cluster - a database that is slower,
yes, but handles larger tables more consistently.  MySQL takes 10
minutes to do what H2 does in 3 minutes when the tables are empty,
but once the tables get over 5 million records, MySQL still takes 10
minutes where H2 takes 30.  It seems like this combo (although
annoying) plays to the strengths of each db and is just crazy enough
to work.
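
As a rough sketch of what that hourly hand-off would look like (again,
table and path names are placeholders):

  -- on each H2 shard
  CALL CSVWRITE('/exports/events_shard01.csv', 'SELECT * FROM EVENTS');

  -- on the MySQL side (CSVWRITE includes a header row by default,
  -- hence the IGNORE 1 LINES)
  LOAD DATA LOCAL INFILE '/exports/events_shard01.csv'
    INTO TABLE events
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;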

After using H2 for the bulk of this project, I've been spoiled by its
ease of use, elegance and speed.  I'd love to use H2 the whole way
through and don't want to go back to MySQL or Postgres - so I'd love
to know if anyone has suggestions for dealing with massive data sets
in H2 without degrading performance to the point of making it
useless.

Again, thanks for any help and thanks for reading.  I can provide
more technical data as needed, of course, but I'm trying to keep the
specifics out of the group discussion to protect my employer's
interests.

Rock on,

Brian

On Oct 26, 12:19 pm, "Thomas Mueller" <[EMAIL PROTECTED]>
wrote:
> Hi,
>
> > Once my db shards grow beyond 500,000 records
>
> Sorry, what is a 'db shard'?
>
> > Inserting 5 million records over 64 clean db shards happens in
> > 3 minutes.
>
> So you mean 64 databases?
>
> >  After those shards get over 500,000 records - the whole
> > insert takes 29 minutes.
>
> Maybe the system runs out of physical memory?
>
> Regards,
> Thomas