Sorry Julian, I didn't mean to steal your thunder :)  I think we might
be experiencing separate issues and perhaps I should split this
question out into another topic.

On Oct 26, 4:44 pm, brianstarke <[EMAIL PROTECTED]> wrote:
> Sorry, I should have been more clear... :)
>
> The shards are irrelevant to this discussion, I suppose, but here's
> a brief bit of background:
>
> I have 64 H2 instances across 8 servers.  These servers are running
> Hadoop, and the data the Hadoop jobs extract is being distributed
> evenly across them.  We tried many other options for
> achieving this task, but doing it this way was just too screaming fast
> to pass up.
>
> Memory has thus far not been an issue on any of these databases:
> each gets 3 GB and hasn't even come close to using more than 1 GB
> while we're pounding data into them.  The servers have TBs of
> storage, so that's not an issue either.
>
> What tends to happen is that this setup is insanely fast when the
> databases and tables are all empty (naturally), but as the tables get
> beyond 500,000 rows per shard, the performance gets brutally slow.
> The test I've been using to check this performance is dead simple:
> I read in one batch of our data, then use CSVWRITE to dump the table
> to a CSV file.  After that I just keep running INSERT INTO with
> CSVREAD to add more data.  The first batch insert takes less than a
> second, but by the time I'm at 5 million records (adding 1 million
> at a clip) the inserts take close to 30 minutes.
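>
> In rough JDBC terms, the timing test looks something like the sketch
> below (the table name, file path, and connection URL are all
> placeholders here, since I'm keeping the real specifics out of the
> group):
>
>     // Simplified sketch of the timing test (all names are made up).
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.Statement;
>
>     public class CsvLoadTest {
>         public static void main(String[] args) throws Exception {
>             Class.forName("org.h2.Driver");
>             Connection conn =
>                 DriverManager.getConnection("jdbc:h2:/data/shard01", "sa", "");
>             Statement stmt = conn.createStatement();
>             // Dump the first batch (loaded by the Hadoop job) to a CSV file.
>             stmt.execute("CALL CSVWRITE('/tmp/batch.csv', 'SELECT * FROM EVENTS')");
>             // Keep re-inserting the same batch to grow the table, timing each round.
>             for (int i = 1; i <= 5; i++) {
>                 long start = System.currentTimeMillis();
>                 stmt.execute("INSERT INTO EVENTS SELECT * FROM CSVREAD('/tmp/batch.csv')");
>                 System.out.println("batch " + i + ": "
>                     + (System.currentTimeMillis() - start) + " ms");
>             }
>             stmt.close();
>             conn.close();
>         }
>     }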
>
> Now the solution I've been working on to get around this is to keep
> the tables small (already somewhat done via the sharding), dump them
> out to CSVs every hour or so, and push those into a MySQL (or
> Postgres) instance outside the cluster.  That database is slower,
> yes, but it handles larger table sizes more consistently.  MySQL
> takes 10 minutes to do what H2 does in 3 minutes when the tables are
> empty; once the tables get over 5 million records, though, MySQL
> still takes 10 minutes where H2 takes 30.  It seems like this combo
> (although annoying) plays to the strengths of each db and is just
> crazy enough to work.
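>
> For the hourly hand-off I'm picturing something like the sketch
> below, run against each shard (again, connection URLs, table names,
> and file paths are placeholders, and clearing the shard after the
> export is just my assumption about how we'd keep the tables small):
>
>     // Rough sketch of the hourly export/import (all names are made up).
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.Statement;
>
>     public class HourlyHandoff {
>         public static void main(String[] args) throws Exception {
>             // 1. Dump the shard's table out of H2 as a CSV file.
>             Connection h2 =
>                 DriverManager.getConnection("jdbc:h2:/data/shard01", "sa", "");
>             Statement h2Stmt = h2.createStatement();
>             h2Stmt.execute("CALL CSVWRITE('/exports/shard01.csv', 'SELECT * FROM EVENTS')");
>             h2Stmt.execute("DELETE FROM EVENTS"); // assumption: keep the shard small
>             h2Stmt.close();
>             h2.close();
>
>             // 2. Bulk-load the CSV into the MySQL instance outside the cluster.
>             Connection mysql = DriverManager.getConnection(
>                 "jdbc:mysql://archive-host/warehouse", "user", "password");
>             Statement myStmt = mysql.createStatement();
>             myStmt.execute("LOAD DATA LOCAL INFILE '/exports/shard01.csv' INTO TABLE events "
>                 + "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES");
>             myStmt.close();
>             mysql.close();
>         }
>     }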
>
> After using H2 for the bulk of this project, I've been spoiled by
> its ease of use, elegance, and speed.  I'd love to use H2 the whole
> way through and don't want to go back to MySQL or Postgres, so I'd
> love to know if anyone has suggestions for dealing with massive data
> sets in H2 without the performance degrading to the point of making
> it useless.
>
> Again, thanks for any help and thanks for reading.  I can provide
> more technical data as needed, of course, but I'm trying to keep the
> specifics out of the group discussion to protect my employer's
> interests.
>
> Rock on,
>
> Brian
>
> On Oct 26, 12:19 pm, "Thomas Mueller" <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
>
> > > Once my db shards grow beyond 500,000 records
>
> > Sorry, what is a 'db shard'?
>
> > > Inserting 5 million records over 64 clean db shards happens in
> > > 3 minutes.
>
> > So you mean 64 databases?
>
> > >  After those shards get over 500,000 records - the whole
> > > insert takes 29 minutes.
>
> > Maybe the system runs out of physical memory?
>
> > Regards,
> > Thomas