Sorry, I should have been clearer... :) The shards are irrelevant to this discussion, I suppose, but just for a brief background:
I have 64 H2 instances across 8 servers. The servers are running Hadoop, and the data the Hadoop jobs extract is distributed evenly across those instances. We tried many other options for this task, but doing it this way was just too screaming fast to pass up.

Memory has not been an issue so far: each database gets 3G and hasn't even come close to using 1G while we're pounding data into it. The servers have TBs of storage, so that's not an issue either. What happens is that this setup is insanely fast while the databases and tables are empty (naturally), but once the tables get beyond 500,000 rows per shard, performance gets brutally slow.

The test I've been using to check this is dead simple: I read in one batch of our data, use CSVWRITE to dump the table to a CSV file, and then keep adding more data with INSERT INTO ... CSVREAD (the pattern is sketched in the P.S. below). The first batch insert takes less than a second, but by the time I'm at 5 million records (adding 1 million at a clip) each insert takes close to 30 minutes.

The workaround I've been working on is to keep the tables small (already somewhat done via the sharding), dump them out to CSVs every hour or so, and push those into a MySQL (or Postgres) instance outside the cluster - a database that is slower, yes, but handles larger table sizes more consistently. MySQL takes 10 minutes to do what H2 does in 3 when the tables are empty; once the tables pass 5 million records, though, MySQL still takes 10 minutes where H2 takes 30. This combo (although annoying) plays to the strengths of each database and is just crazy enough to work.

After using H2 for the bulk of this project, I've been spoiled by its ease of use, elegance and speed. I'd love to use H2 the whole way through and don't want to go back to MySQL or Postgres, so I'd love to know if anyone has suggestions for dealing with massive data sets in H2 without the performance degrading to the point of making it useless.

Again, thanks for any help and thanks for reading. I can provide more technical data as needed, of course, but I'm trying to keep the specifics out of the group discussion to protect my employer's interests.

Rock on,
Brian

On Oct 26, 12:19 pm, "Thomas Mueller" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> > Once my db shards grow beyond 500,000 records
>
> Sorry, what is a 'db shard'?
>
> > Inserting 5 million records over 64 clean db shards happens in
> > 3 minutes.
>
> So you mean 64 databases?
>
> > After those shards get over 500,000 records - the whole
> > insert takes 29 minutes.
>
> Maybe the system runs out of physical memory?
>
> Regards,
> Thomas
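P.S. For concreteness, the test loop is basically the following pattern. The table and file names here are made up, since I can't post the real schema:

  -- dump one batch of the table to a CSV file
  CALL CSVWRITE('/data/batch.csv', 'SELECT * FROM EVENTS');

  -- then keep appending that same batch to grow the table
  INSERT INTO EVENTS SELECT * FROM CSVREAD('/data/batch.csv');

I time each of those INSERT statements as the table grows; that's where the jump from under a second to ~30 minutes shows up.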
