Sorry Julian, I didn't mean to steal your thunder. :) I think we might be experiencing separate issues, and perhaps I should split this question out into another topic.
On Oct 26, 4:44 pm, brianstarke <[EMAIL PROTECTED]> wrote:
> Sorry, I should have been more clear... :)
>
> The shards are irrelevant to this discussion, I suppose, but just for a brief background:
>
> I have 64 H2 instances over 8 servers. These servers are running Hadoop, and the data that the Hadoop jobs are extracting is being distributed across them evenly. We tried many other options for achieving this task, but doing it this way was just too screaming fast to pass up.
>
> Memory has thus far not been an issue on any of these databases; each gets 3 GB and has never even come close to using more than 1 GB while we're pounding data into them. The servers have terabytes of storage, so that's not an issue either.
>
> What tends to happen is that this setup is insanely fast when the databases and tables are all empty (naturally), but as the tables get beyond 500,000 rows per shard, the performance gets brutally slow. The test I've been using to check this performance is pretty dead simple: I basically read in one batch of our data, then use CSVWRITE to dump the table to a CSV file. After that I just keep using INSERT INTO with CSVREAD to add more data. The first batch insert takes less than a second, but by the time I'm at 5 million records (adding 1 million at a clip) the inserts take close to 30 minutes.
>
> The solution I've been working on to get around this is to keep the tables small (already somewhat done via the sharding) and to dump the tables out into CSVs every hour or so and push them into a MySQL (or Postgres) instance outside the cluster, so they can be imported by a database that is slower, yes, but can handle larger table sizes more consistently. MySQL takes 10 minutes to do what H2 does in 3 minutes when the tables are empty. When the tables get over 5 million records, though, MySQL still takes 10 minutes where H2 takes 30. It seems like this combo (although annoying) plays to the strengths of each db and is just crazy enough to work.
>
> After using H2 for the bulk of this project, I've been spoiled by its ease of use, elegance, and speed. I'd love to use H2 the whole way through and don't want to go back to MySQL or Postgres, so I'd love to know if anyone has suggestions for dealing with massive data sets in H2 without degrading the performance to the point of making it useless.
>
> Again, thanks for any help and thanks for reading. I can provide more technical data as needed, of course, but I'm trying to keep the specifics out of the group discussion to protect my employer's interests.
>
> Rock on,
>
> Brian
>
> On Oct 26, 12:19 pm, "Thomas Mueller" <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > > Once my db shards grow beyond 500,000 records
> >
> > Sorry, what is a 'db shard'?
> >
> > > Inserting 5 million records over 64 clean db shards happens in 3 minutes.
> >
> > So you mean 64 databases?
> >
> > > After those shards get over 500,000 records - the whole insert takes 29 minutes.
> >
> > Maybe the system runs out of physical memory?
> >
> > Regards,
> > Thomas
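For anyone following along, here is roughly the CSVWRITE / CSVREAD batch-loading pattern described above, as a minimal self-contained sketch over JDBC. The JDBC URL, the table name EVENTS, and the file name are placeholders for illustration, not details from my actual setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvBatchLoadSketch {
    public static void main(String[] args) throws Exception {
        // Open one of the embedded H2 shard databases (placeholder URL and credentials).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./shard0", "sa", "");
             Statement st = conn.createStatement()) {

            // Dump the current contents of the table to a CSV file.
            st.execute("CALL CSVWRITE('batch.csv', 'SELECT * FROM EVENTS')");

            // Append another batch by reading the CSV straight back in.
            // In the test above, this step is repeated until the table holds millions of rows.
            st.execute("INSERT INTO EVENTS SELECT * FROM CSVREAD('batch.csv')");
        }
    }
}

The timing numbers quoted above come from repeating that last INSERT ... SELECT FROM CSVREAD step, roughly a million rows per batch.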
