Thanks for your comments! I'm using a custom script that reads the first 
20,000 lines of a CSV file, then generates the H2 import command 
dynamically from that. So it's not a straight "create table X as select * 
from Y"; rather, it includes a proper column type for every column it 
detects.
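
To give a rough idea, the generated DDL looks something like the sketch 
below (the table and column names are invented for illustration, not our 
real schema):

    CREATE TABLE ORDERS (
        ORDER_ID   INT,
        CUSTOMER   VARCHAR(5000),
        AMOUNT     DECIMAL(12, 2),
        CREATED_AT TIMESTAMP
    );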

The string / varchar data could probably be handled better; right now the 
script defaults to VARCHAR(5000), which is still better than the 
Integer.MAX_VALUE length I believe H2 uses by default, but not as small as 
it could be. We have widely variable strings all over the place, though, 
so it's hard to predict a maximum length for the string columns. Sometimes 
the first CSV file will import, but then one of the remaining (of 
hundreds) CSV files fails on the "INSERT" command (also generated at the 
same time, with the same schema as the CREATE TABLE), so we prefer to keep 
the varchars a little bigger to be safe.
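
For reference, the generated INSERT follows the same CSVREAD pattern from 
your example; with the invented names above it would look roughly like:

    INSERT INTO ORDERS (ORDER_ID, CUSTOMER, AMOUNT, CREATED_AT)
        SELECT * FROM CSVREAD('orders.csv', NULL, 'UTF8');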

We do add the indexes after all the data is inserted, though, so at least 
that part looks good.
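
For completeness, those are plain CREATE INDEX statements run once the 
inserts finish, e.g. (again with the made-up names from above):

    CREATE INDEX IDX_ORDERS_CUSTOMER ON ORDERS(CUSTOMER);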

But I guess I'm confused as to why the RAM usage is so much higher than 
the size of the file on disk, when compared to file-based databases. 
Wouldn't a 4 GB .mv.db file use roughly the same amount of RAM when the 
same data is loaded into a "mem" database? Even then, we might be able to 
work around the huge memory usage, as long as we can reclaim the space in 
RAM when we drop a table and reload it from an updated CSV (see the sketch 
below). Right now the "mem" database just continues to grow and grow until 
eventually H2 stops responding (CPU usage increases but the query never 
completes; I'm assuming it simply runs out of available memory).
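
By "drop and reload" I mean nothing fancier than this, run against the 
in-memory database whenever a CSV file is refreshed (names invented again):

    DROP TABLE IF EXISTS ORDERS;
    -- then re-run the generated CREATE TABLE and INSERT ... CSVREAD
    -- statements for the updated orders.csv, as in the initial load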

Unfortunately the data is not allowed to leave the building. I might be 
able to create a small Java program that generates a large database, to 
compare "mem" mode vs "file" mode, but I'm not sure how helpful that would 
be.
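
If it would help, even plain H2 SQL might be enough to produce a 
comparable volume of synthetic data instead of a dedicated Java program; I 
haven't tried this exact statement, but something along these lines (table 
and column names invented), run against both a mem: and a file: database, 
should show whether the blow-up reproduces:

    CREATE TABLE SYNTH_TEST AS
        SELECT X AS ID,
               RANDOM_UUID() AS COL_A,
               RAND()        AS COL_B
        FROM SYSTEM_RANGE(1, 10000000);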


On Tuesday, June 2, 2015 at 2:19:58 AM UTC-4, Christian MICHON wrote:
>
> Are you using some commands like "create table mytable as select * from 
> csvread('myfile.csv')" ?
>
> If so, I have a few possible pieces of advice:
> - create a schema first, with the right types expected for attributes; 
> otherwise each attribute will be cast as text with a 2 GB max allocation.
> - insert like this example:
>   INSERT INTO LANG_CHINESE (ISO ,NAME ) SELECT * FROM 
> CSVREAD('CHINESE.CSV',NULL,'UTF8') ;
> - do not create indexes first, create indexes after the csv import is 
> completed.
>
> Can your data be shared?
>
> On Friday, May 29, 2015 at 7:45:53 PM UTC+2, John Smith wrote:
>>
>> I have several csv files I'm loading up into an h2 instance (running the 
>> beta version of h2, ver 1.4.187).
>>
>> Total bytes on disk for csv files is approx 4 GB.
>>
>> After importing this in h2, adding a few indexes, the size of my h2 mv.db 
>> database file is 3.6 gigs. Same csv files loaded up into an h2 "mem" 
>> database, all in ram, windows server shows usage of 85 gigs. The memory 
>> database just explodes in size compared to the file database, all else 
>> between the databases remains the same (tables + indexes etc). 
>>
>> Does that make sense? The huge difference in size there? Even accounting 
>> for some of that 85 gigs being used as a cache of some sort that might also 
>> be present with the file database (I forgot to test how much ram the h2 
>> process consumed when using file database), why is there such a huge 
>> increase in ram usage? My csv files themselves are only 4 gigs, I add one 
>> or two indexes on each table (there are two tables), I don't see how 4 gb 
>> of csv translates to 85 gigs once loaded in ram?
>>
>> Second thing is speed. I moved my file database to a ram / memory disk. 
>> Then loaded up h2 and executed a query. It took 8.4 minutes. It was a 
>> simple query, no joins, but used one orderby on an unindexed column. The 
>> same query using the h2 mem database returned in 15 seconds; same thing, no 
>> index on the orderby. I get that I could add an index to cut down the query 
>> time, but that's not the point; I wanted to test how much difference in 
>> performance there is between the native mem mode and the h2 file database 
>> stored on a ram disk.
>>
>> So to sum up: 
>> 1: why does h2 use so much more memory in "mem" mode vs the file size on 
>> disk, for the same amount of data? 
>> 2: why would the mem database perform so much quicker than a file 
>> database stored in ram? 
>> 3: if I am required to use the "mem" mode for the performance that we 
>> need, is there any way I can "defrag" the database while it's in ram? After 
>> several drops, deletes, updates etc., the mem based database continues to 
>> grow and I can't shrink it using "shutdown defrag" like I can with a file 
>> based database.
>>
>> Thanks in advance for any thoughts on all this.
>>
>>
>>
