Larry Yogman wrote:
==============
Use case description:
==============
I'm bulk loading Derby tables from in-memory data. I'm looking for maximum
performance on multi-threaded hardware. My columns have many data types, with
the most common being boolean and string. I'm willing to truncate data if
necessary, and I do not need ACID properties. To support later queries, I need
the tables to be indexed by a shared, application-generated, 64-bit primary key.
I'm using Derby 10.2.2.0. I'm having some problems with performance, data
truncation, and boolean data.
========================
Problem 1. Disappointing performance
========================
I create all my tables, then start to insert using prepared SQL statements like
this: INSERT INTO DATA1(DP_ROWID,F1,F2,F3) VALUES (?,?,?,?). I use multiple
threads per table, trying to keep all available CPU cores busy. Each thread
calls executeBatch once for every 1000 inserts. The DP_ROWID field is the
application-generated key (type BIGINT). I create the index on DP_ROWID after
the loading is finished.
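For concreteness, each loader thread looks roughly like this (a trimmed sketch;
the Row class and the rows iterable stand in for my real in-memory data, and
each thread gets its own Connection):

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Minimal sketch of the per-thread batch loop described above. The
    // table and the batch size of 1000 are as in the post; Row and the
    // rows iterable are hypothetical stand-ins for the in-memory data.
    public class Loader implements Runnable {

        static final class Row {   // hypothetical holder for one record
            long rowId;            // application-generated 64-bit key
            short f1;
            String f2;
            BigDecimal f3;
        }

        private final Connection conn;   // one Connection per thread
        private final Iterable<Row> rows;

        Loader(Connection conn, Iterable<Row> rows) {
            this.conn = conn;
            this.rows = rows;
        }

        public void run() {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO DATA1(DP_ROWID,F1,F2,F3) VALUES (?,?,?,?)")) {
                conn.setAutoCommit(false);    // commit per batch, not per row
                int pending = 0;
                for (Row r : rows) {
                    ps.setLong(1, r.rowId);
                    ps.setShort(2, r.f1);
                    ps.setString(3, r.f2);
                    ps.setBigDecimal(4, r.f3);
                    ps.addBatch();
                    if (++pending == 1000) {  // executeBatch every 1000 rows
                        ps.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
                if (pending > 0) {            // flush the final partial batch
                    ps.executeBatch();
                    conn.commit();
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }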
On a box with four dual-core Opteron CPUs, my application spends over 60% of
its time loading Derby tables, about 20% of its time creating Derby indexes,
and less than 20% of its time doing other work. I would like to improve the
Derby performance as much as possible. Some questions:
My derby.properties file includes:
derby.system.durability=test
derby.storage.pageCacheSize=10000
Any additions/changes recommended for my derby.properties settings?
Should the interval of 1000 rows between calls to executeBatch be increased? What's a good value?
Am I hurting myself by using multiple threads per table? Could this be
counterproductive due to locking at the level of the whole table? An alternative way
to parallelize is to write a larger number of smaller tables, but I would
prefer not to do this as it will complicate later queries.
Are there any known issues with creating indexes in parallel? I have not yet
parallelized this part of the work, but once the load is faster the indexing
will be a larger fraction of the remaining time.
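In case it helps frame the question, this is roughly how I would parallelize
the indexing across tables, one connection per task (untested; the JDBC URL
and the table list are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: one CREATE INDEX per table, each task on its
    // own connection. Whether this is safe and faster in Derby is exactly
    // the open question above.
    final class ParallelIndexer {
        static void indexAll(String url, List<String> tables, int threads)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (String table : tables) {
                pool.submit(() -> {
                    try (Connection conn = DriverManager.getConnection(url);
                         Statement s = conn.createStatement()) {
                        s.execute("CREATE INDEX " + table + "_IDX ON "
                                + table + "(DP_ROWID)");
                    } catch (SQLException e) {
                        throw new RuntimeException(e);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }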
=================
Problem 2. Data truncation.
=================
The data inserted includes java.lang.String objects of unknown, varying length
and java.math.BigDecimal objects of unknown, potentially varying precision and
scale. The corresponding column types are currently VARCHAR(100) and
DECIMAL(31,5), but I sometimes still get truncation errors.
Ideal would be a way to preserve the values exactly as they are, without
truncation. Can this be done using appropriate declarations of column data
types when creating the table? If so, what's the performance impact?
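For example, I could declare wider columns like this, though I understand
Derby caps DECIMAL at precision 31, so this would not make numeric values
truly unbounded (the sizes here are illustrative):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Illustrative DDL with wider declarations. VARCHAR in Derby can go
    // up to 32672 characters (LONG VARCHAR or CLOB beyond that); DECIMAL
    // is capped at precision 31.
    final class WideTable {
        static void create(Connection conn) throws SQLException {
            try (Statement s = conn.createStatement()) {
                s.execute("CREATE TABLE DATA1 ("
                        + "DP_ROWID BIGINT, "
                        + "F1 SMALLINT, "
                        + "F2 VARCHAR(32672), "
                        + "F3 DECIMAL(31,5))");
            }
        }
    }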
For my use case, data truncation is acceptable, but failing with an error about
data truncation is not, so second best would be a way to truncate as needed.
I'm hoping for a property to configure this behavior for the whole database or
system.
If the approaches above do not work, then I would appreciate advice on the
relative efficiency of:
A) Changing my INSERT statements to use SQL casts
B) Using the existing INSERT statements, and truncating values in Java code
before calling PreparedStatement.setX (sketched below)
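To illustrate option B, this is the kind of clamping I have in mind, matched
to the current VARCHAR(100) and DECIMAL(31,5) declarations (a sketch only;
values with more than 26 integer digits would still need an application-level
decision):

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    // Sketch of option B: clamp values in Java before calling
    // PreparedStatement.setX. Limits match VARCHAR(100) and DECIMAL(31,5).
    final class Clamp {
        static String toVarchar100(String s) {
            return s.length() <= 100 ? s : s.substring(0, 100);
        }

        static BigDecimal toDecimal31x5(BigDecimal d) {
            // DECIMAL(31,5) keeps 5 fractional digits; drop the rest.
            BigDecimal scaled = d.setScale(5, RoundingMode.DOWN);
            // 31 - 5 = 26 digits remain for the integer part; anything
            // wider still cannot fit and must be handled separately.
            if (scaled.precision() - scaled.scale() > 26) {
                throw new IllegalArgumentException("integer part too wide: " + d);
            }
            return scaled;
        }
    }

For option A, the statement would presumably become something like INSERT INTO
DATA1(DP_ROWID,F1,CAST(? AS VARCHAR(100)),F3)-style casts on the affected
parameters, assuming Derby's CAST truncates rather than raising the same
error, which I would verify first.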
================
Problem 3. Boolean data
================
I have a lot of boolean columns. I assume that using the most space-efficient
representation will give maximum performance when loading. However, I also
need to be able to query for true vs. false.
I started out trying to declare BOOLEAN columns, but this data type is not
supported. I'm currently using one SMALLINT column per boolean field, and
querying for 0 vs 1. Should I switch to CHAR(1) or CHAR(1) FOR BIT DATA? What
would queries look like? I also considered packing 16 bits into each SMALLINT
column, but there do not seem to be any shift or mask operations available in
SQL, so I don't see how to query for true vs. false.
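For reference, my current queries against the SMALLINT flags look like this
(F1 standing in for one boolean column):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Sketch of querying a SMALLINT column used as a boolean
    // (0 = false, 1 = true), per the representation described above.
    final class FlagQuery {
        static int countTrue(Connection conn) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT COUNT(*) FROM DATA1 WHERE F1 = 1");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }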
Hi Larry -
I can't address your problems from direct experience with inserting lots of
data using multiple threads, but I want to put out some ideas that may help
you, based on my other experience using Derby.
Allocation of disk space can be a significant overhead when a lot of
data is inserted. Pre-allocate the space needed by setting
derby.storage.initialPages when creating the tables.
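For example (untested sketch; '1000' is only a placeholder value, size it from
the expected row count):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Sketch: pre-allocate pages for tables created after the property
    // is set. The value '1000' is illustrative, not a recommendation.
    final class Preallocate {
        static void beforeCreate(Connection conn) throws SQLException {
            try (Statement s = conn.createStatement()) {
                s.execute("CALL SYSCS_UTIL.SYSCS_SET_DATABASE_PROPERTY("
                        + "'derby.storage.initialPages', '1000')");
                // CREATE TABLE statements issued after this point pick
                // up the new default.
            }
        }
    }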
If you could use the IMPORT procedures, the data load would be faster, but I
don't know how you could utilize that feature with in-memory data.
Unless I misunderstand, the answer to this is 'yes':
> Ideal would be a way to preserve the values exactly as they are, without truncation. Can this be done using appropriate declarations of column data types when creating the table?
I wouldn't expect much impact, but the best way to tell is to test it:
> If so, what's the performance impact?
And I believe that string data types do truncate, so I assume the failure you
are encountering is with the DECIMAL values. If truncation is OK, then perhaps
only the DECIMAL values need to be managed.
SMALLINT is the recommended data type to use for BOOLEAN values.
HTH