Hi, the problem we are trying to solve with Protocol Buffers is
serializing a large relational database into protobuf message files. We
have millions of rows in the database (between 100 and 500 million, or even
more), which we are trying to serialize in chunks of 10,000 (or 5,000) rows
per file. A 10,000-row protobuf message comes to around 1.2 MB, whereas a
5,000-row message comes to up to 650 KB.
The slowness starts once we have processed more than about 1,000 to 1,100
files; after that, serialization keeps slowing down as we progress. For
example, serializing the 100th file takes around 100 ms, the 500th file
around 300 ms, the 1000th file around 1000 ms, and so on.
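To give a sense of scale, here is a rough back-of-the-envelope sketch of how many files and how much output the numbers above imply. The class and method names (ChunkEstimate, fileCount, totalMegabytes) are made up for illustration and are not part of our code; the row counts and the 1.2 MB per 10,000-row file are the figures quoted above.

```java
public class ChunkEstimate {
    // Number of files needed for totalRows at rowsPerFile rows each (rounded up).
    public static long fileCount(long totalRows, long rowsPerFile) {
        return (totalRows + rowsPerFile - 1) / rowsPerFile;
    }

    // Approximate total output in megabytes, given an approximate per-file size in MB.
    public static double totalMegabytes(long files, double mbPerFile) {
        return files * mbPerFile;
    }

    public static void main(String[] args) {
        long[] rowCounts = {100_000_000L, 500_000_000L}; // range quoted in the post
        for (long rows : rowCounts) {
            long files = fileCount(rows, 10_000);        // 10,000 rows per file
            System.out.printf("%d rows -> %d files, ~%.0f MB%n",
                    rows, files, totalMegabytes(files, 1.2));
        }
    }
}
```

So a full run is roughly 10,000 to 50,000 files, which means a slowdown that sets in around file 1,000 affects the vast majority of the job.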
Most of the time is spent inserting and packing:
    // Message builder, created once per chunk (i.e. once per output file)
    TableData.Builder tableDataBuilder = TableData.newBuilder().setName(tableName);
    int rowsInChunk = 0;
    while (loop) {
        // each row
        DataRow.Builder dataRowBuilder = DataRow.newBuilder();
        // some row processing
        tableDataBuilder.addDataRows(dataRowBuilder);
        // every 10,000 rows: write the chunk out
        if (++rowsInChunk == 10000) {
            tableDataBuilder.build().writeTo(output); // write to stream
            // assign a new message for the next file
            tableDataBuilder = TableData.newBuilder().setName(tableName);
            rowsInChunk = 0;
        }
    }
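For anyone who wants to reproduce the control flow without our generated classes, the chunking logic can be sketched with plain Java stand-ins. This is only a sketch under assumptions: Chunk is a hypothetical stand-in for TableData.Builder, the rows are plain strings rather than DataRow messages, and a list stands in for the output files. It also flushes the last partial chunk, which the snippet above leaves out.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedWriterSketch {
    // Hypothetical stand-in for TableData.Builder: it only collects rows.
    public static final class Chunk {
        public final String tableName;
        public final List<String> rows = new ArrayList<>();
        public Chunk(String tableName) { this.tableName = tableName; }
    }

    // Splits rows into chunks of at most chunkSize rows, starting a fresh
    // Chunk after each flush and flushing the final partial chunk at the end.
    public static List<Chunk> writeInChunks(String tableName, Iterable<String> rows, int chunkSize) {
        List<Chunk> written = new ArrayList<>();     // stands in for the output files
        Chunk current = new Chunk(tableName);
        for (String row : rows) {
            current.rows.add(row);                   // analogous to addDataRows(...)
            if (current.rows.size() == chunkSize) {
                written.add(current);                // analogous to build().writeTo(output)
                current = new Chunk(tableName);      // new message for the next file
            }
        }
        if (!current.rows.isEmpty()) {
            written.add(current);                    // do not lose the trailing rows
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 25; i++) rows.add("row-" + i);
        for (Chunk c : writeInChunks("demo_table", rows, 10)) {
            System.out.println(c.tableName + ": " + c.rows.size() + " rows");
        }
    }
}
```

The point of the sketch is that nothing is carried over from one chunk to the next: every file starts from a new, empty builder.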
Can you suggest anything that would improve the protobuf processing here?
Most of the time is spent in the call
tableDataBuilder.addDataRows(dataRowBuilder), which happens for each row.
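In case it helps to reproduce the measurement, one way to attribute time to a single call is to accumulate System.nanoTime() deltas around it and reset the total for each file. The sketch below is an assumption about how this could be done, not our exact instrumentation: CallTimer is a hypothetical helper, and an ArrayList stands in for the builder so that the example runs without the generated protobuf classes.

```java
import java.util.ArrayList;
import java.util.List;

public class CallTimer {
    private long totalNanos = 0;
    private long calls = 0;

    // Runs the action and adds its elapsed wall-clock time to the running total.
    public void time(Runnable action) {
        long start = System.nanoTime();
        try {
            action.run();
        } finally {
            totalNanos += System.nanoTime() - start;
            calls++;
        }
    }

    public long calls() { return calls; }

    public double totalMillis() { return totalNanos / 1_000_000.0; }

    // Clears the counters, e.g. at the start of each chunk/file.
    public void reset() { totalNanos = 0; calls = 0; }

    public static void main(String[] args) {
        CallTimer addTimer = new CallTimer();
        List<String> standIn = new ArrayList<>();   // hypothetical stand-in for the builder
        for (int i = 0; i < 10_000; i++) {
            final String row = "row-" + i;
            addTimer.time(() -> standIn.add(row));  // would wrap addDataRows(...) instead
        }
        System.out.printf("%d calls, %.3f ms total%n", addTimer.calls(), addTimer.totalMillis());
    }
}
```

Logging the per-file total next to the file index would show whether the time inside the call itself grows from file to file, or whether the growth comes from somewhere else in the loop.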
Here is the proto message:
message TableData {
  required string name = 1; // Name of the database table
  repeated ColNameDbType colNameDbType = 2; // Column name and column DB type mapping
  repeated DataRow dataRows = 3; // Table data rows
  message DataRow {
    repeated ColNameRowData colNameRowData = 1;
    message ColNameRowData {
      required string colName = 1; // column name
      required DbType colDbType = 2; // column DB type
      optional string data = 3; // using string for all types except bool
      optional bool boolData = 4; // this field gets populated if the column DB datatype is bool
      optional bytes blobData = 5;
    }
  }
  message ColNameDbType {
    required string name = 1;
    required DbType type = 2;
  }
  enum DbType {
    BIGINT = 0;
    BIT = 1;
    INT = 2;
    VARCHAR = 3;
    DATE = 4;
    SMALLINT = 5;
    SMALLINT_UNSIGNED = 6;
    TIMESTAMP = 7;
    BLOB = 8;
    DATETIME = 9;
    TINYINT = 10;
    TINYINT_UNSIGNED = 11;
    CHAR = 12;
    INTEGER = 13;
    LONGVARCHAR = 14;
    DECIMAL = 15;
    BIGINT_UNSIGNED = 16;
    DOUBLE = 17;
    LONGBLOB = 18;
    VARBINARY = 19;
    VARCHAR2 = 20; // Oracle specific
    NUMBER = 21; // Oracle specific
    CLOB = 22; // Oracle specific
    IMAGE = 23; // It's a blob (SQL Server)
    NUMERIC = 24; // SQL Server specific
    DATETIME2 = 25; // SQL Server specific
    FLOAT = 26; // SQL Server specific
    NVARCHAR = 27; // SQL Server specific
    INT2 = 28; // Postgres specific
    INT8 = 29; // Postgres specific
    INT4 = 30; // Postgres specific
    BOOL = 31; // Postgres specific
    BYTEA = 32; // Postgres specific
    TEXT = 33; // Postgres specific
    FLOAT8 = 34; // Postgres specific
    BPCHAR = 35; // Postgres specific
    RAW = 36; // Oracle equivalent of VARBINARY in MySQL
    BINARY = 37; // MSSQL equivalent of VARBINARY in MySQL
    UNKNOWN = 38;
  }
}
Thank you,