[ 
https://issues.apache.org/jira/browse/CASSANDRA-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew F. Dennis updated CASSANDRA-1278:
-----------------------------------------

    Attachment: 1278-cassandra-0.7.txt

Attached patch implements a Cassandra Pipelined Thrift (CPT) loader.

Thrift objects are serialized over a socket to C* in "segments".  Once a 
segment is loaded, C* responds with a CPTResponse object detailing counts of 
what was loaded, and the connection is closed.  Rows and/or columns need not 
be sorted.  No throttling is needed - the client can safely write data to the 
socket as fast as it will accept it.  CPT correctly handles secondary indexes.

The format of sending a segment is:

PROTOCOL_MAGIC (i.e. MessagingService.PROTOCOL_MAGIC)
PROTOCOL_HEADER (i.e. IncomingCassandraPipelinedThriftReader.PROTOCOL_HEADER)
CPTHeader
CPTRowHeader (for each row sent)
Column|SuperColumn (for each Column|SuperColumn in the row)

Each row is terminated by an empty Column or SuperColumn (i.e. 
IncomingCassandraPipelinedThriftReader.END_OF_COLUMNS).  Each segment is 
terminated by an empty CPTRowHeader (i.e. 
IncomingCassandraPipelinedThriftReader.END_OF_ROWS).
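
For illustration, here is a minimal sketch of a client writing one segment in 
this format.  It assumes the patch's Thrift-generated CPTHeader/CPTRowHeader 
classes and terminator constants (imports for those are omitted), that each 
object is serialized with TBinaryProtocol via TSerializer straight onto the 
socket, and a hypothetical CPTRow holder; the exact preamble widths and 
framing in the patch may differ.

{code}
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.List;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class CptSegmentWriter {
    // hypothetical holder pairing a row header with its columns
    public static class CPTRow {
        CPTRowHeader header;
        List<Column> columns;
    }

    private final TSerializer ser = new TSerializer(new TBinaryProtocol.Factory());

    public void sendSegment(String host, int port, CPTHeader header, List<CPTRow> rows)
            throws Exception {
        try (Socket socket = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream())) {
            out.writeInt(MessagingService.PROTOCOL_MAGIC);  // assumed 4-byte magic
            out.writeInt(IncomingCassandraPipelinedThriftReader.PROTOCOL_HEADER);
            out.write(ser.serialize(header));                // CPTHeader
            for (CPTRow row : rows) {
                out.write(ser.serialize(row.header));        // CPTRowHeader
                for (Column c : row.columns)
                    out.write(ser.serialize(c));             // Column|SuperColumn
                // empty Column terminates the row
                out.write(ser.serialize(IncomingCassandraPipelinedThriftReader.END_OF_COLUMNS));
            }
            // empty CPTRowHeader terminates the segment
            out.write(ser.serialize(IncomingCassandraPipelinedThriftReader.END_OF_ROWS));
            out.flush();
            // C* replies with a CPTResponse and closes the connection;
            // response handling is omitted here
        }
    }
}
{code}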

A CPTHeader consists of several fields:

{code}
struct CPTHeader {
   1: required string keyspace,
   2: required string column_family,
   3: required i32 table_flush_size,
   4: required i32 forward_frame_size,
   5: required i32 so_rcvbuf_size,
   6: required i32 so_sndbuf_size,
   7: required bool forward,
}
{code}

table_flush_size controls how much data is buffered before the mutations are 
applied.  Several KB is a good starting value.

forward_frame_size applies only when forward=true and controls the thrift 
frame size used when forwarding data to other nodes.  256K is a good value 
here, but it should be at most half the size of so_rcvbuf_size and 
so_sndbuf_size.

so_rcvbuf_size and so_sndbuf_size are the sizes of the socket buffers used by 
C* when accepting and forwarding CPT data, respectively.  Each should be at 
least twice as big as forward_frame_size and/or the frame size used when 
sending CPT data.  Values > 128K usually require raising 
net.core.rmem_max/net.core.wmem_max.

forward controls whether C* should forward data to other nodes.
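
Putting that together, a CPTHeader populated with the starting values 
suggested above might look like the following sketch (the keyspace/column 
family names and exact numbers are placeholders; the setters are the ones 
thrift generates from the struct above):

{code}
CPTHeader header = new CPTHeader();
header.setKeyspace("Keyspace1");            // placeholder names
header.setColumn_family("Standard1");
header.setTable_flush_size(16 * 1024);      // several KB buffered per flush
header.setForward_frame_size(256 * 1024);   // at most 1/2 of the buffer sizes below
header.setSo_rcvbuf_size(512 * 1024);       // >= 2x forward_frame_size; > 128K needs
header.setSo_sndbuf_size(512 * 1024);       //   net.core.rmem_max/wmem_max raised
header.setForward(true);                    // let C* forward data to other nodes
{code}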

bin/generatecptdata will produce test data with the parameters given in 
bin/generatecptdata.config.

bin/listcptdetails will list information about file(s)/dir(s) given as 
arguments, including number of rows and size of the useful raw data (versus 
overhead/packaging).

bin/loadcptdata will load CPT files into a cluster and serves as a good 
starting point for doing this from other applications.

Thrift generates a *lot* of garbage, so it is trivial to max out JVM GC with 
the current implementation.  Tuning GC for your load is a good idea for best 
performance (smallish memtables and a larger newgen are a good place to 
start).

After a CPT load it is important to flush (e.g. with nodetool flush), as CPT 
loading skips the commitlog (which also implies that the unit of retry is the 
segment, not individual rows or mutations).

Best performance, especially on larger clusters, will be achieved by 
partitioning your data into segments such that a given segment corresponds to 
precisely one range in the cluster (see loadcptdata for an example of how to 
do this), then sending each segment to each of the replicas for that range 
over a separate connection/thread with forward=false (note that loadcptdata 
does *not* do this; it assumes segments can contain data for any node).  A 
sketch of this approach follows the list below.  This provides two important 
benefits:

1) it allows all nodes at all times to make as much progress as possible 
independent of other nodes/replicas slowing down (e.g. compaction, GC).

2) it allows you to retry precisely the failed segments on precisely the failed 
nodes if a node fails during a CPT load (more generally, it lets you reason 
about what data was loaded where instead of depending on anti-entropy, read 
repair, or hinted handoff).
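
Here is that per-range strategy as a hedged sketch.  describe_ring() is the 
real thrift call for fetching ranges and their replica endpoints; 
splitByRange(), headerFor() and CPT_PORT are hypothetical, and 
CptSegmentWriter is the earlier sketch.

{code}
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.TokenRange;

public abstract class PerRangeLoader {
    static final int CPT_PORT = 9171;  // hypothetical; whatever port the CPT listener uses

    // hypothetical helpers: keep only rows whose token falls in the range,
    // and build a CPTHeader as sketched above
    abstract List<CptSegmentWriter.CPTRow> splitByRange(
            List<CptSegmentWriter.CPTRow> rows, TokenRange range);
    abstract CPTHeader headerFor(List<CptSegmentWriter.CPTRow> segment);

    public void load(Cassandra.Client client, List<CptSegmentWriter.CPTRow> allRows)
            throws Exception {
        List<TokenRange> ring = client.describe_ring("Keyspace1");
        ExecutorService pool = Executors.newFixedThreadPool(ring.size() * 3);
        for (TokenRange range : ring) {
            List<CptSegmentWriter.CPTRow> segment = splitByRange(allRows, range);
            // one connection/thread per replica, nothing forwarded
            for (String replica : range.getEndpoints()) {
                pool.submit(() -> {
                    CPTHeader h = headerFor(segment);
                    h.setForward(false);  // each replica gets its own copy
                    new CptSegmentWriter().sendSegment(replica, CPT_PORT, h, segment);
                    return null;
                });
            }
        }
        pool.shutdown();
        // failed tasks identify exactly which segment/replica pairs to retry
    }
}
{code}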


> Make bulk loading into Cassandra less crappy, more pluggable
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-1278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Jeremy Hanna
>            Assignee: Matthew F. Dennis
>             Fix For: 0.7.1
>
>         Attachments: 1278-cassandra-0.7.txt
>
>   Original Estimate: 40h
>  Remaining Estimate: 40h
>
> Currently bulk loading into Cassandra is a black art.  People are either 
> directed to just do it responsibly with thrift or a higher level client, or 
> they have to explore the contrib/bmt example - 
> http://wiki.apache.org/cassandra/BinaryMemtable  That contrib module requires 
> delving into the code to find out how it works and then applying it to the 
> given problem.  Using either method, the user also needs to keep in mind that 
> overloading the cluster is possible - which will hopefully be addressed in 
> CASSANDRA-685.
> This improvement would be to create a contrib module or set of documents 
> dealing with bulk loading.  Perhaps it could include code in the Core to make 
> it more pluggable for external clients of different types.
> It is just that this is something that many who are new to Cassandra need to 
> do - bulk load their data into Cassandra.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
