Re: Performance Difference between Batch Insert and Bulk Load

Dong Dai Mon, 01 Dec 2014 07:45:38 -0800

Thanks a lot for the reply, Raj,

I understand they are different. But if we define a batch as UNLOGGED, it 
no longer guarantees an atomic transaction and becomes more like a data 
import tool. As far as I know, a BATCH statement packs several mutations into 
one RPC to save time. Similarly, the Bulk Loader packs all the mutations into 
an SSTable file and (I think) may be able to save a lot of time too.
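
To make concrete what I mean by an unlogged batch, here is a minimal sketch in Python. The helper build_unlogged_batch, the table demo.events, and its columns are made up for illustration and are not part of any driver. It only builds the CQL text (with bind markers, as for a prepared statement) and the flat parameter list; actually sending it to a cluster would need a real driver.

```python
def build_unlogged_batch(table, columns, rows):
    """Return (cql, params) for one UNLOGGED BATCH covering all rows.

    Hypothetical helper: it only assembles text and parameters,
    it does not talk to Cassandra.
    """
    if not rows:
        raise ValueError("a batch needs at least one row")
    col_list = ", ".join(columns)
    markers = ", ".join("?" for _ in columns)
    insert = f"INSERT INTO {table} ({col_list}) VALUES ({markers});"

    lines = ["BEGIN UNLOGGED BATCH"]
    params = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row width does not match column list")
        lines.append("  " + insert)   # one mutation per row
        params.extend(row)            # values in bind-marker order
    lines.append("APPLY BATCH;")
    return "\n".join(lines), params


# Assumed example table and data, for illustration only.
cql, params = build_unlogged_batch(
    "demo.events", ["id", "payload"], [(1, "a"), (2, "b")]
)
print(cql)
```

All the mutations travel in a single statement, which is where I assume the saving in round trips comes from.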


What I would like to know is whether, on the coordinator server, Batch Insert 
and Bulk Loader are handled similarly. I mean, are they implemented in a 
similar way?

P.S. As a test, I tried randomly inserting 1000 rows into a simple table on my 
laptop. Synchronous inserts take almost 2s to finish, but a synchronous batch 
insert takes only about 900ms. That is a huge performance improvement; is this 
expected?

Also, I used CQLSSTableWriter to put these 1000 insertions into a single 
SSTable file; it takes around 2s to finish on my laptop, which seems pretty 
slow.

thanks!
- Dong

> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana 
> <rnambood...@gmail.com> wrote:
> 
> BATCH statement and Bulk Load are totally different things. The BATCH 
> statement belongs to the atomic transaction space: it provides a way to make 
> more than one statement into an atomic unit, whereas the bulk loader provides 
> the ability to bulk load external data into a cluster. The two are totally 
> different things and cannot be compared. 
> 
> Thanks
> -Raj
> 
> On 01-Dec-2014, at 4:32 am, Dong Dai <daidon...@gmail.com> wrote:
> 
>> Hi, all, 
>> 
>> I have a performance question about batch insert and bulk load. 
>> 
>> According to the documentation, to import a large volume of data into 
>> Cassandra, Batch Insert and Bulk Load are both options. Using batch insert is 
>> pretty straightforward, but there is no ‘official’ way to use 
>> Bulk Load to import the data (in this case, I mean data that is generated 
>> online). 
>> 
>> So, I am thinking that clients first use CQLSSTableWriter to create the 
>> SSTable files, then use “org.apache.cassandra.tools.BulkLoader” to import 
>> these SSTables into Cassandra directly. 
>> 
>> The question is: can I expect better performance using the BulkLoader this 
>> way compared with using Batch insert?
>> 
>> I am not very familiar with the implementation of Bulk Load, but I do see a 
>> huge performance improvement using Batch Insert. I really want to know the 
>> upper limit of the write performance. Any comments will be helpful. Thanks!
>> 
>> - Dong
>> 
>
