We are in product development, and batch size depends on the customer buying 
our product. Large customers may have huge batches while small customers may 
have much smaller ones. So we don't know upfront how many buckets per batch 
will be required, and we don't want to ask our customers for additional 
configuration such as an average batch size. That is why we are planning to 
use dynamic bucketing. Every row in the primary table is associated with only 
one batch.
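
Taking up Sean's suggestion below, here is a rough CQL sketch of the tables I 
have in mind. All table and column names are illustrative, and the primary 
table is reduced to the columns that matter for the index, so please read it 
as a sketch rather than the final schema:

-- Rough CQL sketch; all names are illustrative.

-- Primary table (simplified): each row belongs to exactly one batch.
CREATE TABLE primary_data (
    row_key  text PRIMARY KEY,
    batch_id text,
    payload  blob
);

-- Manual index: one partition per (batch_id, bucket); each partition is
-- kept at roughly 50 MB by growing the bucket count as the batch grows.
CREATE TABLE batch_index (
    batch_id text,
    bucket   int,                       -- 0 .. bucket_count - 1
    row_key  text,
    PRIMARY KEY ((batch_id, bucket), row_key)
);

-- Metadata: current number of buckets per batch, grown in steps of 10.
CREATE TABLE batch_metadata (
    batch_id     text PRIMARY KEY,
    bucket_count int
);

-- Approximate ingested size per batch, used to decide when to grow.
-- Kept in its own table because counter columns cannot share a table
-- with regular columns.
CREATE TABLE batch_size (
    batch_id     text PRIMARY KEY,
    approx_bytes counter
);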


Comments required on the following:

1. I would like to know any suggestions on the proposed design.

2. What is the best approach for updating/deleting from the index table? When 
a row is manually purged from the primary table, we do not know which of the 
x buckets created for its batch id contains that row key.
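
To make question 2 concrete: with the schema sketched above, a purge currently 
has to fan out across every bucket of the batch, because we do not know which 
bucket holds the row key:

-- bucket_count is read from batch_metadata first.
DELETE FROM batch_index WHERE batch_id = 'b1' AND bucket = 0 AND row_key = 'row-123';
DELETE FROM batch_index WHERE batch_id = 'b1' AND bucket = 1 AND row_key = 'row-123';
-- ... one DELETE per bucket, up to bucket_count - 1

One possibility (my assumption, not part of the proposed design) would be to 
record the chosen bucket on the primary row at insert time, so a purge could 
delete the exact index entry with a single statement; whether that trade-off 
is acceptable is part of what I am asking.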

 

Thanks

Anuj

Sent from Yahoo Mail on Android

From:"sean_r_dur...@homedepot.com" <sean_r_dur...@homedepot.com>
Date:Fri, 24 Jul, 2015 at 5:39 pm
Subject:RE: Manual Indexing With Buckets

It is a bit hard to follow. Perhaps you could include your proposed schema 
(annotated with your size predictions) to spur more discussion. To me, it 
sounds a bit convoluted. Why is a “batch” so big (up to 100 million rows)? Is a 
row in the primary only associated with one batch?

 

 

Sean Durity – Cassandra Admin, Big Data Team

To engage the team, create a request

 

From: Anuj Wadehra [mailto:anujw_2...@yahoo.co.in] 
Sent: Friday, July 24, 2015 3:57 AM
To: user@cassandra.apache.org
Subject: Re: Manual Indexing With Buckets

 

Can anyone take this one?

 

Thanks

Anuj

Sent from Yahoo Mail on Android

From:"Anuj Wadehra" <anujw_2...@yahoo.co.in>
Date:Thu, 23 Jul, 2015 at 10:57 pm
Subject:Manual Indexing With Buckets

We have a primary table, and we need the ability to search by the batchid 
column, so we are creating a manual index table for lookups by batch id. We 
are using buckets to cap each row in the batch id index table at roughly 
50 MB. Because batch size may vary drastically (one batch id may be 
associated with 100K row keys in the primary table while another may be 
associated with 100 million), we are creating a metadata table that tracks 
the approximate data size as rows are inserted for a batch in the primary 
table, so that the batch id index table can have a dynamic number of 
buckets/rows. As more data is inserted for a batch in the primary table, a 
new set of 10 buckets is added. At any point in time, clients write to the 
latest 10 buckets created for a batch id's index in round robin to avoid 
hotspots.
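
To illustrate the write path under a schema like the one sketched at the top 
of this thread (the modulo scheme and the size estimate below are assumptions 
for illustration, not settled details):

-- Client-side round-robin write, one assumed way it could work.

-- 1. Read the current bucket count for the batch.
SELECT bucket_count FROM batch_metadata WHERE batch_id = 'b1';

-- 2. Pick one of the latest 10 buckets with a local rotating counter n:
--      bucket = (bucket_count - 10) + (n % 10)

-- 3. Write the index entry into the chosen bucket.
INSERT INTO batch_index (batch_id, bucket, row_key)
VALUES ('b1', 12, 'row-123');

-- 4. Track the approximate ingested size; once the latest set of 10
--    buckets approaches 10 x 50 MB, add 10 more buckets by bumping
--    bucket_count in batch_metadata.
UPDATE batch_size SET approx_bytes = approx_bytes + 200 WHERE batch_id = 'b1';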

 

Comments required on the following:

1. I would like to know any suggestions on the above design.

 

2. What is the best approach for updating/deleting from the index table? When 
a row is manually purged from the primary table, we do not know which of the 
x buckets created for its batch id contains that row key.

 

Thanks

Anuj

Sent from Yahoo Mail on Android

 


