On Wed, Oct 23, 2013 at 6:06 PM, Maninda Edirisooriya <[email protected]> wrote:

> Hi Malith,
>
> Your first approach does not seem usable, as we cannot enforce the
> schedules of each Hive script and of the archiving. The only workable
> solution seems to be the second one, which can be done programmatically.
> For the wide-row implementation, two ways are possible.
>
> 1. Check the value of each field and discard null tuples programmatically
> when storing into the RDB.
> 2. Check a mandatory field value (like Stream Name, Stream Version,
> timestamp) for the row key each time a tuple is about to be stored into
> the RDB, and skip the insertion if the mandatory field is null. The
> reason is that mandatory fields become null only when the row is deleted.
>
> I am not sure about the assumptions we made here about null values, as
> there may be some other reason for Cassandra to return null rows. We had
> better check again to confirm the cause.
>

Yes, option 2 looks better. Exposing tombstones to clients is questionable
behaviour on Cassandra's side and seems fundamentally incorrect. I will ask
the Cassandra mailing list for clarification on that.
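
A minimal sketch of option 2, assuming each tuple reaches the writer as a
dictionary of column name to value. The column names and the helpers
is_tombstoned and filter_live_rows are hypothetical, used here only for
illustration; they are not BAM, Hive, or Cassandra APIs.

```python
# Hypothetical mandatory columns; a live row is assumed to carry all of them.
MANDATORY_FIELDS = ("stream_name", "stream_version", "timestamp")


def is_tombstoned(row):
    """Return True when any mandatory field is missing or null.

    Assumption from the thread: mandatory fields become null only
    when the row has been deleted in Cassandra.
    """
    return any(row.get(field) is None for field in MANDATORY_FIELDS)


def filter_live_rows(rows):
    """Yield only the rows that are safe to insert into the RDBMS."""
    for row in rows:
        if not is_tombstoned(row):
            yield row


# Illustrative data: k1 is a live row with a null optional column,
# k2 looks like a deleted row (row key present, all columns null).
rows = [
    {"row_key": "k1", "stream_name": "s", "stream_version": "1.0.0",
     "timestamp": 1382531160, "payload": None},
    {"row_key": "k2", "stream_name": None, "stream_version": None,
     "timestamp": None, "payload": None},
]
live = list(filter_live_rows(rows))
```

Only the mandatory columns are checked, so a live row with a null optional
column (k1 here) is still written, while k2 is skipped.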


> *Maninda Edirisooriya*
> Software Engineer
> *WSO2, Inc.*
> lean.enterprise.middleware.
>
> *Blog* : http://maninda.blogspot.com/
> *Phone* : +94 777603226
>
>
> On Wed, Oct 23, 2013 at 3:51 PM, Malith Dhanushka <[email protected]> wrote:
>
>> Hi folks,
>>
>> The BAM data archival feature deletes the respective rows from the
>> original Cassandra CF while archiving. But in Cassandra a delete
>> operation does not immediately wipe out all traces of the data being
>> removed. Instead, Cassandra replaces the data with a special value called
>> a tombstone. So basically it keeps the row id with null values in the
>> columns. After archiving, if someone runs a Hive script on that CF, it
>> triggers exceptions for row ids with null values when writing to the
>> RDBMS.
>>
>> But it seems this is a deliberate feature of Cassandra that supports
>> eventual consistency of data across replicas. The data can't actually be
>> removed when we perform a delete; instead, a marker (tombstone) is written
>> to indicate the value's new status. On the first compaction that occurs
>> between the data and the tombstone, the data will be removed completely
>> and the corresponding disk space recovered. There is a property called
>> GCGraceSeconds, which can be defined per CF, that specifies the time to
>> wait before garbage collecting tombstones (default value is 10 days). In
>> many deployments this interval can be reduced, and in a single-node
>> cluster it can be safely set to zero.
>>
>> Considering the above facts, there are a couple of alternatives we can
>> think of:
>>
>> 1. We can fine-tune and reduce the value of GCGraceSeconds and tell users
>> to run the Hive scripts only after that interval has passed since the
>> archival process ran. But both Hive scripts and archiving have their own
>> scheduling, so keeping them in sync later on might get messy.
>>
>> 2. Programmatically check for Cassandra null values (considering a
>> mandatory column like timestamp) and skip those rows when writing to the
>> RDBMS. But this is a bit tricky when it comes to Cassandra wide-row
>> operations.
>>
>> Any ideas on this?
>>
>> Thanks,
>> Malith
>> --
>> Malith Dhanushka
>> Engineer - Data Technologies
>> *WSO2, Inc. : wso2.com*
>> *Mobile*          : +94 716 506 693
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>


-- 
Malith Dhanushka
Engineer - Data Technologies
*WSO2, Inc. : wso2.com*
*Mobile*          : +94 716 506 693
