Hi folks,

The BAM data archival feature deletes the archived rows from the original Cassandra CF. But in Cassandra, a delete doesn't immediately wipe out all traces of the removed data. Instead, Cassandra replaces the data with a special marker called a tombstone. In effect, the row id is kept while all of its column values read back as null. So after archiving, if someone runs a Hive script on that CF, those rows with a row id but all-null columns trigger exceptions when writing to the RDBMS.
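To make the failure mode concrete, here is a minimal sketch (in Python for brevity, with hypothetical row/column names) of what the Hive-to-RDBMS step effectively runs into: a tombstoned row keeps its row id, but every column value comes back as null.

```python
# Hypothetical rows read back from the CF after archival: each row is
# (row_id, {column: value}); a deleted row survives as a tombstone whose
# row id is intact but whose column values are all None.
rows = [
    ("row-1", {"timestamp": "1396305600000", "payload": "event-a"}),
    ("row-2", {"timestamp": None, "payload": None}),  # tombstoned row
]

def to_rdbms_values(columns):
    # The RDBMS writer expects a numeric timestamp, so a tombstoned
    # row fails right here with a TypeError.
    return (int(columns["timestamp"]), columns["payload"])

for row_id, columns in rows:
    try:
        print(row_id, "->", to_rdbms_values(columns))
    except TypeError:
        print(row_id, "-> write failed: null values (tombstone)")
```

Checking a mandatory column such as timestamp for null before the write is essentially what a programmatic skip of these rows would amount to.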
This is actually a feature in Cassandra that supports eventual consistency across replicas. The data can't be removed outright on delete; instead, a marker (tombstone) is written to indicate the value's new status. On the first compaction that covers both the data and the tombstone, the data is removed completely and the corresponding disk space reclaimed. There is a property called GCGraceSeconds, configurable per CF, which specifies how long to wait before garbage-collecting tombstones (the default is 10 days). In many deployments this interval can be reduced, and in a single-node cluster it can safely be set to zero.

Considering the above facts, there are a couple of alternatives we can think of:

1. Fine-tune (reduce) GCGraceSeconds and tell users to run their Hive scripts only after that interval has passed since the archival run. But both the Hive scripts and the archival process have their own schedules, so keeping them in sync could get messy.

2. Programmatically check for the Cassandra null values (using a mandatory column such as timestamp) and skip those rows when writing to the RDBMS. But this is a bit tricky when it comes to Cassandra wide-row operations.

Any ideas on this?

Thanks,
Malith

--
Malith Dhanushka
Engineer - Data Technologies
WSO2, Inc. : wso2.com
Mobile : +94 716 506 693
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
