Yes, I think we have to check for nulls and ignore those rows.
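
A minimal sketch of that check, written in Python only for brevity. It assumes each row read from the CF arrives as a dict, and that stream name, stream version and timestamp are the mandatory columns; the column names and helper functions here are hypothetical, not BAM's actual schema or API. A row is skipped when any mandatory field is null, following the second option in the quoted discussion below.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence

Row = Dict[str, Any]

# Hypothetical column names; the thread only says "Stream Name,
# Stream Version, timestamp" are candidates for the mandatory check.
MANDATORY_FIELDS = ("stream_name", "stream_version", "timestamp")


def is_deleted_row(row: Row, mandatory: Sequence[str] = MANDATORY_FIELDS) -> bool:
    """Treat a row as deleted if any mandatory field is missing or None."""
    return any(row.get(field) is None for field in mandatory)


def live_rows(rows: Iterable[Row],
              mandatory: Sequence[str] = MANDATORY_FIELDS) -> Iterator[Row]:
    """Yield only the rows that should be written to the RDBMS."""
    for row in rows:
        if not is_deleted_row(row, mandatory):
            yield row


# Example input: one live row (an optional column may still be None)
# and one row whose columns were all nulled by a delete.
rows = [
    {"row_key": "k1", "stream_name": "s", "stream_version": "1.0.0",
     "timestamp": 1382500000, "payload": None},
    {"row_key": "k2", "stream_name": None, "stream_version": None,
     "timestamp": None, "payload": None},
]
kept = [r["row_key"] for r in live_rows(rows)]
```

Checking only the mandatory columns keeps the filter cheap for wide rows, since optional columns can legitimately be null on live data.
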

On Wed, Oct 23, 2013 at 5:36 AM, Maninda Edirisooriya <[email protected]> wrote:

> Hi Malith,
>
> Your first approach does not seem that usable, as we cannot enforce the
> schedule of each Hive script and the schedule of archiving. The only
> solution seems to be the second one, which can be done programmatically.
> When it comes to the wide-row implementation, two ways are possible.
>
> 1. Check the value of each field and discard null tuples programmatically
> when storing into the RDB.
> 2. Check a mandatory field value (like Stream Name, Stream Version,
> timestamp) with the same row key each time a tuple is about to be stored
> into the RDB; if the mandatory field is null, the insertion can be
> avoided. The reason is that these fields become null only when the row is
> deleted.
>
> I am not sure about the assumptions we made here about null values, as
> there may be some other reason for Cassandra to return null rows. We had
> better check again to confirm the cause.
>
>
> *Maninda Edirisooriya*
> Software Engineer
>
> *WSO2, Inc. *lean.enterprise.middleware.
>
> *Blog* : http://maninda.blogspot.com/
> *Phone* : +94 777603226
>
>
> On Wed, Oct 23, 2013 at 3:51 PM, Malith Dhanushka <[email protected]> wrote:
>
>> Hi folks,
>>
>> The BAM data archival feature deletes the respective rows from the
>> original Cassandra CF while archiving. But in Cassandra the delete
>> operation doesn't immediately wipe out all traces of the data being
>> removed. Instead of wiping out data on delete, Cassandra replaces it
>> with a special value called a tombstone. So basically it keeps the row
>> ID with null values in the columns. After archiving, if someone runs a
>> Hive script on that CF, it triggers exceptions for row IDs with null
>> values when writing to the RDBMS.
>>
>> But this is actually a feature of Cassandra that supports eventual
>> consistency of data across replicas. The data can't be removed
>> immediately when we perform a delete; instead, a marker (tombstone) is
>> written to indicate the value's new status. On the first compaction that
>> occurs between the data and the tombstone, the data will be removed
>> completely and the corresponding disk space recovered. There is a
>> property called GCGraceSeconds, which can be defined on a per-CF basis,
>> to specify the time to wait before garbage collecting tombstones (the
>> default value is 10 days). In many deployments this interval can be
>> reduced, and in a single-node cluster it can be safely set to zero.
>>
>> Considering the above facts, there are a couple of alternatives we can
>> think of:
>>
>> 1. We can fine-tune and reduce the value of GCGraceSeconds and tell
>> users to run the Hive scripts only after that time has passed since the
>> archival process ran. But both Hive scripts and archiving have their own
>> scheduling, so keeping them in sync might get messy later on.
>>
>> 2. Programmatically check for Cassandra null values (considering a
>> mandatory column like timestamp) and skip those rows when writing to the
>> RDBMS. But this is a bit tricky when it comes to Cassandra wide-row
>> operations.
>>
>> Any ideas on this?
>>
>> Thanks,
>> Malith
>> --
>> Malith Dhanushka
>> Engineer - Data Technologies
>> *WSO2, Inc. : wso2.com <http://wso2.com/>*
>> *Mobile* : +94 716 506 693
>>
>> _______________________________________________
>> Architecture mailing list
>> [email protected]
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>
>>
>

--
============================
Srinath Perera, Ph.D.
Director, Research, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Member, Apache Software Foundation
Research Scientist, Lanka Software Foundation
Blog: http://srinathsview.blogspot.com/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902
