Hi Malith,

Your first approach does not seem very usable, as we cannot enforce the
schedule of each Hive script or the schedule of archiving. The only
workable solution seems to be the second one, which can be done
programmatically. For the wide-row implementation, two ways are possible.

1. Check the value of each field and discard null tuples programmatically
when storing into the RDB.
2. Check a mandatory field value (such as Stream Name, Stream Version, or
timestamp) for the same row key each time a tuple is about to be stored
into the RDB, and skip the insertion if the mandatory field is null. The
reasoning is that these fields become null only when the row is deleted.
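
To make the two checks concrete, here is a rough sketch in Python. It is
only an illustration under assumptions, not the actual BAM or Hive code:
the column names (row_key, stream_name, stream_version, timestamp), the
dict-based row shape, and the helper names are all hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

# Hypothetical row shape: column name -> value, where None stands for a null column.
Row = Dict[str, Optional[Any]]

# Hypothetical mandatory columns; the real names depend on the stream definition.
MANDATORY_FIELDS = ("stream_name", "stream_version", "timestamp")


def is_all_null(row: Row, key_column: str = "row_key") -> bool:
    """Way 1: every field other than the row key is null."""
    return all(value is None for name, value in row.items() if name != key_column)


def is_missing_mandatory(row: Row) -> bool:
    """Way 2: at least one mandatory field is null or absent."""
    return any(row.get(name) is None for name in MANDATORY_FIELDS)


def rows_to_store(rows: Iterable[Row]) -> Iterator[Row]:
    """Yield only the rows that should be written to the RDB (way 2)."""
    for row in rows:
        if is_missing_mandatory(row):
            # Assumed to be a deleted (tombstoned) row, so skip the insertion.
            continue
        yield row


if __name__ == "__main__":
    sample = [
        {"row_key": "k1", "stream_name": "s", "stream_version": "1.0.0",
         "timestamp": 1382523660000, "payload": "x"},
        {"row_key": "k2", "stream_name": None, "stream_version": None,
         "timestamp": None, "payload": None},
    ]
    for row in rows_to_store(sample):
        print(row["row_key"])
```

Way 2 only looks at a few columns per tuple, so it should be cheaper than
way 1, but it relies on the assumption that a mandatory field is never null
in a live row.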

I am not sure about the assumptions we have made here about null values, as
there may be other reasons for Cassandra to return null rows. We had better
check again to confirm the course of action.

*Maninda Edirisooriya*
Software Engineer
*WSO2, Inc.*
lean.enterprise.middleware.

*Blog* : http://maninda.blogspot.com/
*Phone* : +94 777603226


On Wed, Oct 23, 2013 at 3:51 PM, Malith Dhanushka <[email protected]> wrote:

> Hi folks,
>
> The BAM data archival feature deletes the respective rows from the
> original Cassandra CF while archiving. But in Cassandra, a delete
> operation does not immediately wipe out all traces of the data being
> removed. Instead, Cassandra replaces the data with a special value called
> a tombstone. So basically it keeps the row id with null values in the
> columns. After archiving, if someone runs a Hive script on that CF, it
> triggers exceptions for row ids with null values when writing to the
> RDBMS.
>
> This seems to be a deliberate feature in Cassandra to support eventual
> consistency of data across replicas. The data cannot actually be removed
> when we perform a delete; instead, a marker (tombstone) is written to
> indicate the value's new status. On the first compaction that occurs
> between the data and the tombstone, the data will be removed completely
> and the corresponding disk space recovered. There is a property called
> GCGraceSeconds, which can be defined per CF, that specifies the time to
> wait before garbage collecting tombstones (default value is 10 days). In
> many deployments this interval can be reduced, and in a single-node
> cluster it can be safely set to zero.
>
> Considering the above facts, there are a couple of alternatives we can
> think of:
>
> 1. We can fine-tune and reduce the value of GCGraceSeconds and tell
> users to run the Hive scripts only after that period has elapsed since
> the archival process ran. But both Hive scripts and archiving have their
> own scheduling, so keeping them in sync later on might get messy.
>
> 2. Programmatically check for Cassandra null values (considering a
> mandatory column like timestamp) and skip those rows when writing to the
> RDBMS. But this is a bit tricky when it comes to Cassandra wide-row
> operations.
>
> Any ideas on this?
>
> Thanks,
> Malith
> --
> Malith Dhanushka
> Engineer - Data Technologies
> *WSO2, Inc. : wso2.com*
> *Mobile*          : +94 716 506 693
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>
