Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Tomas Bartalos Thu, 27 Dec 2018 13:21:39 -0800

Hello,

The problem is I can't know the combination of set/unset values. From my
perspective every value should be set. The event from Kafka represents the
complete state of the happening at certain point in time. In my table I
want to store the latest event so the most recent state of the happening
(in this table I don't care about the history). Actually I used wrong
expression since its just the opposite of "incremental update", every event
carries all data (state) for specific point of time.


The event is represented with nested json structure. Top level elements of
the json are table fields with type like text, boolean, timestamp, list and
the nested elements are UDT fields.

Simplified example:
There is a new purchase for the happening, event:
{total_amount: 50, items : [A, B, C, new_item], purchase_time : '2018-12-27
13:30', specials: null, customer : {... }, fare_amount,...}
I don't know what actually happened for this event, maybe there is a new
item purchased, maybe some customer info have been changed, maybe the
specials have been revoked and I have to reset them. I just need to store
the state as it artived from Kafka, there might already be an event for
this happening saved before, or maybe this is the first one.

BR,
Tomas


On Thu, 27 Dec 2018, 9:36 pm Eric Stevens <migh...@gmail.com wrote:

> Depending on the use case, creating separate prepared statements for each
> combination of set / unset values in large INSERT/UPDATE statements may be
> prohibitive.
>
> Instead, you can look into driver level support for UNSET values.
> Requires Cassandra 2.2 or later IIRC.
>
> See:
> Java Driver:
> https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
> Python Driver:
> https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
> Node Driver:
> https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset
>
> On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R <
> sean_r_dur...@homedepot.com> wrote:
>
>> You say the events are incremental updates. I am interpreting this to
>> mean only some columns are updated. Others should keep their original
>> values.
>>
>> You are correct that inserting null creates a tombstone.
>>
>> Can you only insert the columns that actually have new values? Just skip
>> the columns with no information. (Make the insert generator a bit smarter.)
>>
>> Create table happening (id text primary key, event text, a text, b text,
>> c text);
>> Insert into table happening (id, event, a, b, c) values ("MainEvent","The
>> most complete info we have right now","Priceless","10 pm","Grand Ballroom");
>> -- b changes
>> Insert into happening (id, b) values ("MainEvent","9:30 pm");
>>
>>
>> Sean Durity
>>
>>
>> -----Original Message-----
>> From: Tomas Bartalos <tomas.barta...@gmail.com>
>> Sent: Thursday, December 27, 2018 9:27 AM
>> To: user@cassandra.apache.org
>> Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values
>>
>> Hello,
>>
>> I’d start with describing my use case and how I’d like to use Cassandra
>> to solve my storage needs.
>> We're processing a stream of events for various happenings. Every event
>> have a unique happening_id.
>> One happening may have many events, usually ~ 20-100 events. I’d like to
>> store only the latest event for the same happening (Event is an incremental
>> update and it contains all up-to date data about happening).
>> Technically the events are streamed from Kafka, processed with Spark an
>> saved to Cassandra.
>> In Cassandra we use upserts (insert with same primary key).  So far so
>> good, however there comes the tombstone...
>>
>> When I’m inserting field with NULL value, Cassandra creates tombstone for
>> this field. As I understood this is due to space efficiency, Cassandra
>> doesn’t have to remember there is a NULL value, she just deletes the
>> respective column and a delete creates a ... tombstone.
>> I was hoping there could be an option to tell Cassandra not to be so
>> space effective and store “unset" info without generating tombstones.
>> Something similar to inserting empty strings instead of null values:
>>
>> CREATE TABLE happening (id text PRIMARY KEY, event text); insert into
>> happening (‘1’, ‘event1’); — tombstone is generated insert into happening
>> (‘1’, null); — tombstone is not generated insert into happening (‘1’, '’);
>>
>> Possible solutions:
>> 1. Disable tombstones with gc_grace_seconds = 0 or set to reasonable low
>> value (1 hour ?) . Not good, since phantom data may re-appear 2. ignore
>> NULLs on spark side with “spark.cassandra.output.ignoreNulls=true”. Not
>> good since this will never overwrite previously inserted event field with
>> “empty” one.
>> 3. On inserts with spark, find all NULL values and replace them with
>> “empty” equivalent (empty string for text, 0 for integer). Very inefficient
>> and problematic to find “empty” equivalent for some data types.
>>
>> Until tombstones appeared Cassandra was the right fit for our use case,
>> however now I’m not sure if we’re heading the right direction.
>> Could you please give me some advice how to solve this problem ?
>>
>> Thank you,
>> Tomas
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>> ________________________________
>>
>> The information in this Internet Email is confidential and may be legally
>> privileged. It is intended solely for the addressee. Access to this Email
>> by anyone else is unauthorized. If you are not the intended recipient, any
>> disclosure, copying, distribution or any action taken or omitted to be
>> taken in reliance on it, is prohibited and may be unlawful. When addressed
>> to our clients any opinions or advice contained in this Email are subject
>> to the terms and conditions expressed in any applicable governing The Home
>> Depot terms of business or client engagement letter. The Home Depot
>> disclaims all responsibility and liability for the accuracy and content of
>> this attachment and for any damages or losses arising from any
>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>> items of a destructive nature, which may be contained in this attachment
>> and shall not be liable for direct, indirect, consequential or special
>> damages in connection with this e-mail message or its attachment.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

Reply via email to