Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values
> atures/datatypes/nulls/#unset
>
> On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:
>
>> You say the events are incremental updates. I am interpreting this to
>> mean only some columns are updated. Others should keep their original
>> values.
>>
>> You are correct that inserting null creates a tombstone.
>>
>> Can you insert only the columns that actually have new values? Just
>> skip the columns with no information. (Make the insert generator a bit
>> smarter.)
>>
>>   CREATE TABLE happening (id text PRIMARY KEY, event text, a text, b text, c text);
>>   INSERT INTO happening (id, event, a, b, c) VALUES ('MainEvent',
>>     'The most complete info we have right now', 'Priceless', '10 pm', 'Grand Ballroom');
>>   -- b changes
>>   INSERT INTO happening (id, b) VALUES ('MainEvent', '9:30 pm');
>>
>> Sean Durity
>>
>> -----Original Message-----
>> From: Tomas Bartalos
>> Sent: Thursday, December 27, 2018 9:27 AM
>> To: user@cassandra.apache.org
>> Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values
>>
>> Hello,
>>
>> I'd like to start by describing my use case and how I'd like to use
>> Cassandra to solve my storage needs.
>> We're processing a stream of events for various happenings. Every
>> event has a unique happening_id.
>> One happening may have many events, usually ~20-100 events. I'd like
>> to store only the latest event for the same happening (an event is an
>> incremental update and it contains all up-to-date data about the
>> happening).
>> Technically, the events are streamed from Kafka, processed with Spark
>> and saved to Cassandra.
>> In Cassandra we use upserts (insert with the same primary key). So far
>> so good, however here comes the tombstone...
>>
>> When I insert a field with a NULL value, Cassandra creates a tombstone
>> for this field. As I understand it, this is for space efficiency:
>> Cassandra doesn't have to remember there is a NULL value, it just
>> deletes the respective column, and a delete creates a ... tombstone.
>> I was hoping there could be an option to tell Cassandra not to be so
>> space-efficient and to store "unset" info without generating tombstones.
>> Something similar to inserting empty strings instead of null values:
>>
>>   CREATE TABLE happening (id text PRIMARY KEY, event text);
>>   INSERT INTO happening (id, event) VALUES ('1', 'event1');
>>   -- tombstone is generated
>>   INSERT INTO happening (id, event) VALUES ('1', null);
>>   -- tombstone is not generated
>>   INSERT INTO happening (id, event) VALUES ('1', '');
>>
>> Possible solutions:
>> 1. Disable tombstones with gc_grace_seconds = 0, or set it to a
>>    reasonably low value (1 hour?). Not good, since phantom data may
>>    reappear.
>> 2. Ignore NULLs on the Spark side with
>>    "spark.cassandra.output.ignoreNulls=true". Not good, since this will
>>    never overwrite a previously inserted event field with an "empty" one.
>> 3. On inserts with Spark, find all NULL values and replace them with an
>>    "empty" equivalent (empty string for text, 0 for integer). Very
>>    inefficient, and it is problematic to find an "empty" equivalent for
>>    some data types.
>>
>> Until tombstones appeared, Cassandra was the right fit for our use
>> case; now I'm not sure we're heading in the right direction.
>> Could you please give me some advice on how to solve this problem?
>>
>> Thank you,
>> Tomas
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
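Sean's suggestion in this message ("make the insert generator a bit smarter") can be sketched roughly as follows. This is only an illustration in plain Python, not code from the thread: `build_insert` is a hypothetical helper, the `?` placeholders assume the statement would later be prepared by a driver, and table and column names are interpolated without any escaping. As Tomas points out elsewhere in the thread, skipping null columns also means a previously stored value is never reset.

```python
def build_insert(table, row):
    """Build a parameterized INSERT that names only the non-null columns.

    `table` and the keys of `row` are treated as trusted identifiers in
    this sketch; they are placed into the statement text without escaping.
    """
    # Keep only the columns that actually carry a value.
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns to insert")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, list(present.values())


# Column "a" is None, so it is left out of the statement entirely.
cql, params = build_insert("happening", {"id": "MainEvent", "a": None, "b": "9:30 pm"})
print(cql)
print(params)
```

Because the statement text changes with every combination of present columns, each combination would need its own prepared statement, which is the cost Eric raises in his reply.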
Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values
Hello,

I believe your approach is the same as using Spark with
"spark.cassandra.output.ignoreNulls=true".
This will not cover the situation when a value has to be overwritten
with null.

I found one possible solution: change the schema to keep only the primary
key fields and move all other fields to a frozen UDT.
create table (year, month, day, id, frozen, primary key((year, month, day), id) )
In this way anything that is null inside event doesn't create a
tombstone, since event is serialized to a BLOB.
The penalty is the need to deserialize the whole Event when selecting
only a few columns.
Can anyone confirm whether this is a good solution performance-wise?

Thank you,
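The trade-off Tomas describes for the frozen-UDT schema (one opaque value per row, so nulls inside it are not separate cells, but every read decodes the whole value) can be illustrated with a loose analogy. The sketch below is an assumption-heavy stand-in: it uses JSON text in place of Cassandra's actual UDT serialization, and `KEY_COLUMNS`, `to_row` and `read_field` are hypothetical names, not anything from the thread or from a driver.

```python
import json

# Primary key columns taken from the schema in the message above.
KEY_COLUMNS = ("year", "month", "day", "id")


def to_row(event):
    """Split an event into key columns plus one serialized 'event' value."""
    key = {col: event[col] for col in KEY_COLUMNS}
    payload = {k: v for k, v in event.items() if k not in KEY_COLUMNS}
    # Null fields live inside the serialized payload, not as separate columns.
    return {**key, "event": json.dumps(payload, sort_keys=True)}


def read_field(row, field):
    """Reading a single field still requires decoding the whole payload."""
    return json.loads(row["event"])[field]


row = to_row({"year": 2018, "month": 12, "day": 27, "id": "h1",
              "total_amount": 50, "specials": None})
print(sorted(row))
print(read_field(row, "specials"))
```

Whether the real frozen-UDT layout performs acceptably is exactly the question left open in the message; this sketch only shows the shape of the idea.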
Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values
"The problem is I can't know the combination of set/unset values" -->
Just for this requirement, Achilles has had a working solution for many
years using the INSERT_NOT_NULL_FIELDS strategy:

https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy

Or you can use the Update API, which by design only updates non-null
fields:
https://github.com/doanduyhai/Achilles/wiki/Quick-Reference#updating-all-non-null-fields-for-an-entity

Behind the scenes, for each new combination of INSERT INTO table(x,y,z)
statement, Achilles checks its prepared statement cache and, if the
statement does not exist yet, creates a new prepared statement and puts
it into the cache for later reuse.

Disclaimer: I'm the creator of Achilles
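The caching behaviour DuyHai describes in this message, one prepared statement per distinct column combination, created on first use, might look roughly like the sketch below. It is not Achilles code and makes no claim about how Achilles implements this: `StatementCache` is a hypothetical class, and `prepare` is a caller-supplied callable standing in for whatever a driver session uses to prepare a statement.

```python
class StatementCache:
    """Cache one prepared statement per (table, column combination)."""

    def __init__(self, prepare):
        # `prepare` is an assumed callable: CQL text -> prepared statement.
        self._prepare = prepare
        self._cache = {}

    def get(self, table, columns):
        key = (table, tuple(columns))
        if key not in self._cache:
            cols = ", ".join(columns)
            marks = ", ".join("?" for _ in columns)
            cql = f"INSERT INTO {table} ({cols}) VALUES ({marks})"
            # First time this combination is seen: prepare and remember it.
            self._cache[key] = self._prepare(cql)
        return self._cache[key]


prepared_texts = []


def fake_prepare(cql):
    # Stand-in for a real driver call; records what would be prepared.
    prepared_texts.append(cql)
    return ("prepared", cql)


cache = StatementCache(fake_prepare)
cache.get("happening", ["id", "b"])
cache.get("happening", ["id", "b"])       # served from the cache
cache.get("happening", ["id", "a", "b"])  # new combination, prepared once
print(len(prepared_texts))
```

With many nullable columns the number of distinct combinations can grow quickly, which is the concern Eric raises about preparing a statement per combination.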
Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values
Hello,

The problem is that I can't know the combination of set/unset values. From my perspective every value should be set. The event from Kafka represents the complete state of the happening at a certain point in time. In my table I want to store the latest event, i.e. the most recent state of the happening (in this table I don't care about the history).

Actually I used the wrong expression, since it's just the opposite of an "incremental update": every event carries all data (state) for a specific point in time.

The event is represented as a nested JSON structure. Top-level elements of the JSON are table fields with types like text, boolean, timestamp and list, and the nested elements are UDT fields.

Simplified example: there is a new purchase for the happening, event:

{total_amount: 50, items : [A, B, C, new_item], purchase_time : '2018-12-27 13:30', specials: null, customer : {... }, fare_amount,...}

I don't know what actually happened for this event: maybe a new item was purchased, maybe some customer info has been changed, maybe the specials have been revoked and I have to reset them. I just need to store the state as it arrived from Kafka; there might already be an event for this happening saved before, or maybe this is the first one.

BR,
Tomas

On Thu, 27 Dec 2018, 9:36 pm Eric Stevens wrote:

> Depending on the use case, creating separate prepared statements for each
> combination of set / unset values in large INSERT/UPDATE statements may be
> prohibitive.
>
> Instead, you can look into driver level support for UNSET values.
> Requires Cassandra 2.2 or later IIRC.
Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values
Depending on the use case, creating separate prepared statements for each combination of set / unset values in large INSERT/UPDATE statements may be prohibitive.

Instead, you can look into driver level support for UNSET values. Requires Cassandra 2.2 or later IIRC.

See:
Java Driver: https://docs.datastax.com/en/developer/java-driver/3.0/manual/statements/prepared/#parameters-and-binding
Python Driver: https://www.datastax.com/dev/blog/python-driver-2-6-0-rc1-with-cassandra-2-2-features#distinguishing_between_null_and_unset_values
Node Driver: https://docs.datastax.com/en/developer/nodejs-driver/3.5/features/datatypes/nulls/#unset

On Thu, Dec 27, 2018 at 3:21 PM Durity, Sean R wrote:

> You say the events are incremental updates. I am interpreting this to mean
> only some columns are updated. Others should keep their original values.
>
> You are correct that inserting null creates a tombstone.
>
> Can you only insert the columns that actually have new values? Just skip
> the columns with no information. (Make the insert generator a bit smarter.)
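A minimal Python sketch of the null-versus-unset distinction that the driver links above describe. `UNSET` is a local stand-in so the snippet runs without a cluster; with the DataStax Python driver the corresponding sentinel is `cassandra.query.UNSET_VALUE`. The `bind_values` helper and the column names are hypothetical, chosen only for illustration.

```python
# Local stand-in for the driver's "unset" sentinel (assumption: with the
# DataStax Python driver you would pass cassandra.query.UNSET_VALUE instead).
UNSET = object()


def bind_values(columns, event, overwrite_with_null=()):
    """Return bind values for one fixed prepared INSERT.

    Missing or None fields become UNSET (nothing is written, so no
    tombstone) unless the column is listed in overwrite_with_null; in that
    case None is kept, and binding it would write a tombstone that clears
    any previous value.
    """
    values = []
    for col in columns:
        value = event.get(col)
        if value is None and col not in overwrite_with_null:
            values.append(UNSET)
        else:
            values.append(value)
    return values


columns = ["id", "total_amount", "specials"]
event = {"id": "h1", "total_amount": 50, "specials": None}

skipped = bind_values(columns, event)
cleared = bind_values(columns, event, overwrite_with_null={"specials"})
print(skipped[2] is UNSET)  # True
print(cleared[2] is None)   # True
```

The second call is the case raised later in the thread: when a null must really overwrite an older value, leaving it unset is not enough and the tombstone is unavoidable.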
RE: [EXTERNAL] Howto avoid tombstones when inserting NULL values
You say the events are incremental updates. I am interpreting this to mean only some columns are updated. Others should keep their original values.

You are correct that inserting null creates a tombstone.

Can you only insert the columns that actually have new values? Just skip the columns with no information. (Make the insert generator a bit smarter.)

Create table happening (id text primary key, event text, a text, b text, c text);
Insert into happening (id, event, a, b, c) values ('MainEvent','The most complete info we have right now','Priceless','10 pm','Grand Ballroom');
-- b changes
Insert into happening (id, b) values ('MainEvent','9:30 pm');

Sean Durity

-----Original Message-----
From: Tomas Bartalos
Sent: Thursday, December 27, 2018 9:27 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Howto avoid tombstones when inserting NULL values

> I'd like to store only the latest event for the same happening (Event is
> an incremental update and it contains all up-to date data about happening).

The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
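A rough Python sketch of the "smarter insert generator" idea above: name only the columns that actually carry a value, so no null is ever bound. The `build_insert` helper is hypothetical, and it assumes table and column names come from trusted code rather than user input, since they are interpolated into the statement text.

```python
def build_insert(table, row):
    """Build an INSERT that names only the columns with non-None values.

    Returns the CQL text with '?' placeholders plus the matching parameters.
    """
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("nothing to insert")
    cols = ", ".join(present)
    marks = ", ".join("?" for _ in present)
    return f"INSERT INTO {table} ({cols}) VALUES ({marks})", list(present.values())


cql, params = build_insert(
    "happening", {"id": "MainEvent", "a": None, "b": "9:30 pm", "c": None}
)
print(cql)     # INSERT INTO happening (id, b) VALUES (?, ?)
print(params)  # ['MainEvent', '9:30 pm']
```

Each distinct combination of present columns yields a different statement string, which is the prepared-statement growth mentioned elsewhere in this thread. Skipped columns also keep whatever value was stored before, so this does not help when a null has to overwrite older data.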
Howto avoid tombstones when inserting NULL values
Hello,

I'd start with describing my use case and how I'd like to use Cassandra to solve my storage needs.
We're processing a stream of events for various happenings. Every event has a unique happening_id.
One happening may have many events, usually ~20-100 events. I'd like to store only the latest event for the same happening (an event is an incremental update and it contains all up-to-date data about the happening).
Technically the events are streamed from Kafka, processed with Spark and saved to Cassandra.
In Cassandra we use upserts (insert with the same primary key). So far so good, however there comes the tombstone...

When I'm inserting a field with a NULL value, Cassandra creates a tombstone for this field. As I understand it, this is for space efficiency: Cassandra doesn't have to remember there is a NULL value, she just deletes the respective column, and a delete creates a ... tombstone.
I was hoping there could be an option to tell Cassandra not to be so space efficient and to store "unset" info without generating tombstones.
Something similar to inserting empty strings instead of null values:

CREATE TABLE happening (id text PRIMARY KEY, event text);
INSERT INTO happening (id, event) VALUES ('1', 'event1');
-- tombstone is generated
INSERT INTO happening (id, event) VALUES ('1', null);
-- tombstone is not generated
INSERT INTO happening (id, event) VALUES ('1', '');

Possible solutions:
1. Disable tombstones with gc_grace_seconds = 0, or set it to a reasonably low value (1 hour?). Not good, since phantom data may re-appear.
2. Ignore NULLs on the Spark side with "spark.cassandra.output.ignoreNulls=true". Not good, since this will never overwrite a previously inserted event field with an "empty" one.
3. On inserts with Spark, find all NULL values and replace them with an "empty" equivalent (empty string for text, 0 for integer). Very inefficient, and problematic to find an "empty" equivalent for some data types.

Until tombstones appeared, Cassandra was the right fit for our use case; however, now I'm not sure if we're heading in the right direction.
Could you please give me some advice on how to solve this problem?

Thank you,
Tomas

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
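To make the drawback of option 3 concrete, here is a small Python sketch. The `EMPTY_BY_TYPE` map, the `replace_nulls` helper and the schema dictionary are all hypothetical and are not part of Spark or the Cassandra connector.

```python
# Hypothetical per-type "empty" stand-ins. Using 0 for int is exactly the
# ambiguity option 3 runs into: a stored 0 no longer says whether the source
# value was 0 or null.
EMPTY_BY_TYPE = {"text": "", "int": 0}


def replace_nulls(row, schema):
    """Return a copy of row with None swapped for the column type's stand-in."""
    out = {}
    for col, val in row.items():
        if val is None:
            col_type = schema[col]
            if col_type not in EMPTY_BY_TYPE:
                # Types such as timestamp have no obvious "empty" value.
                raise TypeError(f"no empty equivalent for {col_type!r}")
            val = EMPTY_BY_TYPE[col_type]
        out[col] = val
    return out


schema = {"id": "text", "event": "text", "fare_amount": "int"}
print(replace_nulls({"id": "1", "event": None, "fare_amount": None}, schema))
# {'id': '1', 'event': '', 'fare_amount': 0}
```

Readers of the table would then have to treat the stand-ins as "no value", which is workable for text but lossy for numbers and undefined for the types that raise here.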
RE: Inserting null values
I've added an option to prevent tombstone creation when using PreparedStatements to trunk, see CASSANDRA-7304.

The problem is having tombstones in regular columns. When you perform a read request (range query or by PK):
- Cassandra iterates over all the cells (all, not only the cells specified in the query) in the relevant rows while counting tombstone cells (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/filter/SliceQueryFilter.java#L199)
- creates a ColumnFamily object instance with the rows
- filters the selected columns from the internal CF (https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/SelectStatement.java#L653)
- returns the result

If you have many unnecessary tombstones you read many unnecessary cells.

From: Eric Stevens [mailto:migh...@gmail.com]
Sent: Wednesday, May 06, 2015 4:37 PM
To: user@cassandra.apache.org
Subject: Re: Inserting null values

> If INSERT of null values is problematic for small portions of your data,
> then it stands to reason that an INSERT option containing an instruction
> to prevent tombstone creation would be an important performance
> optimization (and would also address the fact that non-null collections
> also generate tombstones on INSERT as well).
>
> INSERT INTO ... USING no_tombstones;
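The read path described above can be pictured with a toy Python model. This is not Cassandra's code: the `scan` function, the `TombstoneOverwhelming` exception and the cell representation are invented for illustration, and only the two default thresholds (1000 and 100000) are taken from the stock cassandra.yaml settings.

```python
class TombstoneOverwhelming(Exception):
    """Raised by this toy model when too many tombstones are scanned."""


def scan(cells, warn_threshold=1000, failure_threshold=100000):
    """Walk every cell in the scanned rows, counting tombstones.

    cells is a list of (column, value) pairs; a value of None marks a
    tombstone. Returns (live_cells, tombstone_count, warned).
    """
    live, tombstones = [], 0
    for column, value in cells:
        if value is None:
            tombstones += 1
            if tombstones > failure_threshold:
                raise TombstoneOverwhelming(
                    f"scanned over {failure_threshold} tombstones"
                )
        else:
            live.append((column, value))
    return live, tombstones, tombstones > warn_threshold


cells = [("a", "1"), ("b", None), ("c", None), ("d", "4")]
print(scan(cells, warn_threshold=1))
# ([('a', '1'), ('d', '4')], 2, True)
```

The point of the model is that tombstone cells are visited and counted even when the query selects none of their columns; selecting columns happens only after the scan.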
Re: Inserting null values
I agree that inserting null is not as good as not inserting that column at all when you have confidence that you are not shadowing any underlying data. But pragmatically speaking it really doesn't sound like a small number of incidental nulls/tombstones (< 20% of columns, otherwise CASSANDRA-3442 takes over) is going to have any performance impact either in your query patterns or in compaction in any practical sense.

If INSERT of null values is problematic for small portions of your data, then it stands to reason that an INSERT option containing an instruction to prevent tombstone creation would be an important performance optimization (and would also address the fact that non-null collections also generate tombstones on INSERT as well).

INSERT INTO ... USING no_tombstones;

> There's thresholds (log messages, etc.) which operate on tombstone counts
> over a certain number, but not on column counts over the same number.

tombstone_warn_threshold and tombstone_failure_threshold only apply to clustering scans, right? I.e. tombstones don't count against those thresholds if they are not part of the clustering key column being considered for the non-EQ relation? The documentation certainly implies so:

tombstone_warn_threshold
http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__tombstone_warn_threshold
(Default: 1000) The maximum number of tombstones a query can scan before warning.

tombstone_failure_threshold
http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__tombstone_failure_threshold
(Default: 100000) The maximum number of tombstones a query can scan before aborting.

On Wed, Apr 29, 2015 at 12:42 PM, Robert Coli rc...@eventbrite.com wrote:

> On Wed, Apr 29, 2015 at 9:16 AM, Eric Stevens migh...@gmail.com wrote:
>
>> In the end, inserting a tombstone into a non-clustered column shouldn't
>> be appreciably worse (if it is at all) than inserting a value instead.
>> Or am I missing something here?
>
> There's thresholds (log messages, etc.) which operate on tombstone counts
> over a certain number, but not on column counts over the same number.
> Given that tombstones are often smaller than data columns, sorta hard to
> understand conceptually?
>
> =Rob
RE: Inserting null values
Inserting a null value creates a tombstone. Tombstones can have major performance implications. You can see the tombstones using sstable2json.

If you have a small number of records with null values this seems OK; otherwise I recommend using the QueryBuilder (for Java clients) and waiting for https://issues.apache.org/jira/browse/CASSANDRA-7304

From: Matthew Johnson [mailto:matt.john...@algomi.com]
Sent: Wednesday, April 29, 2015 11:37 AM
To: user@cassandra.apache.org
Subject: Inserting null values

> Are there any obvious pitfalls to look out for that I have missed? Could
> it be a performance concern to insert a row with some nulls, as opposed to
> checking the values first and inserting the row and just omitting those
> columns?
Re: Inserting null values
<auto promotion mode on>
The problem of NULL inserts was already solved a long time ago with the Insert Strategy in Achilles: https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy
</auto promotion off>

However, it's nice to see there will be a flag on the protocol side to handle this problem.

On Wed, Apr 29, 2015 at 2:27 PM, Ali Akhtar ali.rac...@gmail.com wrote:

> Have you considered adding a 'toSafe' method which checks if the item is
> null, and if so, returns a default value? E.g. String too = safe(bar, "");
Re: Inserting null values
Have you considered adding a 'toSafe' method which checks if the item is null, and if so, returns a default value? E.g. String too = safe(bar, "");

On Apr 29, 2015 3:14 PM, Matthew Johnson matt.john...@algomi.com wrote:

> Are there any obvious pitfalls to look out for that I have missed? Could
> it be a performance concern to insert a row with some nulls, as opposed to
> checking the values first and inserting the row and just omitting those
> columns?
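The helper suggested above is small enough to sketch in full. This Python version is only an illustration of the idea; the name `safe` is taken from the message and the rest is an assumption.

```python
def safe(value, default):
    """Return default when value is None, otherwise value unchanged."""
    return default if value is None else value


print(repr(safe(None, "")))   # ''
print(repr(safe("bar", "")))  # 'bar'
```

Writing the default stores an ordinary cell instead of a tombstone, at the cost of no longer being able to tell "absent" apart from "empty" when reading the row back.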
Inserting null values
Hi all,

I have some fields that I am storing into Cassandra, but some of them could be null at any given point. As there are quite a lot of them, it makes the code much more readable if I don't check each one for null before adding it to the INSERT.

I can see a few Jiras around CQL 3 supporting inserting nulls:

https://issues.apache.org/jira/browse/CASSANDRA-3783
https://issues.apache.org/jira/browse/CASSANDRA-5648

But I have tested inserting null and it seems to work fine (when querying the table with cqlsh, it shows up as a red lowercase null).

Are there any obvious pitfalls to look out for that I have missed? Could it be a performance concern to insert a row with some nulls, as opposed to checking the values first and inserting the row and just omitting those columns?

Thanks!
Matt
Re: Inserting null values
Correct me if I'm wrong, but tombstones are only really problematic if you have them going into clustering keys, then perform a range select on that column, right (assuming it's not a symptom of the antipattern of indefinitely overwriting the same value)? I.e. you're deleting clusters off of a partition.

A tombstone isn't any more costly, and in some ways less costly, than a normal column (it's a smaller size at rest than, say, inserting an empty string or other default value as someone suggested). Tombstones stay around a little longer post-compaction than other values, so that's a downside, but they also would drop off the record as if it had never been set on the next compaction after gc grace period.

Tombstones aren't intrinsically bad, but they can have some bad properties in certain situations. This doesn't strike me as one of them. If you have a way to avoid inserting null when you know you aren't occluding an underlying value, that would be ideal. But because the tombstone would sit adjacent on disk to other values from the same insert, even if you were on platters, the drive head is already positioned over the tombstone location when it's read, because it read the prior value and subsequent value which were written during the same insert.

In the end, inserting a tombstone into a non-clustered column shouldn't be appreciably worse (if it is at all) than inserting a value instead. Or am I missing something here?

On Wed, Apr 29, 2015 at 7:53 AM, Matthew Johnson matt.john...@algomi.com wrote:

> Thank you all for the advice!
>
> I have decided to use the Insert query builder
> (com.datastax.driver.core.querybuilder.Insert) which allows me to
> dynamically insert as many or as few columns as I need, and doesn't
> require multiple prepared statements. Then, I will look at Ali's
> suggestion: I will create a small helper method like
> 'addToInsertIfNotNull' and pump all my values into that, which will then
> filter out the ones that are null. Should keep the code nice and neat. I
> will feed back if I find any problems with this approach (but please jump
> in if you have already spotted any :)).
>
> Thanks!
> Matt
>
> From: Robert Wille [mailto:rwi...@fold3.com]
> Sent: 29 April 2015 15:16
> To: user@cassandra.apache.org
> Subject: Re: Inserting null values
>
> I've come across the same thing. I have a table with at least half a dozen
> columns that could be null, in any combination. Having a prepared
> statement for each permutation of null columns just isn't going to happen.
> I don't want to build custom queries each time because I have a really
> cool system of managing my queries that relies on them being prepared.
>
> Fortunately for me, I should have at most a handful of tombstones in each
> partition, and most of my records are written exactly once. So, I just let
> the tombstones get written and they'll eventually get compacted out and
> life will go on.
>
> It's annoying and not ideal, but what can you do?
>
> On Apr 29, 2015, at 2:36 AM, Matthew Johnson matt.john...@algomi.com wrote:
>
>> Are there any obvious pitfalls to look out for that I have missed? Could
>> it be a performance concern to insert a row with some nulls, as opposed
>> to checking the values first and inserting the row and just omitting
>> those columns?
Re: Inserting null values
Enough tombstones can inflate the size of an SSTable, causing issues during compaction (imagine a multi tb sstable w/ 99% tombstones) even if there's no clustering key defined.

Perhaps an edge case, but worth considering.

On Wed, Apr 29, 2015 at 9:17 AM Eric Stevens migh...@gmail.com wrote:

> Correct me if I'm wrong, but tombstones are only really problematic if you
> have them going into clustering keys, then perform a range select on that
> column, right (assuming it's not a symptom of the antipattern of
> indefinitely overwriting the same value)? I.e. you're deleting clusters
> off of a partition.
>
> In the end, inserting a tombstone into a non-clustered column shouldn't be
> appreciably worse (if it is at all) than inserting a value instead. Or am
> I missing something here?
Re: Inserting null values
On Wed, Apr 29, 2015 at 9:16 AM, Eric Stevens migh...@gmail.com wrote:

> In the end, inserting a tombstone into a non-clustered column shouldn't be appreciably worse (if it is at all) than inserting a value instead. Or am I missing something here?

There are thresholds (log messages, etc.) which operate on tombstone counts over a certain number, but not on column counts over the same number. Given that tombstones are often smaller than data columns, sorta hard to understand conceptually?

=Rob
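For readers who have not met these thresholds, here is a toy Python sketch of the asymmetry Rob describes: a read counts the tombstones it scans and warns or aborts past a limit, while live cells are never counted that way. The default numbers are meant to mirror the `tombstone_warn_threshold` and `tombstone_failure_threshold` settings in cassandra.yaml as I understand them; the `scan_cells` function and the `TOMBSTONE` sentinel are hypothetical and are not Cassandra's actual read path.

```python
# Toy model of a per-read tombstone guard. NOT Cassandra's real code:
# scan_cells and TOMBSTONE are made up for illustration only.
TOMBSTONE = object()  # sentinel standing in for a deleted (null) cell


def scan_cells(cells, warn_threshold=1000, failure_threshold=100_000):
    """Return (live_cells, tombstone_count, warned) for a single read.

    Only tombstones are counted against the thresholds; live cells are
    collected without any limit, which is the asymmetry under discussion.
    """
    live = []
    tombstones = 0
    for cell in cells:
        if cell is TOMBSTONE:
            tombstones += 1
            if tombstones > failure_threshold:
                # The read is aborted once too many tombstones were scanned.
                raise RuntimeError(
                    "scanned more than %d tombstones" % failure_threshold
                )
        else:
            live.append(cell)
    # Past the warn threshold the read still succeeds but would be logged.
    warned = tombstones > warn_threshold
    return live, tombstones, warned
```

A row with a handful of null columns stays far below either limit in this model; the guard only matters when a single read has to step over very many tombstones.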
Re: Inserting null values
But we're talking about a single tombstone on each of a finite (small) set of values, right? We're not talking about INSERTs which are 99% nulls (at least I don't think that's what Matthew was suggesting). Unless you're engaging in the antipattern of repeated overwrite, I'm still struggling to see why this is worse than an equivalent number of non-tombstoned writes. In fact from the description I don't think we're talking about these tombstones even occluding any value at all.

> imagine a multi tb sstable w/ 99% tombstones

Let's play with this hypothetical, which doesn't seem like a probable consequence of the original question. You'd have to have taken enough writes *inside* gc grace period to have even produced a multi-TB sstable to come anywhere near this, and even then this either exceeds or comes really close to the recommended maximum total data size per node (let alone in a single sstable). If you did have such an sstable, it doesn't seem very likely to compact again inside gc grace period short of manually triggered major compaction.

But let's assume you do that, you run cassandra stress inserting nothing but tombstones, and kick off major compaction periodically. If it compacted inside gc grace period, is this worse for compaction than the same number of non-tombstoned values (i.e. a multi-TB sstable is costly to compact no matter what the contents)? If it compacted outside gc grace period, then 99% of the work is just dropping tombstones, it seems like it would run really fast (for being an absurdly large sstable), as there would be just 1% of the contents to actually copy over to the new sstable.

I'm still not clear on what I'm missing. Is a tombstone more expensive to compact than a non-tombstone?

On Wed, Apr 29, 2015 at 10:06 AM, Jonathan Haddad j...@jonhaddad.com wrote:

> Enough tombstones can inflate the size of an SSTable causing issues during compaction (imagine a multi tb sstable w/ 99% tombstones) even if there's no clustering key defined.
Re: Inserting null values
In a way, yes. A tombstone will only be removed after gc_grace iff the compaction is sure that it contains all rows which that tombstone might shadow. When two non-tombstone conflicting rows are compacted, it's always just LWW.

On Wed, Apr 29, 2015 at 2:42 PM, Eric Stevens migh...@gmail.com wrote:

> [...] Is a tombstone more expensive to compact than a non-tombstone?
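The two rules in this reply (last-write-wins for ordinary conflicts, and the extra condition a tombstone must meet before it can be dropped) can be written down as a small Python model. This is only a sketch under simplifying assumptions: the `Cell` type, the function names, and the single shared time unit are all invented for illustration, ties between equal timestamps are resolved arbitrarily, and the "oldest timestamp outside the compaction" argument is a stand-in for whatever bookkeeping the real compaction uses to decide that nothing older can still be shadowed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell: value None models a tombstone."""
    value: Optional[str]
    write_ts: int  # write timestamp, same (arbitrary) unit as `now` below


def reconcile(a: Cell, b: Cell) -> Cell:
    """Last write wins (LWW): keep whichever cell has the newer timestamp."""
    return a if a.write_ts >= b.write_ts else b


def can_purge(tombstone: Cell, now: int, gc_grace: int,
              oldest_ts_outside_compaction: Optional[int]) -> bool:
    """A tombstone may be dropped only if BOTH conditions hold:

    1. it is older than the gc grace period, and
    2. no data outside this compaction could be older than the tombstone,
       i.e. there is nothing left elsewhere that it might still shadow.
    """
    past_grace = now - tombstone.write_ts > gc_grace
    nothing_left_to_shadow = (
        oldest_ts_outside_compaction is None
        or oldest_ts_outside_compaction > tombstone.write_ts
    )
    return past_grace and nothing_left_to_shadow
```

The second condition is the extra cost: merging two live cells needs only a timestamp comparison, whereas dropping a tombstone also requires knowing what the files outside the current compaction might contain.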
Re: Inserting null values
I’ve come across the same thing. I have a table with at least half a dozen columns that could be null, in any combination. Having a prepared statement for each permutation of null columns just isn’t going to happen. I don’t want to build custom queries each time because I have a really cool system of managing my queries that relies on them being prepared.

Fortunately for me, I should have at most a handful of tombstones in each partition, and most of my records are written exactly once. So, I just let the tombstones get written and they’ll eventually get compacted out and life will go on. It’s annoying and not ideal, but what can you do?

On Apr 29, 2015, at 2:36 AM, Matthew Johnson matt.john...@algomi.com wrote:

> Hi all,
>
> I have some fields that I am storing into Cassandra, but some of them could be null at any given point. As there are quite a lot of them, it makes the code much more readable if I don’t check each one for null before adding it to the INSERT.
>
> I can see a few Jiras around CQL 3 supporting inserting nulls:
>
> https://issues.apache.org/jira/browse/CASSANDRA-3783
> https://issues.apache.org/jira/browse/CASSANDRA-5648
>
> But I have tested inserting null and it seems to work fine (when querying the table with cqlsh, it shows up as a red lowercase null).
>
> Are there any obvious pitfalls to look out for that I have missed? Could it be a performance concern to insert a row with some nulls, as opposed to checking the values first and inserting the row and just omitting those columns?
>
> Thanks!
> Matt
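To put a number on the permutation problem: six independently nullable columns give 2^6 = 64 possible INSERT shapes. One middle ground between preparing all of them up front and writing tombstones is to prepare each shape lazily, the first time it is seen. The Python sketch below illustrates that idea only; `StatementCache`, `count_statement_shapes`, and the `prepare` callback are hypothetical names, and `prepare` defaults to returning the CQL string unchanged so the example runs without a driver (with a real driver it would be something like the session's prepare call).

```python
def count_statement_shapes(nullable_columns):
    """Each nullable column is either present or absent: 2**n shapes."""
    return 2 ** len(nullable_columns)


class StatementCache:
    """Lazily prepares one INSERT per distinct set of non-null columns."""

    def __init__(self, table, key_columns, prepare=lambda cql: cql):
        self.table = table
        self.key_columns = tuple(key_columns)
        self._prepare = prepare  # stand-in for a driver's prepare function
        self._cache = {}

    def __len__(self):
        return len(self._cache)

    def statement_for(self, row):
        """Return (statement, bind_values) for the non-null columns of row."""
        # Key columns always come first; other columns are sorted so that
        # the same set of columns always maps to the same cache entry.
        extra = sorted(
            name for name, value in row.items()
            if value is not None and name not in self.key_columns
        )
        columns = self.key_columns + tuple(extra)
        if columns not in self._cache:
            cql = "INSERT INTO %s (%s) VALUES (%s)" % (
                self.table,
                ", ".join(columns),
                ", ".join("?" for _ in columns),
            )
            self._cache[columns] = self._prepare(cql)
        return self._cache[columns], [row[name] for name in columns]
```

In practice only the shapes that actually occur get prepared, which is often far fewer than the theoretical maximum. Whether that fits a query-management system built around a fixed set of prepared statements is a separate question.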
RE: Inserting null values
Thank you all for the advice!

I have decided to use the Insert query builder ( *com.datastax.driver.core.querybuilder.Insert*) which allows me to dynamically insert as many or as few columns as I need, and doesn’t require multiple prepared statements. Then, I will look at Ali’s suggestion – I will create a small helper method like ‘addToInsertIfNotNull’ and pump all my values into that, which will then filter out the ones that are null. Should keep the code nice and neat – I will feed back if I find any problems with this approach (but please jump in if you have already spotted any :)).

Thanks!
Matt
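A rough idea of what an ‘addToInsertIfNotNull’ style helper could look like, written in Python rather than against the Java query builder so that it runs without any driver installed. Both function names below are hypothetical, and the `%s` placeholders follow the convention the Python driver uses for non-prepared statements; with the Java builder the equivalent would presumably be a guarded call that adds a column and value to the Insert object.

```python
def add_to_insert_if_not_null(columns, name, value):
    """Record the column only when it has a value; nulls are skipped, so
    no tombstone is written for them."""
    if value is not None:
        columns[name] = value
    return columns


def build_insert(table, columns):
    """Build a parameterised INSERT from the collected non-null columns.

    Returns (cql, params); dict insertion order fixes the column order.
    """
    names = ", ".join(columns)
    markers = ", ".join("%s" for _ in columns)
    cql = "INSERT INTO %s (%s) VALUES (%s)" % (table, names, markers)
    return cql, list(columns.values())


# Example usage with made-up table and column names:
cols = {}
add_to_insert_if_not_null(cols, "id", 7)
add_to_insert_if_not_null(cols, "name", None)   # skipped
add_to_insert_if_not_null(cols, "price", 1.5)
cql, params = build_insert("trades", cols)
```

One caveat that follows from the discussion above: skipping a null is only equivalent to writing it when there is no earlier value to occlude. If a column must be reset to null on an existing row, the explicit null (and its tombstone) is still needed.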