You get my vote for the best subject line I've seen this week.
On Tue, Feb 11, 2025 at 1:20 PM Benedict <bened...@apache.org> wrote: > > Perhaps we should reify this in the type system? Introduce a MAYBE EMPTY > modifier for column types, or simply name them eg [int] vs int? > > I think the problem is today it’s all implicit and - as David says - > inconsistent. It would be nice to move away from this as the default for a > variety of reasons, but also nice to make the behaviour well defined for > those use cases we think we’re supporting. > > On 11 Feb 2025, at 21:16, David Capwell <dcapw...@apple.com> wrote: > > Thanks for the reply! > > AFAIK this EMPTY stuff goes back to thrift days. > > > This is what I was told, but the expected semantics are not clear so my goal > is to help flesh things out. > > We let people insert these zero length values back then, so we have to > support those zero length values existing for ever :/. > > > We allow this for some types but not all. I think where I am coming from is > write != select, so if we say empty = “no value” or “null” then why does > select treat it as a value? Is this the expected behavior? > > maybe we should be converting such values into a tombstone / NULL > > > Tombstones can be purged, where as empty can’t, so should it? > > Everything besides SAI, including the table based 2i and ALLOW FILTERING, > treat EMPTY as a distant value which can be inserted and queried on. We have > supported it in the past, so we should continue to support it into the > future, even if it is painful to do. > > > I guess where I come from here is what semantics do we expect. > > So lets say v0 is empty bytes int > > SELECT CAST(v0 AS text) > > Is this null or empty bytes? In our project this is null > > SELECT JSON v0 > > Is this null or empty bytes? In our project this is null > > SELECT avg(v0) … > > Is this null or empty bytes? In our project this is null > > So in most places you touch empty bytes we treat it as null, but only in > filtering do we not. > > On Feb 11, 2025, at 11:27 AM, Jeremiah Jordan <jeremiah.jor...@gmail.com> > wrote: > > AFAIK this EMPTY stuff goes back to thrift days. We let people insert these > zero length values back then, so we have to support those zero length values > existing for ever :/. > > How useful is such a distinction? I don’t know. Is anybody actually doing > this? Well Andres brought up > https://issues.apache.org/jira/browse/CASSANDRA-20313 as a problem because we > had an end user create an SAI index on a column which contained EMPTY values > in it. So people are inserting these into the database. Would they expect > to be able to query by EMPTY? I do not know. > > This is the first I have heard of the “isEmptyValueMeaningless” setting. The > meaning of EMPTY to me has always been the same for an Integer or a String, > “this column has a value of no value” vs NULL which means "this column is not > set/has no value”. If we truly want to follow the spirit of that setting, > then maybe we should be converting such values into a tombstone / NULL up > front when deserializing them, rather than storing the EMPTY byte buffer in > the DB? > > Anyway, I am kind of rambling here. I am of two minds. > I can see that this does seem like a silly distinction to have for some > types, so maybe we should just decide that in a CQL world, EMPTY means NULL > for some types, and actually just make that a tombstone. Maybe 6.0 would be > a good major version change to make such a “breaking” behavior change in. > > I can also see the “don’t screw up the legacy apps” use case. Everything > besides SAI, including the table based 2i and ALLOW FILTERING, treat EMPTY as > a distant value which can be inserted and queried on. We have supported it > in the past, so we should continue to support it into the future, even if it > is painful to do. > > Flip a coin and I can argue either side. So I would love to hear others > thoughts to convince me one way to the other. > > -Jeremiah > > > > On Feb 11, 2025 at 12:55:35 PM, Caleb Rackliffe <calebrackli...@gmail.com> > wrote: >> >> The case where allowsEmpty == true AND is meaningless == true is especially >> confusing. If I could design this from scratch, I would reject writes and >> filtering on EMPTY values for int and the other types where meaningless == >> true. (In other words, if we allow EMPTY, it is meaningful and queryable. If >> we don't, it isn't.) That avoids problems that can't have anything other >> than an arbitrary solution, like what we do with < and > for EMPTY for int. >> When we add IS [NOT] NULL support, that would preferably NOT match EMPTY >> values for the types where empty means something, like strings. For >> everything else, EMPTY could be equivalent to null and match IS NULL. >> >> The only real way to make SAI compatible with the current behavior is to add >> something like a special postings list to its data structures that >> corresponds to the rows where the indexed column value is EMPTY. >> >> On Tue, Feb 11, 2025 at 12:21 PM David Capwell <dcapw...@apple.com> wrote: >>> >>> Bringing this discussion to dev@ rather than Slack as we try to figure out >>> CASSANDRA-20313 and CASSANDRA-19461. >>> >>> In the type system, we have 2 different (but related) methods: >>> >>> AbstractType#allowsEmpty - if the user gives empty >>> bytes (new byte[0]) will the type reject it >>> AbstractType#isEmptyValueMeaningless - if the user gives empty bytes, >>> should this be handled like null? >>> >>> In practice, there are 2 cases that matter: >>> >>> allowsEmpty = true AND is meaningless = false - stuff like text and bytes >>> allowsEmpty = true AND is meaningless = true - many types, example "int" >>> >>> What this means is that users are able to use empty bytes when writing to >>> these types, but this leads to complexity in the filter path, and is >>> something we are trying to flesh out the “correct” semantics for SAI. >>> >>> Simple example: >>> >>> {code} >>> >>> @Test >>> public void test() throws IOException >>> { >>> try (Cluster cluster = Cluster.build(1).start()) >>> { >>> init(cluster); >>> cluster.schemaChange(withKeyspace("CREATE TABLE %s.tbl (pk int >>> primary key, v int)")); >>> IInvokableInstance node = cluster.get(1); >>> for (int i = 0; i < 10; i++) >>> node.executeInternal(withKeyspace("INSERT INTO %s.tbl (pk, v) >>> VALUES (?, ?)"), i, ByteBufferUtil.EMPTY_BYTE_BUFFER); >>> >>> var qr = node.executeInternalWithResult(withKeyspace("SELECT * FROM >>> %s.tbl WHERE v=? ALLOW FILTERING"), ByteBufferUtil.EMPTY_BYTE_BUFFER); >>> StringBuilder sb = new StringBuilder(); >>> sb.append(qr.names()); >>> while (qr.hasNext()) >>> { >>> var next = qr.next(); >>> sb.append('\n').append(next); >>> } >>> System.out.println(sb); >>> } >>> } >>> >>> {code} >>> >>> “Should” this return 10 rows or 0? In this case, the type is int, and int >>> defines empty as meaningless, which means it should act as a null; yet this >>> query returns 10 rows, which violates CQL as foo = null == false. >>> >>> Right now there really isn’t a way to query for NULL (CASSANDRA-10715 is >>> still open), but if we did add such a thing we would also need to figure >>> out the semantics with regard to these cases. > >