Re: Meaningless emptiness and filtering

Patrick McFadin Tue, 11 Feb 2025 14:00:25 -0800

You get my vote for the best subject line I've seen this week.


On Tue, Feb 11, 2025 at 1:20 PM Benedict <[email protected]> wrote:
>
> Perhaps we should reify this in the type system? Introduce a MAYBE EMPTY 
> modifier for column types, or simply name them eg [int] vs int?
>
> I think the problem is today it’s all implicit and - as David says - 
> inconsistent. It would be nice to move away from this as the default for a 
> variety of reasons, but also nice to make the behaviour well defined for 
> those use cases we think we’re supporting.
>
> On 11 Feb 2025, at 21:16, David Capwell <[email protected]> wrote:
>
> Thanks for the reply!
>
> AFAIK this EMPTY stuff goes back to thrift days.
>
>
> This is what I was told, but the expected semantics are not clear so my goal 
> is to help flesh things out.
>
> We let people insert these zero length values back then, so we have to 
> support those zero length values existing for ever :/.
>
>
> We allow this for some types but not all.  I think where I am coming from is 
> write != select, so if we say empty = “no value” or “null” then why does 
> select treat it as a value?  Is this the expected behavior?
>
> maybe we should be converting such values into a tombstone / NULL
>
>
> Tombstones can be purged, where as empty can’t, so should it?
>
> Everything besides SAI, including the table based 2i and ALLOW FILTERING, 
> treat EMPTY as a distant value which can be inserted and queried on.  We have 
> supported it in the past, so we should continue to support it into the 
> future, even if it is painful to do.
>
>
> I guess where I come from here is what semantics do we expect.
>
> So lets say v0 is empty bytes int
>
> SELECT CAST(v0 AS text)
>
> Is this null or empty bytes?  In our project this is null
>
> SELECT JSON v0
>
> Is this null or empty bytes?  In our project this is null
>
> SELECT avg(v0) …
>
> Is this null or empty bytes?  In our project this is null
>
> So in most places you touch empty bytes we treat it as null, but only in 
> filtering do we not.
>
> On Feb 11, 2025, at 11:27 AM, Jeremiah Jordan <[email protected]> 
> wrote:
>
> AFAIK this EMPTY stuff goes back to thrift days.  We let people insert these 
> zero length values back then, so we have to support those zero length values 
> existing for ever :/.
>
> How useful is such a distinction?  I don’t know.  Is anybody actually doing 
> this?  Well Andres brought up 
> https://issues.apache.org/jira/browse/CASSANDRA-20313 as a problem because we 
> had an end user create an SAI index on a column which contained EMPTY values 
> in it.  So people are inserting these into the database.  Would they expect 
> to be able to query by EMPTY?  I do not know.
>
> This is the first I have heard of the “isEmptyValueMeaningless” setting.  The 
> meaning of EMPTY to me has always been the same for an Integer or a String, 
> “this column has a value of no value” vs NULL which means "this column is not 
> set/has no value”.  If we truly want to follow the spirit of that setting, 
> then maybe we should be converting such values into a tombstone / NULL up 
> front when deserializing them, rather than storing the EMPTY byte buffer in 
> the DB?
>
> Anyway, I am kind of rambling here.  I am of two minds.
> I can see that this does seem like a silly distinction to have for some 
> types, so maybe we should just decide that in a CQL world, EMPTY means NULL 
> for some types, and actually just make that a tombstone.  Maybe 6.0 would be 
> a good major version change to make such a “breaking” behavior change in.
>
> I can also see the “don’t screw up the legacy apps” use case.  Everything 
> besides SAI, including the table based 2i and ALLOW FILTERING, treat EMPTY as 
> a distant value which can be inserted and queried on.  We have supported it 
> in the past, so we should continue to support it into the future, even if it 
> is painful to do.
>
> Flip a coin and I can argue either side.  So I would love to hear others 
> thoughts to convince me one way to the other.
>
> -Jeremiah
>
>
>
> On Feb 11, 2025 at 12:55:35 PM, Caleb Rackliffe <[email protected]> 
> wrote:
>>
>> The case where allowsEmpty == true AND is meaningless == true is especially 
>> confusing. If I could design this from scratch, I would reject writes and 
>> filtering on EMPTY values for int and the other types where meaningless == 
>> true. (In other words, if we allow EMPTY, it is meaningful and queryable. If 
>> we don't, it isn't.) That avoids problems that can't have anything other 
>> than an arbitrary solution, like what we do with < and > for EMPTY for int. 
>> When we add IS [NOT] NULL support, that would preferably NOT match EMPTY 
>> values for the types where empty means something, like strings. For 
>> everything else, EMPTY could be equivalent to null and match IS NULL.
>>
>> The only real way to make SAI compatible with the current behavior is to add 
>> something like a special postings list to its data structures that 
>> corresponds to the rows where the indexed column value is EMPTY.
>>
>> On Tue, Feb 11, 2025 at 12:21 PM David Capwell <[email protected]> wrote:
>>>
>>> Bringing this discussion to dev@ rather than Slack as we try to figure out 
>>> CASSANDRA-20313 and CASSANDRA-19461.
>>>
>>> In the type system, we have 2 different (but related) methods:
>>>
>>> AbstractType#allowsEmpty                        - if the user gives empty 
>>> bytes (new byte[0]) will the type reject it
>>> AbstractType#isEmptyValueMeaningless  - if the user gives empty bytes, 
>>> should this be handled like null?
>>>
>>> In practice, there are 2 cases that matter:
>>>
>>> allowsEmpty = true AND is meaningless = false - stuff like text and bytes
>>> allowsEmpty = true AND is meaningless = true  - many types, example "int"
>>>
>>> What this means is that users are able to use empty bytes when writing to 
>>> these types, but this leads to complexity in the filter path, and is 
>>> something we are trying to flesh out the “correct” semantics for SAI.
>>>
>>> Simple example:
>>>
>>> {code}
>>>
>>> @Test
>>> public void test() throws IOException
>>> {
>>>     try (Cluster cluster = Cluster.build(1).start())
>>>     {
>>>         init(cluster);
>>>         cluster.schemaChange(withKeyspace("CREATE TABLE %s.tbl (pk int 
>>> primary key, v int)"));
>>>         IInvokableInstance node = cluster.get(1);
>>>         for (int i = 0; i < 10; i++)
>>>             node.executeInternal(withKeyspace("INSERT INTO %s.tbl (pk, v) 
>>> VALUES (?, ?)"), i, ByteBufferUtil.EMPTY_BYTE_BUFFER);
>>>
>>>         var qr = node.executeInternalWithResult(withKeyspace("SELECT * FROM 
>>> %s.tbl WHERE v=? ALLOW FILTERING"), ByteBufferUtil.EMPTY_BYTE_BUFFER);
>>>         StringBuilder sb = new StringBuilder();
>>>         sb.append(qr.names());
>>>         while (qr.hasNext())
>>>         {
>>>             var next = qr.next();
>>>             sb.append('\n').append(next);
>>>         }
>>>         System.out.println(sb);
>>>     }
>>> }
>>>
>>> {code}
>>>
>>> “Should” this return 10 rows or 0?  In this case, the type is int, and int 
>>> defines empty as meaningless, which means it should act as a null; yet this 
>>> query returns 10 rows, which violates CQL as foo = null == false.
>>>
>>> Right now there really isn’t a way to query for NULL (CASSANDRA-10715 is 
>>> still open), but if we did add such a thing we would also need to figure 
>>> out the semantics with regard to these cases.
>
>

Re: Meaningless emptiness and filtering

Reply via email to