Re: Meaningless emptiness and filtering

David Capwell Tue, 11 Feb 2025 14:09:56 -0800

My planet was destroyed to make way for hyperspatial express route before I 
could get the answer, sorry… anyone interested in getting lunch at the end of 
the universe?


> On Feb 11, 2025, at 2:03 PM, Paulo Motta <[email protected]> wrote:
> 
> On Tue, Feb 11, 2025 at 5:00 PM Patrick McFadin <[email protected]> wrote:
>> 
>> You get my vote for the best subject line I've seen this week.
>> 
> 
> +1, I'm deeply saddened that despite this title this post does not
> contain the answer to the ultimate question of life, the universe, and
> everything. :-(
> 
> 
>> On Tue, Feb 11, 2025 at 1:20 PM Benedict <[email protected]> wrote:
>>> 
>>> Perhaps we should reify this in the type system? Introduce a MAYBE EMPTY 
>>> modifier for column types, or simply name them eg [int] vs int?
>>> 
>>> I think the problem is today it’s all implicit and - as David says - 
>>> inconsistent. It would be nice to move away from this as the default for a 
>>> variety of reasons, but also nice to make the behaviour well defined for 
>>> those use cases we think we’re supporting.
>>> 
>>> On 11 Feb 2025, at 21:16, David Capwell <[email protected]> wrote:
>>> 
>>> Thanks for the reply!
>>> 
>>> AFAIK this EMPTY stuff goes back to thrift days.
>>> 
>>> 
>>> This is what I was told, but the expected semantics are not clear so my 
>>> goal is to help flesh things out.
>>> 
>>> We let people insert these zero length values back then, so we have to 
>>> support those zero length values existing for ever :/.
>>> 
>>> 
>>> We allow this for some types but not all.  I think where I am coming from 
>>> is write != select, so if we say empty = “no value” or “null” then why does 
>>> select treat it as a value?  Is this the expected behavior?
>>> 
>>> maybe we should be converting such values into a tombstone / NULL
>>> 
>>> 
>>> Tombstones can be purged, where as empty can’t, so should it?
>>> 
>>> Everything besides SAI, including the table based 2i and ALLOW FILTERING, 
>>> treat EMPTY as a distant value which can be inserted and queried on.  We 
>>> have supported it in the past, so we should continue to support it into the 
>>> future, even if it is painful to do.
>>> 
>>> 
>>> I guess where I come from here is what semantics do we expect.
>>> 
>>> So lets say v0 is empty bytes int
>>> 
>>> SELECT CAST(v0 AS text)
>>> 
>>> Is this null or empty bytes?  In our project this is null
>>> 
>>> SELECT JSON v0
>>> 
>>> Is this null or empty bytes?  In our project this is null
>>> 
>>> SELECT avg(v0) …
>>> 
>>> Is this null or empty bytes?  In our project this is null
>>> 
>>> So in most places you touch empty bytes we treat it as null, but only in 
>>> filtering do we not.
>>> 
>>> On Feb 11, 2025, at 11:27 AM, Jeremiah Jordan <[email protected]> 
>>> wrote:
>>> 
>>> AFAIK this EMPTY stuff goes back to thrift days.  We let people insert 
>>> these zero length values back then, so we have to support those zero length 
>>> values existing for ever :/.
>>> 
>>> How useful is such a distinction?  I don’t know.  Is anybody actually doing 
>>> this?  Well Andres brought up 
>>> https://issues.apache.org/jira/browse/CASSANDRA-20313 as a problem because 
>>> we had an end user create an SAI index on a column which contained EMPTY 
>>> values in it.  So people are inserting these into the database.  Would they 
>>> expect to be able to query by EMPTY?  I do not know.
>>> 
>>> This is the first I have heard of the “isEmptyValueMeaningless” setting.  
>>> The meaning of EMPTY to me has always been the same for an Integer or a 
>>> String, “this column has a value of no value” vs NULL which means "this 
>>> column is not set/has no value”.  If we truly want to follow the spirit of 
>>> that setting, then maybe we should be converting such values into a 
>>> tombstone / NULL up front when deserializing them, rather than storing the 
>>> EMPTY byte buffer in the DB?
>>> 
>>> Anyway, I am kind of rambling here.  I am of two minds.
>>> I can see that this does seem like a silly distinction to have for some 
>>> types, so maybe we should just decide that in a CQL world, EMPTY means NULL 
>>> for some types, and actually just make that a tombstone.  Maybe 6.0 would 
>>> be a good major version change to make such a “breaking” behavior change in.
>>> 
>>> I can also see the “don’t screw up the legacy apps” use case.  Everything 
>>> besides SAI, including the table based 2i and ALLOW FILTERING, treat EMPTY 
>>> as a distant value which can be inserted and queried on.  We have supported 
>>> it in the past, so we should continue to support it into the future, even 
>>> if it is painful to do.
>>> 
>>> Flip a coin and I can argue either side.  So I would love to hear others 
>>> thoughts to convince me one way to the other.
>>> 
>>> -Jeremiah
>>> 
>>> 
>>> 
>>> On Feb 11, 2025 at 12:55:35 PM, Caleb Rackliffe <[email protected]> 
>>> wrote:
>>>> 
>>>> The case where allowsEmpty == true AND is meaningless == true is 
>>>> especially confusing. If I could design this from scratch, I would reject 
>>>> writes and filtering on EMPTY values for int and the other types where 
>>>> meaningless == true. (In other words, if we allow EMPTY, it is meaningful 
>>>> and queryable. If we don't, it isn't.) That avoids problems that can't 
>>>> have anything other than an arbitrary solution, like what we do with < and 
>>>> > for EMPTY for int. When we add IS [NOT] NULL support, that would 
>>>> preferably NOT match EMPTY values for the types where empty means 
>>>> something, like strings. For everything else, EMPTY could be equivalent to 
>>>> null and match IS NULL.
>>>> 
>>>> The only real way to make SAI compatible with the current behavior is to 
>>>> add something like a special postings list to its data structures that 
>>>> corresponds to the rows where the indexed column value is EMPTY.
>>>> 
>>>> On Tue, Feb 11, 2025 at 12:21 PM David Capwell <[email protected]> wrote:
>>>>> 
>>>>> Bringing this discussion to dev@ rather than Slack as we try to figure 
>>>>> out CASSANDRA-20313 and CASSANDRA-19461.
>>>>> 
>>>>> In the type system, we have 2 different (but related) methods:
>>>>> 
>>>>> AbstractType#allowsEmpty                        - if the user gives empty 
>>>>> bytes (new byte[0]) will the type reject it
>>>>> AbstractType#isEmptyValueMeaningless  - if the user gives empty bytes, 
>>>>> should this be handled like null?
>>>>> 
>>>>> In practice, there are 2 cases that matter:
>>>>> 
>>>>> allowsEmpty = true AND is meaningless = false - stuff like text and bytes
>>>>> allowsEmpty = true AND is meaningless = true  - many types, example "int"
>>>>> 
>>>>> What this means is that users are able to use empty bytes when writing to 
>>>>> these types, but this leads to complexity in the filter path, and is 
>>>>> something we are trying to flesh out the “correct” semantics for SAI.
>>>>> 
>>>>> Simple example:
>>>>> 
>>>>> {code}
>>>>> 
>>>>> @Test
>>>>> public void test() throws IOException
>>>>> {
>>>>>    try (Cluster cluster = Cluster.build(1).start())
>>>>>    {
>>>>>        init(cluster);
>>>>>        cluster.schemaChange(withKeyspace("CREATE TABLE %s.tbl (pk int 
>>>>> primary key, v int)"));
>>>>>        IInvokableInstance node = cluster.get(1);
>>>>>        for (int i = 0; i < 10; i++)
>>>>>            node.executeInternal(withKeyspace("INSERT INTO %s.tbl (pk, v) 
>>>>> VALUES (?, ?)"), i, ByteBufferUtil.EMPTY_BYTE_BUFFER);
>>>>> 
>>>>>        var qr = node.executeInternalWithResult(withKeyspace("SELECT * 
>>>>> FROM %s.tbl WHERE v=? ALLOW FILTERING"), 
>>>>> ByteBufferUtil.EMPTY_BYTE_BUFFER);
>>>>>        StringBuilder sb = new StringBuilder();
>>>>>        sb.append(qr.names());
>>>>>        while (qr.hasNext())
>>>>>        {
>>>>>            var next = qr.next();
>>>>>            sb.append('\n').append(next);
>>>>>        }
>>>>>        System.out.println(sb);
>>>>>    }
>>>>> }
>>>>> 
>>>>> {code}
>>>>> 
>>>>> “Should” this return 10 rows or 0?  In this case, the type is int, and 
>>>>> int defines empty as meaningless, which means it should act as a null; 
>>>>> yet this query returns 10 rows, which violates CQL as foo = null == false.
>>>>> 
>>>>> Right now there really isn’t a way to query for NULL (CASSANDRA-10715 is 
>>>>> still open), but if we did add such a thing we would also need to figure 
>>>>> out the semantics with regard to these cases.
>>> 
>>>

Re: Meaningless emptiness and filtering

Reply via email to