Re: [DISCUSS] Vector type and empty value

David Capwell Wed, 20 Sep 2023 09:38:04 -0700

> I don’t think we can readily migrate old types away from this however, 
> without breaking backwards compatibility.


Given that java driver has a different behavior from server, I wouldn’t be 
shocked to see that other drivers also have their own custom behaviors… so not 
clear how to migrate unless we actually hand a user facing standard per type… 
if all drivers use a “default value” and is consistent, I do think we could 
migrate, but would need to live with this till at least 6.0+

> We can only prevent its use in the CQL layer where support isn’t required.

+1

> On Sep 20, 2023, at 7:38 AM, Benedict <bened...@apache.org> wrote:
> 
> Yes, if this is what was meant by empty I agree. It’s nonsensical for most 
> types. Apologies for any confusion.
> 
> I don’t think we can readily migrate old types away from this however, 
> without breaking backwards compatibility. We can only prevent its use in the 
> CQL layer where support isn’t required. My understanding was that we had at 
> least tried to do this for all non-thrift schemas, but perhaps we did not do 
> so thoroughly and now may have some CQL legacy support requirements as well.
> 
>> On 20 Sep 2023, at 15:30, Aleksey Yeshchenko <alek...@apple.com> wrote:
>> 
>> Allowing zero-length byte arrays for most old types is just a legacy from 
>> Darker Days. It’s a distinct concern from columns being nullable or not.
>> 
>> There are a couple types where this makes sense: strings and blobs. All else 
>> should not allow this except for backward compatibility reasons. So, not for 
>> new types.
>> 
>>>> On 20 Sep 2023, at 00:08, David Capwell <dcapw...@apple.com> wrote:
>>>> 
>>>> When does empty mean null?
>>> 
>>> 
>>> Most types are this way
>>> 
>>> @Test
>>> public void nullExample()
>>> {
>>> createTable("CREATE TABLE %s (pk int primary key, cuteness int)");
>>> execute("INSERT INTO %s (pk, cuteness) VALUES (0, ?)", ByteBuffer.wrap(new 
>>> byte[0]));
>>> Row result = execute("SELECT * FROM %s WHERE pk=0").one();
>>> if (result.has("cuteness")) System.out.println("Cuteness score: " + 
>>> result.getInt("cuteness"));
>>> else System.out.println("Cuteness score is undefined");
>>> }
>>> 
>>> 
>>> This test will NPE in getInt as the returned BB is seen as “null” for int32 
>>> type, you can make it “safer” by changing to the following
>>> 
>>> if (result.has("cuteness")) System.out.println("Cuteness score: " + 
>>> Int32Type.instance.compose(result.getBlob("cuteness")));
>>> 
>>> Now we get the log "Cuteness score: null”
>>> 
>>> What’s even better (just found this out) is that client isn’t consistent or 
>>> correct in these cases!
>>> 
>>> com.datastax.driver.core.Row result = executeNet(ProtocolVersion.CURRENT, 
>>> "SELECT * FROM %s WHERE pk=0").one();
>>> if (result.getBytesUnsafe("cuteness") != null) System.out.println("Cuteness 
>>> score: " + result.getInt("cuteness"));
>>> else System.out.println("Cuteness score is undefined”);
>>> 
>>> This prints "Cuteness score: 0”
>>> 
>>> So for Cassandra we think the value is “null” but java driver thinks it’s 0?
>>> 
>>>> Do we have types where writing an empty value creates a tombstone?
>>> 
>>> Empty does not generate a tombstone for any type, but empty has a similar 
>>> user experience as we return null in both cases (but just found out that 
>>> the drivers may not be consistent with this…)
>>> 
>>>>> On Sep 19, 2023, at 3:33 PM, J. D. Jordan <jeremiah.jor...@gmail.com> 
>>>>> wrote:
>>>> 
>>>> When does empty mean null?  My understanding was that empty is a valid 
>>>> value for the types that support it, separate from null (aka a tombstone). 
>>>> Do we have types where writing an empty value creates a tombstone?
>>>> 
>>>> I agree with David that my preference would be for only blob and string 
>>>> like types to support empty. It’s too late for the existing types, but we 
>>>> should hold to this going forward. Which is what I think the idea was in 
>>>> https://issues.apache.org/jira/browse/CASSANDRA-8951 as well?  That it was 
>>>> sad the existing numerics were emptiable, but too late to change, and we 
>>>> could correct it for newer types.
>>>> 
>>>>> On Sep 19, 2023, at 12:12 PM, David Capwell <dcapw...@apple.com> wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> When we introduced TINYINT and SMALLINT (CASSANDRA-8951) we started 
>>>>>> making types non -emptiable. This approach makes more sense to me as 
>>>>>> having to deal with empty value is error prone in my opinion.
>>>>> 
>>>>> I agree it’s confusing, and in the patch I found that different code 
>>>>> paths didn’t handle things correctly as we have some times (most) that 
>>>>> support empty bytes, and some that do not…. Empty also has different 
>>>>> meaning in different code paths; for most it means “null”, and for some 
>>>>> other types it means “empty”…. To try to make things more clear I added 
>>>>> org.apache.cassandra.db.marshal.AbstractType#isNull(V, 
>>>>> org.apache.cassandra.db.marshal.ValueAccessor<V>) to the type system so 
>>>>> each type can define if empty is null or not.
>>>>> 
>>>>>> I also think that it would be good to standardize on one approach to 
>>>>>> avoid confusion.
>>>>> 
>>>>> I agree, but also don’t feel it’s a perfect one-size-fits-all thing…. 
>>>>> Let’s say I have a “blob” type and I write an empty byte… what does this 
>>>>> mean?  What does it mean for "text" type?  The fact I get back a null in 
>>>>> both those cases was very confusing to me… I do feel that some types 
>>>>> should support empty, and the common code of empty == null I think is 
>>>>> very brittle (blob/text was not correct in different places due to 
>>>>> this...)… so I am cool with removing that relationship, but don’t think 
>>>>> we should have a rule blocking empty for all current / future types as it 
>>>>> some times does make sense.
>>>>> 
>>>>>> empty vector (I presume) for the vector type?
>>>>> 
>>>>> Empty vectors (vector[0]) are blocked at the type level, the smallest 
>>>>> vector is vector[1]
>>>>> 
>>>>>> as types that can never be null
>>>>> 
>>>>> One pro here is that “null” is cheaper (in some regards) than delete 
>>>>> (though we can never purge), but having 2 similar behaviors (write null, 
>>>>> do a delete) at the type level is a bit confusing… Right now I am allowed 
>>>>> to do the following (the below isn’t valid CQL, its a hybrid of CQL + 
>>>>> Java code…)
>>>>> 
>>>>> CREATE TABLE fluffykittens (pk int primary key, cuteness int);
>>>>> INSERT INTO fluffykittens (pk, cuteness) VALUES (0, new byte[0])
>>>>> 
>>>>> CREATE TABLE typesarehard (pk1 int, pk2 int, cuteness int, PRIMARY KEY 
>>>>> ((pk1, pk2));
>>>>> INSERT INTO typesarehard (pk1, pk2, cuteness) VALUES (new byte[0], new 
>>>>> byte[0], new byte[0]) — valid as the partition key is not empty as its a 
>>>>> composite of 2 empty values, this is the same as new byte[2]
>>>>> 
>>>>> The first time I ever found out that empty bytes was valid was when a 
>>>>> user was trying to abuse this in collections (also the fact collections 
>>>>> support null in some cases and not others is fun…)…. It was blowing up in 
>>>>> random places… good times!
>>>>> 
>>>>> I am personally not in favor of allowing empty bytes (other than for blob 
>>>>> / text as that is actually valid for the domain), but having similar 
>>>>> types having different semantics I feel is more problematic...
>>>>> 
>>>>>> On Sep 19, 2023, at 8:56 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>> 
>>>>>>> I am strongly in favour of permitting the table definition forbidding 
>>>>>>> nulls - and perhaps even defaulting to this behaviour. But I don’t 
>>>>>>> think we should have types that are inherently incapable of being null.
>>>>>> I'm with Benedict. Seems like this could help prevent whatever "nulls in 
>>>>>> primary key columns" problems Aleksey was alluding to on those tickets 
>>>>>> back in the day that pushed us towards making the new types 
>>>>>> non-emptiable as well (i.e. primary keys are non-null in table 
>>>>>> definition).
>>>>>> 
>>>>>> Furthering Alex' question, having a default value for unset fields in 
>>>>>> any non-collection context seems... quite surprising to me in a 
>>>>>> database. I could see the argument for making container / collection 
>>>>>> types non-nullable, maybe, but that just keeps us in a potential 
>>>>>> straddle case (some types nullable, some not).
>>>>>> 
>>>>>> On Tue, Sep 19, 2023, at 8:22 AM, Benedict wrote:
>>>>>>> 
>>>>>>> If I understand this suggestion correctly it is a whole can of worms, 
>>>>>>> as types that can never be null prevent us ever supporting outer joins 
>>>>>>> that return these types.
>>>>>>> 
>>>>>>> I am strongly in favour of permitting the table definition forbidding 
>>>>>>> nulls - and perhaps even defaulting to this behaviour. But I don’t 
>>>>>>> think we should have types that are inherently incapable of being null. 
>>>>>>> I also certainly don’t think we should have bifurcated our behaviour 
>>>>>>> between types like this.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 19 Sep 2023, at 11:54, Alex Petrov <al...@coffeenco.de> wrote:
>>>>>>>> 
>>>>>>>> To make sure I understand this right; does that mean there will be a 
>>>>>>>> default value for unset fields? Like 0 for numerical values, and an 
>>>>>>>> empty vector (I presume) for the vector type?
>>>>>>>> 
>>>>>>>> On Fri, Sep 15, 2023, at 11:46 AM, Benjamin Lerer wrote:
>>>>>>>>> Hi everybody,
>>>>>>>>> 
>>>>>>>>> I noticed that the new Vector type accepts empty ByteBuffer values as 
>>>>>>>>> an input representing null.
>>>>>>>>> When we introduced TINYINT and SMALLINT (CASSANDRA-895) we started 
>>>>>>>>> making types non -emptiable. This approach makes more sense to me as 
>>>>>>>>> having to deal with empty value is error prone in my opinion.
>>>>>>>>> I also think that it would be good to standardize on one approach to 
>>>>>>>>> avoid confusion.
>>>>>>>>> 
>>>>>>>>> Should we make the Vector type non-emptiable and stick to it for the 
>>>>>>>>> new types?
>>>>>>>>> 
>>>>>>>>> I like to hear your opinion.
>>>>> 
>>>>> 
>>> 
>> 
>

Re: [DISCUSS] Vector type and empty value

Reply via email to