Re: [DISCUSS] Vector type and empty value

David Capwell Wed, 20 Sep 2023 09:26:27 -0700

> So, not for new types.

> Should we make the Vector type non-emptiable and stick to it for the new 
> types?


Yep, works for me.

We should also update the test 
org.apache.cassandra.db.marshal.AbstractTypeTest#empty to detect this for new 
types by making org.apache.cassandra.db.marshal.AbstractType#allowsEmpty 
default to false and override in all legacy types

More than glad to review any patch to fix this issue!

> On Sep 20, 2023, at 7:16 AM, Aleksey Yeshchenko <[email protected]> wrote:
> 
> Allowing zero-length byte arrays for most old types is just a legacy from 
> Darker Days. It’s a distinct concern from columns being nullable or not.
> 
> There are a couple types where this makes sense: strings and blobs. All else 
> should not allow this except for backward compatibility reasons. So, not for 
> new types.
> 
>> On 20 Sep 2023, at 00:08, David Capwell <[email protected]> wrote:
>> 
>>> When does empty mean null?
>> 
>> 
>> Most types are this way
>> 
>> @Test
>> public void nullExample()
>> {
>> createTable("CREATE TABLE %s (pk int primary key, cuteness int)");
>> execute("INSERT INTO %s (pk, cuteness) VALUES (0, ?)", ByteBuffer.wrap(new 
>> byte[0]));
>> Row result = execute("SELECT * FROM %s WHERE pk=0").one();
>> if (result.has("cuteness")) System.out.println("Cuteness score: " + 
>> result.getInt("cuteness"));
>> else System.out.println("Cuteness score is undefined");
>> }
>> 
>> 
>> This test will NPE in getInt as the returned BB is seen as “null” for int32 
>> type, you can make it “safer” by changing to the following
>> 
>> if (result.has("cuteness")) System.out.println("Cuteness score: " + 
>> Int32Type.instance.compose(result.getBlob("cuteness")));
>> 
>> Now we get the log "Cuteness score: null”
>> 
>> What’s even better (just found this out) is that client isn’t consistent or 
>> correct in these cases!
>> 
>> com.datastax.driver.core.Row result = executeNet(ProtocolVersion.CURRENT, 
>> "SELECT * FROM %s WHERE pk=0").one();
>> if (result.getBytesUnsafe("cuteness") != null) System.out.println("Cuteness 
>> score: " + result.getInt("cuteness"));
>> else System.out.println("Cuteness score is undefined”);
>> 
>> This prints "Cuteness score: 0”
>> 
>> So for Cassandra we think the value is “null” but java driver thinks it’s 0?
>> 
>>> Do we have types where writing an empty value creates a tombstone?
>> 
>> Empty does not generate a tombstone for any type, but empty has a similar 
>> user experience as we return null in both cases (but just found out that the 
>> drivers may not be consistent with this…)
>> 
>>> On Sep 19, 2023, at 3:33 PM, J. D. Jordan <[email protected]> wrote:
>>> 
>>> When does empty mean null?  My understanding was that empty is a valid 
>>> value for the types that support it, separate from null (aka a tombstone). 
>>> Do we have types where writing an empty value creates a tombstone?
>>> 
>>> I agree with David that my preference would be for only blob and string 
>>> like types to support empty. It’s too late for the existing types, but we 
>>> should hold to this going forward. Which is what I think the idea was in 
>>> https://issues.apache.org/jira/browse/CASSANDRA-8951 as well?  That it was 
>>> sad the existing numerics were emptiable, but too late to change, and we 
>>> could correct it for newer types.
>>> 
>>>> On Sep 19, 2023, at 12:12 PM, David Capwell <[email protected]> wrote:
>>>> 
>>>> 
>>>>> 
>>>>> When we introduced TINYINT and SMALLINT (CASSANDRA-8951) we started 
>>>>> making types non -emptiable. This approach makes more sense to me as 
>>>>> having to deal with empty value is error prone in my opinion.
>>>> 
>>>> I agree it’s confusing, and in the patch I found that different code paths 
>>>> didn’t handle things correctly as we have some times (most) that support 
>>>> empty bytes, and some that do not…. Empty also has different meaning in 
>>>> different code paths; for most it means “null”, and for some other types 
>>>> it means “empty”…. To try to make things more clear I added 
>>>> org.apache.cassandra.db.marshal.AbstractType#isNull(V, 
>>>> org.apache.cassandra.db.marshal.ValueAccessor<V>) to the type system so 
>>>> each type can define if empty is null or not.
>>>> 
>>>>> I also think that it would be good to standardize on one approach to 
>>>>> avoid confusion.
>>>> 
>>>> I agree, but also don’t feel it’s a perfect one-size-fits-all thing…. 
>>>> Let’s say I have a “blob” type and I write an empty byte… what does this 
>>>> mean?  What does it mean for "text" type?  The fact I get back a null in 
>>>> both those cases was very confusing to me… I do feel that some types 
>>>> should support empty, and the common code of empty == null I think is very 
>>>> brittle (blob/text was not correct in different places due to this...)… so 
>>>> I am cool with removing that relationship, but don’t think we should have 
>>>> a rule blocking empty for all current / future types as it some times does 
>>>> make sense.
>>>> 
>>>>> empty vector (I presume) for the vector type?
>>>> 
>>>> Empty vectors (vector[0]) are blocked at the type level, the smallest 
>>>> vector is vector[1]
>>>> 
>>>>> as types that can never be null
>>>> 
>>>> One pro here is that “null” is cheaper (in some regards) than delete 
>>>> (though we can never purge), but having 2 similar behaviors (write null, 
>>>> do a delete) at the type level is a bit confusing… Right now I am allowed 
>>>> to do the following (the below isn’t valid CQL, its a hybrid of CQL + Java 
>>>> code…)
>>>> 
>>>> CREATE TABLE fluffykittens (pk int primary key, cuteness int);
>>>> INSERT INTO fluffykittens (pk, cuteness) VALUES (0, new byte[0])
>>>> 
>>>> CREATE TABLE typesarehard (pk1 int, pk2 int, cuteness int, PRIMARY KEY 
>>>> ((pk1, pk2));
>>>> INSERT INTO typesarehard (pk1, pk2, cuteness) VALUES (new byte[0], new 
>>>> byte[0], new byte[0]) — valid as the partition key is not empty as its a 
>>>> composite of 2 empty values, this is the same as new byte[2]
>>>> 
>>>> The first time I ever found out that empty bytes was valid was when a user 
>>>> was trying to abuse this in collections (also the fact collections support 
>>>> null in some cases and not others is fun…)…. It was blowing up in random 
>>>> places… good times!
>>>> 
>>>> I am personally not in favor of allowing empty bytes (other than for blob 
>>>> / text as that is actually valid for the domain), but having similar types 
>>>> having different semantics I feel is more problematic...
>>>> 
>>>>> On Sep 19, 2023, at 8:56 AM, Josh McKenzie <[email protected]> wrote:
>>>>> 
>>>>>> I am strongly in favour of permitting the table definition forbidding 
>>>>>> nulls - and perhaps even defaulting to this behaviour. But I don’t think 
>>>>>> we should have types that are inherently incapable of being null.
>>>>> I'm with Benedict. Seems like this could help prevent whatever "nulls in 
>>>>> primary key columns" problems Aleksey was alluding to on those tickets 
>>>>> back in the day that pushed us towards making the new types non-emptiable 
>>>>> as well (i.e. primary keys are non-null in table definition).
>>>>> 
>>>>> Furthering Alex' question, having a default value for unset fields in any 
>>>>> non-collection context seems... quite surprising to me in a database. I 
>>>>> could see the argument for making container / collection types 
>>>>> non-nullable, maybe, but that just keeps us in a potential straddle case 
>>>>> (some types nullable, some not).
>>>>> 
>>>>> On Tue, Sep 19, 2023, at 8:22 AM, Benedict wrote:
>>>>>> 
>>>>>> If I understand this suggestion correctly it is a whole can of worms, as 
>>>>>> types that can never be null prevent us ever supporting outer joins that 
>>>>>> return these types.
>>>>>> 
>>>>>> I am strongly in favour of permitting the table definition forbidding 
>>>>>> nulls - and perhaps even defaulting to this behaviour. But I don’t think 
>>>>>> we should have types that are inherently incapable of being null. I also 
>>>>>> certainly don’t think we should have bifurcated our behaviour between 
>>>>>> types like this.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 19 Sep 2023, at 11:54, Alex Petrov <[email protected]> wrote:
>>>>>>> 
>>>>>>> To make sure I understand this right; does that mean there will be a 
>>>>>>> default value for unset fields? Like 0 for numerical values, and an 
>>>>>>> empty vector (I presume) for the vector type?
>>>>>>> 
>>>>>>> On Fri, Sep 15, 2023, at 11:46 AM, Benjamin Lerer wrote:
>>>>>>>> Hi everybody,
>>>>>>>> 
>>>>>>>> I noticed that the new Vector type accepts empty ByteBuffer values as 
>>>>>>>> an input representing null.
>>>>>>>> When we introduced TINYINT and SMALLINT (CASSANDRA-895) we started 
>>>>>>>> making types non -emptiable. This approach makes more sense to me as 
>>>>>>>> having to deal with empty value is error prone in my opinion.
>>>>>>>> I also think that it would be good to standardize on one approach to 
>>>>>>>> avoid confusion.
>>>>>>>> 
>>>>>>>> Should we make the Vector type non-emptiable and stick to it for the 
>>>>>>>> new types?
>>>>>>>> 
>>>>>>>> I like to hear your opinion.
>>>> 
>>>> 
>> 
>

Re: [DISCUSS] Vector type and empty value

Reply via email to