clintropolis commented on pull request #11888: URL: https://github.com/apache/druid/pull/11888#issuecomment-982044151
> 1) estimateSizeBytesNullable appears to be the only reference to nullability. The other methods say they deal with non-null data. That seems inconsistent. Here I think I'm in favor of what @cheddar is suggesting, that this interface should not be null-aware, and null handling should be left to the caller. (Callers shouldn't need to use this interface if they get a null.) Oops, this was a leftover from the first round of changes from @cheddar 's comment where I moved most of the null byte handling stuff out of `TypeStrategy` itself, but forgot to move this one out. Since then, I've moved all of the null byte handling into the special `NullableTypeStrategy` that should be used for callers that want to use the null byte from `NullHandling`, along with lots of javadocs to encourage other options if available since the byte per value is quite expensive. > 2) To simplify life for callers that don't want to care about null handling, we could make a NullableTypeStrategy that accepts nulls and stores a null byte. It'd be composable with all the other TypeStrategy implementations, and lets certain callers stay simple while also letting other callers be fancy. Did this, as well as adding a `getNullableStrategy` to `TypeSignature` that will automatically wrap the output of `getStrategy` with the `NullableTypeStrategy` > 3) I'm confused about the intent of estimateSizeBytes: is it meant to be a maximum size? Like, can the estimate be an underestimate? Some of the javadocs and comments seem to suggest that is the case. If it's meant to be a max, we should call it specifically "max" instead of "estimate". Before the most recent set of changes, estimate was _actually_ being used to get the exact size of some value in bytes, in order to check that variably sized values did not exceed this maximum. This has been reworked as suggested in 4) to instead pass `maxSizeBytes` directly into the `write` family of methods, and now has been relaxed to only needing to be an estimate. The only thing still using it in the current PR is the heap based `ExpressionLambdaAggregator`, which uses it to check that the aggregator does not explode the heap in an unbounded way. > 4) I'm not sure if anything could use it today, but, we should design this interface to support variable size serialization lengths, since we'll want it for various future data types. Those could work by having write take a max length, and returning -1 if it would need more than that. Then the caller could allocate more space and call write again. Or, maybe, write could return -the_amount_of_space_it_needs. In this case the "estimate" method really could return an estimate rather than a max. Variably sized serialization was previously already supported (but required externally verifying that there was enough space with the estimate method). But, I liked the ideas here, so I have modified the `write` methods to now accept a maxSizeBytes parameter and went with `-the_amount_of_extra_space_it_needs` approach if this value (or buffer size) is exceeded, so that callers could choose to explode if the value was not able to be completely written, as well as know how much additional space would be required to fully write the value. > 5) Even with variable size support like (4) suggest, we will still have callers that can't deal with that. So there will need to be some API that supports those callers. I'm not sure what that'd look like; maybe a getMaxBytes method that returns the max, or -1 if there really isn't a known max? And if the max is -1, then callers that only support constant size serialization can't use the API? Variably sized support is in place, I'm not sure we need a `getMaxBytes` method as attempting to call write now requires specifying a max bytes, even for constant sized values, callers with constant sized serialization can either specify the correct size or some size larger than the actual size, which is sort of ugly, but also primitive numerics shouldn't really be using `TypeStrategy` directly, instead they should be using the static methods to avoid converting the java primitives, so maybe it isn't a huge deal. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
