gianm commented on a change in pull request #8894: document SQL compatible null 
handling mode
URL: https://github.com/apache/incubator-druid/pull/8894#discussion_r348224108
 
 

 ##########
 File path: docs/design/segments.md
 ##########
 @@ -143,6 +143,11 @@ the 'column data' is an array of values. Additionally, a 
row with *n*
 values in 'column data' will have *n* non-zero valued entries in
 bitmaps.
 
+## SQL Compatible Null Handling
+By default, Druid string dimension columns use the values `''` and `null` 
interchangeably and numeric and metric columns can not represent `null` at all, 
instead coercing nulls to `0`. However, Druid also provides an SQL compatible 
null handling mode, which must be enabled at the system level, through 
`druid.generic.useDefaultValueForNull`. This setting, when set to `false`, will 
allow Druid to _at ingestion time_ create segments whose string columns can 
distinguish `''` from `null`, and numeric columns which can represent `null` 
valued rows instead of `0`.
+
+String dimension columns contain no additional column structures in this mode, 
instead just reserving an additional dictionary entry for the `null` value. 
Numeric columns however will be stored in the segment with an additional 
`bitmap` whose set bits indicate `null` valued rows. In addition to slightly 
increased segment sizes, this also means that SQL compatible null handling 
comes at a query time cost for numeric columns too, which must now check 
whether or not the row is null valued during selection and aggregation. This 
overhead has been calculated to be approximately 10-20 nanoseconds _per row_ 
scanned in each query, so it is worth considering if the expressivity is worth 
the performance hit for your individual use case.
 
 Review comment:
   I think we need to dial down this block a bit:
   
   > In addition to slightly increased segment sizes, this also means that SQL 
compatible null handling comes at a query time cost for numeric columns too, 
which must now check whether or not the row is null valued during selection and 
aggregation. This overhead has been calculated to be approximately 10-20 
nanoseconds _per row_ scanned in each query, so it is worth considering if the 
expressivity is worth the performance hit for your individual use case.
   
   The reasons: (1) we eventually want to make this mode the default; (2) the 
10–20ns figure assumes certain things about the number of nulls per column, and 
we might do further optimizations, etc. So I would skip the specific number and 
some of the warnings.
   
   How about: 
   
   > In addition to slightly increased segment sizes, SQL compatible null 
handling can incur a performance cost at query time as well, due to the need to 
check the null bitmap. This performance cost only occurs for columns that 
actually contain nulls.
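
   To illustrate why the cost would only show up for columns that actually 
contain nulls, here is a minimal sketch. It is not Druid's implementation: 
`NullAwareSumSketch`, `LongColumn`, and `sum` are hypothetical names, and a plain 
`java.util.BitSet` stands in for the segment's compressed null bitmap.

```java
import java.util.BitSet;

// Hypothetical sketch only; names and structure are assumptions, not Druid code.
class NullAwareSumSketch {

    // Stand-in for a numeric column: raw values plus a bitmap whose set bits mark null rows.
    static final class LongColumn {
        final long[] values;
        final BitSet nullBitmap;

        private LongColumn(long[] values, BitSet nullBitmap) {
            this.values = values;
            this.nullBitmap = nullBitmap;
        }

        // Builds a column from boxed rows; a null row sets the matching bit in the bitmap.
        static LongColumn of(Long... rows) {
            long[] values = new long[rows.length];
            BitSet nulls = new BitSet(rows.length);
            for (int i = 0; i < rows.length; i++) {
                if (rows[i] == null) {
                    nulls.set(i);
                } else {
                    values[i] = rows[i];
                }
            }
            return new LongColumn(values, nulls);
        }
    }

    // Sums the non-null rows; returns null when there is no non-null row (SQL SUM semantics).
    static Long sum(LongColumn column) {
        long total = 0;
        if (column.nullBitmap.isEmpty()) {
            // Fast path: no nulls in this column, so no per-row bitmap check is needed.
            for (long v : column.values) {
                total += v;
            }
            return column.values.length == 0 ? null : Long.valueOf(total);
        }
        // Slow path: consult the null bitmap for every row before aggregating it.
        boolean sawValue = false;
        for (int row = 0; row < column.values.length; row++) {
            if (column.nullBitmap.get(row)) {
                continue;
            }
            total += column.values[row];
            sawValue = true;
        }
        return sawValue ? Long.valueOf(total) : null;
    }

    public static void main(String[] args) {
        System.out.println(sum(LongColumn.of(1L, null, 5L)));
        System.out.println(sum(LongColumn.of(1L, 2L, 3L)));
    }
}
```

   In this sketch the per-row `get` call is only reached when the bitmap has at 
least one set bit, which is the behavior the suggested wording describes.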

