nishantmonu51 opened a new issue #4349: [Proposal] Improve Handling of Nulls in 
Druid
URL: https://github.com/apache/incubator-druid/issues/4349
 
 
   ## Motivation  
   This proposal is to improve handling of null values in druid by treating 
nulls as missing values which don’t actually exist. This will make it more 
compatible to SQL standard and help with integration with other existing BI 
systems which support ODBC/JDBC.
   
   ## Current Behavior 
   Current Null handling in druid is as follows - 
   1. String Dimensions - Nulls and Empty Strings are considered equivalent 
   2. Long/Float columns Nulls are considered equivalent to 0.
   
   ## Proposed Behavior Changes for Null Handling
   ### Null semantics after the changes described in proposal -
   1. Null and Empty Strings will be handled differently
   2. For every column type, null value will be considered equivalent to an 
absent/missing value. 
   3. Aggregate functions will ignore null values. If the data set contains no 
rows, or contains only rows with nulls as arguments to the aggregate function, 
then the function returns null except count which will return 0 instead of 
null. 
   e.g
       1. a min aggregation over null and 5 will give a result of 5
       2. longSum/doubleSum over values (null, 1, null) will return 1. 
       3. Count over values (null, 1, null) will return 1
       4. Count over values (null, null) will return 0
   4. Rollup of metrics - if a null and a non-null value are rolled up using an 
aggregator at ingestion time the result will be defined by the behavior of 
aggregator specified above.
   5. Math expressions - the result value will be null if any of the components 
is null. 
   6. missing column will be considered equivalent to a column having null 
values for all the rows.
   7. In case of Sorting, by default nulls will be sorted as the lowest value. 
Users can specify to sort nulls as the highest value as part of the query. 
   8. Users will be able to specify default values to replace NULL values at 
ingestion time. 
   9. Comparisons between null and non-null values will follow [three-valued 
logic](https://en.wikipedia.org/wiki/Three-valued_logic) and the result of a 
comparison involving NULL is not a boolean value – it is a null value.
   
   ## Implementation Changes - 
   Ability to specify default values for dimensions/metrics while ingestion 
   This idea is to allow users to specify a default value for each column that 
can be used to replace nulls during ingestion.. e.g 
   
   ```json
   {
   "type" : "String", 
   "name" : "column1", 
   "replaceMissingValuesWith" : "abc"
   }, 
   {
   "type" : "Long", 
   "Name": "column2", 
   "replaceMissingValuesWith" : 5
   } 
   ```
   
   In this case, any null value in column1 will be replaced with String “abc” 
and any null value for long column will be replaced with 5. The final segment 
created will store these default values instead of nulls. 
   
   ## Storage Format changes 
   The current storage format is not sufficient for being able to differentiate 
between the fact that whether a value is null or not. We need to change the 
storage format to also store the information about the presence of a value. we 
can choose following options - 
   1. For String columns - we can choose one of the following - 
       1. In the dictionary encoding reserve 0 as being mapped to null value, 
any other value will get dictionary encoding value > 0. This will also work for 
multi value column, where an array of values can store 0 as the id. This will 
need incrementing segment version for backwards compatibility. 
       2. Add a presence bitmap which tells whether the value is present or 
absent in the row. For a row with multiple value, it may be possible that both 
the presence bit is not set and the row have some values, which will indicate 
that the value is a multi value including null. 
   2. For long/Float columns - Add a presence bitmap to the columns which tells 
whether the value is present or not.
   
   For String column, it seems approach (a) will be better as it will allow 
simplifying handling of multi values columns also.
   We can also skip creating a presence bitmap for the case when all the values 
in a column are known to be non-null. 
   
   ### After the changes the schema for the columns will be as follows - 
   1. String Column - 
      1. Dictionary Encoding starting will 1, 0 will be reserved for mapping 
null value. 
      2. Column Data
      3. Bitmap Indexes
   2. Numeric Column - 
       1. Column Data 
       2. Presence Bitmap 
   
   ## Nullability for Aggregators/Metrics 
   As per the current implementation, most aggregators are coded to replace 
null values with default  e.g.  sum treats them as zeroes, min treats them as 
positive infinity, etc. 
   To match the new semantics we need to make following changes - 
   1. Make aggregators nullable, where if an aggregator encounters only the 
null values the result will be null value. 
   2. The behavior will be same for max,min and sum except that they will 
return null in case all the encountered values are null. 
   3. Count aggregator will ignore null values and only count non-null values, 
If all the encountered values are null, the result will be 0.  This is 
different from current behavior where null values are also counted.
   
   ### Math Expressions Null Handling 
   Null handling for math expressions will be similar to the aggregators. For 
general calculations like sum, full expression will be considered as null if 
any of the components is null. Specifying a default value for null will be 
supported by the use of NVL or IF clause to assign default values at query time.
   
   ### Filtering on Null values 
   SelectorDimFilter currently specifies filtering on null values but the 
implementation assumes null and empty strings as equivalent. The implementation 
will be changed to consider null and empty string differently. Cache key for 
selectorDimFilter also assumes. 
   Generation of cache key for selectorDimFilter also needs to be modified to 
support null. 
   
   ### Changes to Druid build-in SQL layer 
   1. IS NULL and IS NOT NULL will be mapped to SelectorDimFilter with value 
null. 
   2. NVL, IFNULL and COALESCE will be mapped to DimensionSpec 
replaceMissingValueWith property. 
   
   ### Misc Changes
   Above are the major changes in the user facing APIs and behavior. Other than 
these there are multiple places in the code where we convert empty Strings to 
null and vice-versa. They will be changed accordingly in order to treat null 
and String values separately. 
   
   ### Backwards Compatibility 
   1. For backwards compatibility we can add a new runtime property to preserve 
old behavior. e.g 'druid.null.behavior' If this property is set to 'sql' we 
will follow the sql behavior, If this is not set we will preserve backwards 
compatibility. 
   2. Query execution backwards compatibility - There are lots of places in 
code that are based on current druid null handling e.g count aggregator. They 
will need to be made backwards compatible in order to preserve current behavior 
where null and "" the treated the same. This would mean having two branches of 
the code at all the places which currently assumes null and empty strings to be 
equivalent during query execution e.g filters, extraction functions, lookups, 
count aggregator and more.
   3. Ingestion time - At ingestion we replace null with empty strings, the 
same behavior will be preserved by having multiple code paths based on the 
'druid.null.behavior' property
   4. Increment the versions of columns for supporting backwards compatibility 
in storage changes and be able to read columns with older version. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to