jihoonson opened a new pull request #12279: URL: https://github.com/apache/druid/pull/12279
### Terminology

- Column: a logical column that can be stored across multiple segments.
- Null column: a column that contains only nulls. Druid is aware of this column.
- Unknown column: a column that Druid is not aware of. In other words, it is a column that Druid is not tracking at all via segment metadata or any other method.

### Description

Today, null columns are not stored at ingestion time, so they become unknown columns once the ingestion job is done. The Druid native query engine uses segment-level schemas and treats unknown columns as if they were null columns; reading an unknown column returns only nulls.

Druid SQL is different. Druid uses the Calcite SQL planner, which requires valid column information at planning time. That column information is retrieved from the datasource-level schema, which is dynamically discovered by merging segment schemas. As a result, users cannot query unknown columns using SQL.

This causes a couple of issues. One of the main ones is SQL queries failing against stream ingestion from time to time. While it creates segments, the realtime task announces a realtime segment that has all columns in the ingestion spec. The realtime task thus reports even null columns for the realtime segment to the broker, and those columns can be used by the Druid SQL planner. Once the segment is handed off to a historical, the historical announces a segment that does not store any null columns. As a result, the same SQL query no longer works after the segment is handed off.

### Proposed solution

To make the SQL planner aware of null columns, Druid needs to track them. This PR proposes storing those null columns in the segment just like normal columns.

#### Feature flag

`druid.index.task.storeEmptyColumns` is added. It is enabled by default. A new task context parameter, `storeEmptyColumns`, is added, which can override the system property.
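Assuming the standard Druid runtime-property and task-context mechanisms, the cluster-wide default would be set via the new property (e.g. `druid.index.task.storeEmptyColumns=true` in the runtime properties), and a single task could override it through its context. A sketch of such an override in a task payload (the `type` and `spec` fields here are illustrative placeholders, not part of this PR):

```json
{
  "type": "index_parallel",
  "spec": { "...": "usual ingestion spec" },
  "context": {
    "storeEmptyColumns": false
  }
}
```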
#### Ingestion tasks

When `storeEmptyColumns` is set, the task stores every column specified in `DimensionsSpec` in the segments it creates. This applies to all kinds of ingestion except Hadoop ingestion and Tranquility.

#### Segment writes/reads

For null columns, Druid stores the column name, column type, number of rows, and bitmapSerdeFactory. The first two are stored in `ColumnDescriptor`; the last two are stored in `NullColumnPartSerde`. `NullColumnPartSerde` has a no-op serializer and a deserializer that can dynamically create a bitmap index and a dictionary. Finally, the null column names are stored at the end of `index.drd`, separately from normal columns. This preserves compatibility with older historicals: when they read a segment that has null columns stored, they won't be aware of those columns, but will simply ignore them without failing.

#### Test plan

- Unit tests are added in this PR to verify compatibility with older historicals.
- Unit tests are added in this PR to verify that null columns are stored in segments.
- Integration tests will be added in https://github.com/apache/druid/pull/12268.

#### Future work

- Currently, null numeric dimensions are always stored even without this change. I would call this a bug because 1) the behavior doesn't match that of string dimensions, and 2) all nulls are currently stored in the segment file along with the null bitmap index and are read back for query processing, which is unnecessary and inefficient.
- Hadoop ingestion may be supported later.

<hr>

##### Key changed/added classes in this PR

* `NullColumnPartSerde`
* `IndexIO`
* `IndexMergerV9`

<hr>

This PR has:
- [x] been self-reviewed.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
