jihoonson opened a new pull request #12250:
URL: https://github.com/apache/druid/pull/12250


   ### Terminology
   
   - Column: a logical column that can be stored across multiple segments.
   - Null-only column: a column that contains only nulls. Druid is aware of such columns.
   - Unknown column: a column that Druid is not aware of, i.e., one that Druid does not track at all via segment metadata or any other mechanism.
   
   ### The problem
   
   Today, null-only columns are not stored at ingestion time, so they become unknown columns once the ingestion job is done. The Druid native query engine uses the segment-level schema and treats unknown columns as if they were null-only columns; reading an unknown column returns only nulls.
   
   Druid SQL is different. Druid uses the Calcite SQL planner, which requires valid column information upfront at planning time. This column information is retrieved from the datasource-level schema, which is dynamically discovered by merging segment schemas. As a result, users cannot query unknown columns using SQL.
   
   This introduces several issues. One of the biggest is that the same SQL query against a streaming-ingested datasource can fail from time to time. While it creates a segment, the streaming ingestion task announces a realtime segment that has all the columns in the ingestion spec; the task thus reports even null-only columns for the realtime segment to the broker, whose schema the Druid SQL planner uses. Once the segment is handed off to a historical, however, the historical announces a segment that does not store any null-only columns. As a result, the SQL planner stops recognizing null-only columns once the realtime segment is handed off to a historical.
   
   ### The proposal
   
   This PR stores null-only columns in the segment alongside normal columns, so that Druid's SQL layer can track them properly. After this PR, ingestion jobs will store null-only columns in the historical segments, and historicals will report those null-only columns along with the others in the `SegmentMetadataQuery` result.
   
   #### Feature flag
   
   A new boolean system property, `druid.index.task.storeEmptyColumns`, is added. It is on by default for all native batch ingestion and Kafka/Kinesis streaming ingestion. A new task context parameter, `storeEmptyColumns`, is also added, which can override the system property per task.
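   For illustration, a task could opt out of the feature like this (a hypothetical task spec fragment; all fields other than the context are omitted):

```json
{
  "context": {
    "storeEmptyColumns": false
  }
}
```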
   
   #### Ingestion jobs
   
   When `storeEmptyColumns` is set, the ingestion job will store every column 
explicitly defined in `DimensionsSpec` in the segments it creates.
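   For example, given an explicit dimension list like the following (a hypothetical spec fragment), the `country` column would be persisted even if every ingested row has a null value for it:

```json
{
  "dimensionsSpec": {
    "dimensions": [
      "page",
      "user",
      { "type": "string", "name": "country" }
    ]
  }
}
```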
   
   #### Segment writes/reads
   
   For a null-only column, only the column name and its data type are stored in the segment.
   
   - The column type is stored in `ColumnDescriptor`. `ColumnDescriptor` also has a new `ColumnPartSerde` specialized for null-only columns, i.e., `NullColumnPartSerde`. `NullColumnPartSerde` is a no-op for serialization, but for deserialization it can create a column supplier and indexes that return only nulls.
   - Column name is stored in `index.drd` along with `ColumnDescriptor`.
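   The behavior of the new serde can be sketched as follows. This is a simplified, hypothetical illustration of the idea, not the actual Druid interfaces:

```java
import java.util.function.Supplier;

// Hypothetical, simplified sketch (not the real Druid API) of the idea behind
// NullColumnPartSerde: serialization writes no column data at all, while
// deserialization fabricates a reader that answers null for every row.
public class NullColumnPartSerdeSketch
{
  // No bytes are written for a null-only column's data.
  public long getSerializedSize()
  {
    return 0L;
  }

  // Deserialization: build a value supplier that yields null for any row,
  // so the query engine sees the column as present but entirely null.
  public Supplier<Object> makeColumnSupplier(int numRows)
  {
    return () -> null;
  }
}
```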
   
   Note that the new `NullColumnPartSerde` can cause compatibility issues if a historical of an older version tries to deserialize it. To avoid this problem, the null-only column names are stored at the end of the `index.drd` section, separately from the normal columns. Older historicals will skip this part when loading segments and treat null-only columns as unknown columns.
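   The resulting `index.drd` layout can be pictured roughly like this (a conceptual sketch, not the exact on-disk format):

```text
index.drd
+-- normal column names and descriptors   <- read by all versions
+-- other segment metadata                <- read by all versions
+-- null-only column names                <- appended at the end; older
                                             historicals stop reading
                                             before this part
```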
   
   ### Compatibility
   
   There should be no compatibility issue. As noted in the "Segment 
writes/reads" section, older historicals should be able to read the segments 
that have null-only columns. Newer historicals should be able to read any 
segments that have been created using any Druid version.
   
   ### Future work
   
   - Hadoop ingestion support. It is not supported in this PR to avoid making the PR even bigger. There is no plan to support Tranquility as of now.
   - A new mode for `dimensionsSpec` that allows both implicit and explicit dimensions. Currently, `dimensionsSpec` has only two modes: implicit dimensions only and explicit dimensions only. A hybrid mode that allows both explicit and implicit dimensions would be useful for storing null-only columns together with auto schema discovery. I'm planning to make a follow-up PR for this.
   - Null numeric dimensions are currently always stored even when they are completely empty (null-only). This seems like a bug since 1) the behavior doesn't match that of string dimensions, and 2) it stores all the nulls in the segment file along with a null bitmap index, which seems unnecessary and inefficient for query processing. I'm also planning to make a PR for this.
   - Integration tests and documentation for the new feature.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `NullColumnPartSerde`
    * `V9IndexLoader`
    * `IndexMergerV9`
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   

