jihoonson opened a new pull request #12279: URL: https://github.com/apache/druid/pull/12279
### Terminology

- Column: a logical column that can be stored across multiple segments.
- Null column: a column that contains only nulls. Druid is aware of this column.
- Unknown column: a column that Druid is not aware of. In other words, it is a column that Druid is not tracking at all via segment metadata or any other method.

### Description

Today, null columns are not stored at ingestion time, so they become unknown columns once the ingestion job is done. The Druid native query engine uses segment-level schemas and treats unknown columns as if they were null columns; reading an unknown column returns only nulls.

Druid SQL is different. Druid uses the Calcite SQL planner, which requires valid column information at planning time. That column information is retrieved from the datasource-level schema, which is dynamically discovered by merging segment schemas. As a result, users cannot query unknown columns using SQL.

This causes a couple of issues. One of the main ones is SQL queries failing against stream ingestion from time to time. While it creates segments, the realtime task announces a realtime segment that has all columns in the ingestion spec. The realtime task thus reports even null columns for the realtime segment to the broker, and those columns can be used by the Druid SQL planner. Once the segment is handed off to a historical, the historical announces a segment that does not store any null columns. As a result, the same SQL query no longer works after the segment is handed off.

### Proposed solution

To make the SQL planner aware of null columns, Druid needs to track them. This PR proposes storing those null columns in the segment just like normal columns.

#### Feature flag

`druid.index.task.storeEmptyColumns` is added. It is enabled by default. A new task context parameter, `storeEmptyColumns`, is added, which can override the system property.
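Assuming the standard Druid runtime-property and task-context mechanisms, the cluster-wide default would be set via the new property (e.g. `druid.index.task.storeEmptyColumns=true` in the runtime properties), and a single task could override it through its context. A sketch of such an override in a task payload (the `type` and `spec` fields here are illustrative placeholders, not part of this PR):

```json
{
  "type": "index_parallel",
  "spec": { "...": "usual ingestion spec" },
  "context": {
    "storeEmptyColumns": false
  }
}
```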
#### Ingestion tasks

When `storeEmptyColumns` is set, the task stores every column specified in `DimensionsSpec` in the segments it creates. This applies to all kinds of ingestion except Hadoop ingestion and Tranquility.

#### Segment writes/reads

For null columns, Druid stores the column name, column type, number of rows, and bitmapSerdeFactory. The first two are stored in `ColumnDescriptor`; the last two are stored in `NullColumnPartSerde`. `NullColumnPartSerde` has a no-op serializer and a deserializer that can dynamically create a bitmap index and a dictionary. Finally, the null column names are stored at the end of `index.drd`, separately from normal columns. This preserves compatibility with older historicals: when they read a segment that has null columns stored, they won't be aware of those columns, but will simply ignore them without failing.

#### Test plan

- Unit tests are added in this PR to verify compatibility with older historicals.
- Unit tests are added in this PR to verify that null columns are stored in segments.
- Integration tests will be added in https://github.com/apache/druid/pull/12268.

#### Future work

- Currently, null numeric dimensions are always stored even without this change. I would call this a bug because 1) the behavior doesn't match that of string dimensions, and 2) all nulls are currently stored in the segment file along with the null bitmap index and are read back for query processing, which is unnecessary and inefficient.
- Hadoop ingestion may be supported later.

<hr>

##### Key changed/added classes in this PR

* `NullColumnPartSerde`
* `IndexIO`
* `IndexMergerV9`

<hr>

This PR has:
- [x] been self-reviewed.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
