jihoonson opened a new pull request #12250:
URL: https://github.com/apache/druid/pull/12250
### Terminology
- Column: a logical column that can be stored across multiple segments.
- Null-only column: a column that has only nulls in it. Druid is aware of
this column.
- Unknown column: a column that Druid is not aware of. In other words, it is
a column that Druid is not tracking at all via segment metadata or any other
methods.
### The problem
Today, null-only columns are not stored at ingestion time, so they become
unknown columns once the ingestion job is done. The Druid native query engine
uses segment-level schemas and treats unknown columns as if they were null-only
columns; reading an unknown column returns only nulls.
Druid SQL is different. Druid uses the Calcite SQL planner, which requires
valid column information upfront at planning time. That column information
comes from the datasource-level schema, which is dynamically discovered by
merging segment schemas. As a result, users cannot query unknown columns via
SQL.
This causes several issues. One of the biggest is that the same SQL query
against a streaming-ingested datasource can fail intermittently. While it
creates a segment, the streaming ingestion task announces a realtime segment
that has all columns in the ingestion spec; the task thus reports even
null-only columns for the realtime segment to the broker, whose schema the
Druid SQL planner uses. Once the segment is handed off to a historical,
however, the historical announces a historical segment that does not store any
null-only columns. As a result, the SQL planner stops recognizing null-only
columns once the realtime segment is handed off to a historical.
### The proposal
This PR stores null-only columns alongside normal columns in the segment, so
that Druid's SQL layer can track them properly. After this PR, ingestion jobs
will store null-only columns in historical segments, and historicals will
report those null-only columns along with the others in the
`SegmentMetadataQuery` result.
#### Feature flag
A new boolean system property, `druid.index.task.storeEmptyColumns`, is
added. It is on by default for all native batch ingestion and Kafka/Kinesis
streaming ingestion. A new task context key, `storeEmptyColumns`, is also
added; it can override the system property per task.
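For illustration only, a task could opt out of the cluster-wide default via its
context. This is a hedged sketch, not a complete task payload; the `index_parallel`
task type is used here as an example, and only the context key is taken from this PR:

```json
{
  "type": "index_parallel",
  "context": {
    "storeEmptyColumns": false
  }
}
```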
#### Ingestion jobs
When `storeEmptyColumns` is set, the ingestion job will store every column
explicitly defined in `DimensionsSpec` in the segments it creates.
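As a sketch of what "explicitly defined" means here, consider a `dimensionsSpec`
like the one below (the dimension names are hypothetical). With
`storeEmptyColumns` enabled, `country` would be persisted in the created
segments even if every ingested row left it null:

```json
{
  "dimensionsSpec": {
    "dimensions": ["page", "user", "country"]
  }
}
```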
#### Segment writes/reads
For null-only columns, only the column name and data type are stored in the
segment.
- The column type is stored in `ColumnDescriptor`. `ColumnDescriptor` also
gets a new `ColumnPartSerde` specialized for null-only columns,
`NullColumnPartSerde`. `NullColumnPartSerde` is a no-op for serialization; for
deserialization, it creates a column supplier and indexes that return only
nulls.
- The column name is stored in `index.drd` along with the `ColumnDescriptor`.
Note that the new `NullColumnPartSerde` could cause compatibility issues if a
historical of an older version tried to deserialize it. To avoid this problem,
the null-only column names are stored at the end of the `index.drd` section,
separately from the normal columns. Older historicals will skip this part when
loading segments and treat null-only columns as unknown columns.
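The write/read asymmetry above can be sketched as follows. This is not Druid's
actual API: the class and method names (`NullColumnSketch`, `serialize`,
`deserialize`, `Column`) are hypothetical, illustrating only the idea that
serialization contributes zero bytes while deserialization rebuilds a
null-returning column purely from metadata:

```java
import java.nio.ByteBuffer;

public class NullColumnSketch {
    /** A column that pretends to hold numRows rows, all of them null. */
    interface Column {
        Object getValue(int rowId);
        int numRows();
    }

    /** Serializer: a null-only column contributes no bytes to the segment body. */
    static ByteBuffer serialize() {
        // No-op: only the column name and type live in segment metadata.
        return ByteBuffer.allocate(0);
    }

    /** Deserializer: reconstruct a null-only column purely from metadata. */
    static Column deserialize(int numRows) {
        return new Column() {
            @Override public Object getValue(int rowId) { return null; }
            @Override public int numRows() { return numRows; }
        };
    }

    public static void main(String[] args) {
        Column col = deserialize(3);
        // Nothing was written, yet every read returns null.
        assert serialize().remaining() == 0;
        for (int i = 0; i < col.numRows(); i++) {
            assert col.getValue(i) == null;
        }
        System.out.println("null-only column: " + col.numRows() + " rows, all null");
    }
}
```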
### Compatibility
There should be no compatibility issues. As noted in the "Segment
writes/reads" section, older historicals should be able to read segments that
contain null-only columns, and newer historicals should be able to read
segments created by any Druid version.
### Future work
- Hadoop ingestion support. It is not included here because it would make the
PR even bigger. There is no plan to support Tranquility as of now.
- A new mode for `dimensionsSpec` that allows both implicit and explicit
dimensions. Currently, `dimensionsSpec` has only two modes: implicit dimensions
only and explicit dimensions only. A hybrid mode allowing both would be useful
for storing null-only columns together with auto schema discovery. I'm planning
a follow-up PR for this.
- Null numeric dimensions are currently always stored, even when they are
completely empty (null-only). This seems like a bug because 1) the behavior
doesn't match string dimensions, and 2) it stores all the nulls in the segment
file along with a null bitmap index, which seems unnecessary and inefficient
for query processing. I'm also planning a PR for this.
- Integration tests and documentation for the new feature.
<hr>
##### Key changed/added classes in this PR
* `NullColumnPartSerde`
* `V9IndexLoader`
* `IndexMergerV9`
<hr>
This PR has:
- [x] been self-reviewed.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code
wherever it would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.