clintropolis opened a new pull request, #18880:
URL: https://github.com/apache/druid/pull/18880
### Description
This PR introduces a new segment format that draws on years of experience
with the v9 format and is designed to allow partial segment downloads, which
should greatly improve the efficiency and responsiveness of the virtual
storage fabric functionality introduced in #18176 (partial segment downloads
are _not_ part of this PR). Overall, the changes are more of a remix of v9
than a major departure from it. To streamline partial fetches, the base
segment contents are combined into a single file, currently named
`druid.segment` in this PR (thoughts on the name are welcome; I'm not terribly
attached to this one).
Set `druid.indexer.task.buildV10=true` to write segments in the new format.
#### Layout
```
| version (byte) | meta compression (byte) | meta length (int) | meta json |
chunk 0 | chunk 1 | ... | chunk n |
```
**version**: equivalent to `version.bin` in the v9 format; a byte that
indicates the segment version
**meta compression, meta length, meta json**: the unified segment metadata,
the newly added `SegmentFileMetadata`
**chunks** (containers): equivalent to the smoosh chunks of the v9 format
(e.g. `00000.smoosh`), but concatenated together so that ranges of the file
can be mapped based on offsets stored in the unified metadata.
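As a rough illustration of the layout above, here is a minimal Java sketch that parses the fixed-size prefix and slices out the metadata bytes. It is not the reader from this PR: the class and method names are hypothetical, and it assumes big-endian byte order, a version byte of 10, and that a compression byte of 0 means uncompressed UTF-8 JSON, none of which is stated in the description.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names and encodings are assumptions, not the PR's API.
class SegmentHeaderSketch {
  /** Parsed prefix of the segment file plus the raw metadata bytes. */
  static final class Header {
    final byte version;
    final byte metaCompression;
    final byte[] metaBytes; // raw bytes; still compressed if the codec says so
    final int chunksStart;  // file offset at which chunk 0 begins

    Header(byte version, byte metaCompression, byte[] metaBytes, int chunksStart) {
      this.version = version;
      this.metaCompression = metaCompression;
      this.metaBytes = metaBytes;
      this.chunksStart = chunksStart;
    }
  }

  /** Reads: version (byte) | meta compression (byte) | meta length (int) | meta json. */
  static Header readHeader(ByteBuffer file) {
    // duplicate() gives an independent position so the caller's buffer is untouched
    ByteBuffer buf = file.duplicate();
    buf.order(ByteOrder.BIG_ENDIAN); // assumption: the real byte order is not stated
    byte version = buf.get();
    byte compression = buf.get();
    int metaLength = buf.getInt();
    if (metaLength < 0 || metaLength > buf.remaining()) {
      throw new IllegalArgumentException("bad meta length: " + metaLength);
    }
    byte[] meta = new byte[metaLength];
    buf.get(meta);
    return new Header(version, compression, meta, buf.position());
  }

  /** Builds a tiny in-memory file in the same layout, followed by 3 dummy chunk bytes. */
  static byte[] sample(byte version, String json) {
    byte[] meta = json.getBytes(StandardCharsets.UTF_8);
    ByteBuffer out = ByteBuffer.allocate(1 + 1 + 4 + meta.length + 3);
    out.put(version).put((byte) 0).putInt(meta.length).put(meta).put(new byte[] {1, 2, 3});
    return out.array();
  }

  public static void main(String[] args) {
    Header h = readHeader(ByteBuffer.wrap(sample((byte) 10, "{\"k\":1}")));
    System.out.println("version=" + h.version + " chunksStart=" + h.chunksStart);
    System.out.println(new String(h.metaBytes, StandardCharsets.UTF_8));
  }
}
```

Because the metadata length is part of the fixed-size prefix, a reader can find where chunk 0 begins without scanning any column data.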
#### SegmentFileMetadata
One of the bigger changes compared to the V9 format is the consolidation of
all the various metadata stored in the segment into a single json blob,
`SegmentFileMetadata`. In the V9 segment format, metadata is split across a
variety of places:
* meta.smoosh: lists which internal files are present and their offsets
within the smoosh containers
* index.drd: list of non-null columns, list of non-null dimensions,
interval, bitmap factory, list of all columns including nulls, list of all
dimensions including null-only columns
* metadata.drd: aggs, timestampSpec, query granularity, rollup flag,
ordering, list of projections
* `ColumnDescriptor`: scattered across the internal files of the smoosh;
each contains type information and how to load a column supplier
This metadata has all been consolidated into a single place to make it easy
to retrieve the metadata about both schema and layout, which is the key to how
V10 will be able to support partial downloads. Schema information is expressed
as a set of projections (including modeling the base table as a projection),
and the `ColumnDescriptor`s are pulled out of the column files and instead
live in the metadata. In virtual storage mode, this metadata will be fetched on
segment load. Since it contains both where in the file the data is located and
how to read it, only the data which is actually required to complete the query
will need to be fetched.
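To make the idea concrete, the following Java sketch shows how offsets held in the metadata could be turned into the byte ranges a query needs. It does not reflect the real structure of `SegmentFileMetadata`, which this description does not spell out; `FileLocation`, `ByteRange`, `planFetch`, and the sample names and offsets are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these types are illustrative, not classes from the PR.
class PartialFetchSketch {
  /** Where one internal file lives: which container, and the offset inside it. */
  static final class FileLocation {
    final int container;
    final long offset;
    final long length;

    FileLocation(int container, long offset, long length) {
      this.container = container;
      this.offset = offset;
      this.length = length;
    }
  }

  /** Half-open byte range [start, endExclusive) within the segment file. */
  static final class ByteRange {
    final long start;
    final long endExclusive;

    ByteRange(long start, long endExclusive) {
      this.start = start;
      this.endExclusive = endExclusive;
    }
  }

  /** Resolves the internal files a query needs into absolute byte ranges. */
  static List<ByteRange> planFetch(
      Map<String, FileLocation> files,
      long[] containerStarts, // absolute file offset of each container
      List<String> needed
  ) {
    List<ByteRange> ranges = new ArrayList<>();
    for (String name : needed) {
      FileLocation loc = files.get(name);
      if (loc == null) {
        throw new IllegalArgumentException("unknown internal file: " + name);
      }
      long start = containerStarts[loc.container] + loc.offset;
      ranges.add(new ByteRange(start, start + loc.length));
    }
    return ranges;
  }

  public static void main(String[] args) {
    Map<String, FileLocation> files = new LinkedHashMap<>();
    files.put("__time", new FileLocation(0, 0, 100));
    files.put("dim", new FileLocation(0, 100, 50));
    files.put("metric", new FileLocation(1, 0, 200));
    long[] containerStarts = {64, 214};
    // A query touching only __time and metric never needs the bytes of "dim".
    for (ByteRange r : planFetch(files, containerStarts, Arrays.asList("__time", "metric"))) {
      System.out.println(r.start + ".." + r.endExclusive);
    }
  }
}
```

A real implementation would likely also coalesce adjacent ranges and account for decompression; the sketch omits both to keep the offset arithmetic visible.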
#### External files
The V10 format also supports the concept of 'external' segment containers,
which can be 'attached' to the base segment to augment it with
additional/optional data; this PR has only very rudimentary support for them.
This is a very experimental feature. Our initial thinking is to support use
cases like optional indexes that can be downloaded separately (or even
constructed at load time/on the fly). In the current implementation provided in
this PR, column serializers can specify additional 'external' segment files to
write contents to during segment creation, and readers can refer to these files
during segment mapping.
In its current form this is more of an organizational feature: if used, the
external segment files will just be included and pushed to deep storage as part
of publishing, and downloaded on fetch, but no column implementations actually
use this yet. Future work will expand on this functionality to realize the
ideas suggested above.
#### Release note
_todo_
<hr>
This PR has:
- [ ] been self-reviewed.
- [ ] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.