writer-jill commented on code in PR #13329:
URL: https://github.com/apache/druid/pull/13329#discussion_r1018004645
##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following
properties:
|-----|-----------|-------|
|bitmap|Compression format for bitmap indexes. Should be a JSON object with
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property
`compressRunOnSerialization` (defaults to true) controls whether or not
run-length encoding will be used when it is determined to be more
space-efficient.|`{"type": "roaring"}`|
|dimensionCompression|Compression format for dimension columns. Options are
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value
dictionaries. The default setting `utf8` suits most use cases.<br>Example to
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize`
is the number of values to place in a bucket for delta encoding. It must be a
power of 2, no larger than 128. Defaults to 4.<br>See [Front
coding](#front-coding) for more information.|`{"type":"utf8"}`|
|metricCompression|Compression format for primitive type metric columns.
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more
efficient than `uncompressed`, but not supported by older versions of
Druid).|`lz4`|
|longEncoding|Encoding format for long-typed columns. Applies regardless of
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto`
encodes the values using offset or lookup table depending on column
cardinality, and store them with variable size. `longs` stores the value as-is
with 8 bytes each.|`longs`|
|jsonCompression|Compression format to use for nested column raw data. Options
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
+##### Front coding
-#### String Dictionary Encoding
+By default, Druid stores values in STRING-typed columns as uncompressed UTF-8
encoded bytes.
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8
encoded bytes.
+Starting in version 25.0, Druid can store STRING columns using an incremental
encoding strategy called front coding. This allows Druid to create smaller
UTF-8 encoded segments with very little performance cost.
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"utf8"` .|n/a|
+If you enable front coding, Druid divides the column values into buckets and
stores the first value in each bucket as-is. Druid stores each subsequent value
in the bucket as a number giving the length of its shared prefix, followed by
the remaining suffix bytes. This technique keeps Druid from storing duplicated
prefixes.
-##### Front Coding
-`STRING` columns can be stored using an incremental encoding strategy called
front coding.
-In the Druid implementation of front coding, the column values are first
divided into buckets,
-and the first value in each bucket is stored as is. The remaining values in
the bucket are stored
-using a number representing a prefix length and the remaining suffix bytes.
-This technique allows the prefix portion of the values in each bucket from
being duplicated.
-The values are still UTF-8 encoded, but front coding can often result in much
smaller segments at very little
-performance cost. Segments created with this encoding are not compatible with
Druid versions older than 25.0.0.
+Setting `bucketSize` to a value larger than the default lets columns with a
high degree of value overlap produce smaller segments, at a slight cost to read
and search performance that scales with the bucket size.
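
The bucketing described above can be sketched in a few lines of Python. This is a toy illustration of the general front-coding idea, not Druid's actual segment format; in particular, whether the prefix length is measured against the bucket's first value or the preceding value is an implementation detail, and this sketch uses the preceding value.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_code(values, bucket_size=4):
    """Front-code a sorted list of strings into buckets.

    The first value in each bucket is kept whole; each later value is
    stored as (length of prefix shared with the previous value, suffix),
    so repeated prefixes are not stored twice.
    """
    buckets = []
    for i in range(0, len(values), bucket_size):
        chunk = values[i:i + bucket_size]
        encoded = [chunk[0]]  # bucket head, stored as-is
        for prev, cur in zip(chunk, chunk[1:]):
            n = shared_prefix_len(prev, cur)
            encoded.append((n, cur[n:]))
        buckets.append(encoded)
    return buckets

# "applet" shares 5 characters with "apple", so only (5, "t") is stored.
print(front_code(["apple", "applet", "apply", "banana"], bucket_size=4))
```

Sorted dictionaries make neighboring values share long prefixes, which is why larger buckets help when values overlap heavily.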
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"frontCoded"` .|n/a|
-|bucketSize|The number of values to place in a bucket to perform delta
encoding, must be a power of 2, maximum is 128. Larger buckets allow columns
with a high degree of overlap to produce smaller segments at a slight cost to
read and search performance which scales with bucket size.|4|
+Example `indexSpec` snippet with front coding enabled:
+
+```json
+"indexSpec": {
+  "bitmap": { "type": "roaring" },
+  "dimensionCompression": "lz4",
+  "metricCompression": "lz4",
+  "jsonCompression": "lz4",
+  "longEncoding": "auto",
+  "stringDictionaryEncoding": {
+    "type": "frontCoded",
+    "bucketSize": 4
+  }
+}
+```
Review Comment:
I think the additional detail in the table means that we can omit this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]