writer-jill commented on code in PR #13329:
URL: https://github.com/apache/druid/pull/13329#discussion_r1018004645
##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following
properties:
|-----|-----------|-------|
|bitmap|Compression format for bitmap indexes. Should be a JSON object with
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property
`compressRunOnSerialization` (defaults to true) controls whether or not
run-length encoding will be used when it is determined to be more
space-efficient.|`{"type": "roaring"}`|
|dimensionCompression|Compression format for dimension columns. Options are
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value
dictionaries. The default setting `utf8` suits most use cases.<br>Example to
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize`
is the number of values to place in a bucket for delta encoding. It must be a
power of 2, no larger than 128. Defaults to 4.<br>See [Front
coding](#front-coding) for more information.|`{"type":"utf8"}`|
|metricCompression|Compression format for primitive type metric columns.
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more
efficient than `uncompressed`, but not supported by older versions of
Druid).|`lz4`|
|longEncoding|Encoding format for long-typed columns. Applies regardless of
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto`
encodes the values using offset or lookup table depending on column
cardinality, and store them with variable size. `longs` stores the value as-is
with 8 bytes each.|`longs`|
|jsonCompression|Compression format to use for nested column raw data. Options
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
+##### Front coding
-#### String Dictionary Encoding
+By default, Druid stores values in STRING-typed columns as uncompressed UTF-8
encoded bytes.
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8
encoded bytes.
+Starting in version 25.0, Druid can store STRING columns using an incremental
encoding strategy called front coding. This allows Druid to create smaller
UTF-8 encoded segments with very little performance cost.
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"utf8"` .|n/a|
+If you enable front coding, Druid divides the column values into buckets and
stores the first value in each bucket as-is. Druid stores each subsequent value
in the bucket as a number giving the length of its shared prefix, followed by
the remaining suffix bytes. This technique keeps Druid from storing duplicated
prefixes.
-##### Front Coding
-`STRING` columns can be stored using an incremental encoding strategy called
front coding.
-In the Druid implementation of front coding, the column values are first
divided into buckets,
-and the first value in each bucket is stored as is. The remaining values in
the bucket are stored
-using a number representing a prefix length and the remaining suffix bytes.
-This technique allows the prefix portion of the values in each bucket from
being duplicated.
-The values are still UTF-8 encoded, but front coding can often result in much
smaller segments at very little
-performance cost. Segments created with this encoding are not compatible with
Druid versions older than 25.0.0.
+Setting `bucketSize` to a value larger than the default lets columns with a
high degree of value overlap produce smaller segments, at a slight cost to read
and search performance that scales with the bucket size.
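
The bucketing described above can be sketched in a few lines of Python. This is a toy illustration of the general front-coding idea, not Druid's actual segment format; in particular, whether the prefix length is measured against the bucket's first value or the preceding value is an implementation detail, and this sketch uses the preceding value.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_code(values, bucket_size=4):
    """Front-code a sorted list of strings into buckets.

    The first value in each bucket is kept whole; each later value is
    stored as (length of prefix shared with the previous value, suffix),
    so repeated prefixes are not stored twice.
    """
    buckets = []
    for i in range(0, len(values), bucket_size):
        chunk = values[i:i + bucket_size]
        encoded = [chunk[0]]  # bucket head, stored as-is
        for prev, cur in zip(chunk, chunk[1:]):
            n = shared_prefix_len(prev, cur)
            encoded.append((n, cur[n:]))
        buckets.append(encoded)
    return buckets

# "applet" shares 5 characters with "apple", so only (5, "t") is stored.
print(front_code(["apple", "applet", "apply", "banana"], bucket_size=4))
```

Sorted dictionaries make neighboring values share long prefixes, which is why larger buckets help when values overlap heavily.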
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"frontCoded"` .|n/a|
-|bucketSize|The number of values to place in a bucket to perform delta
encoding, must be a power of 2, maximum is 128. Larger buckets allow columns
with a high degree of overlap to produce smaller segments at a slight cost to
read and search performance which scales with bucket size.|4|
+Example `indexSpec` snippet with front coding enabled:
+
+```json
+"indexSpec": {
+  "bitmap": { "type": "roaring" },
+  "dimensionCompression": "lz4",
+  "metricCompression": "lz4",
+  "jsonCompression": "lz4",
+  "longEncoding": "auto",
+  "stringDictionaryEncoding": {
+    "type": "frontCoded",
+    "bucketSize": 4
+  }
+}
+```
Review Comment:
I think the additional detail in the table means that we can omit this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]