clintropolis commented on code in PR #13329:
URL: https://github.com/apache/druid/pull/13329#discussion_r1017882146


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following 
properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property 
`compressRunOnSerialization` (defaults to true) controls whether or not 
run-length encoding will be used when it is determined to be more 
space-efficient.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value 
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value 
dictionaries. The default setting `utf8` suits most use cases.<br>Example to 
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` 
is the number of values to place in a bucket to perform delta encoding. Must be 
a power of 2, maximum is 128. Defaults to 4.<br>See [Front 
coding](#front-coding) for more information.|`{"type":"utf8"}`|
 |metricCompression|Compression format for primitive type metric columns. 
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more 
efficient than `uncompressed`, but not supported by older versions of 
Druid).|`lz4`|
 |longEncoding|Encoding format for long-typed columns. Applies regardless of 
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` 
encodes the values using offset or lookup table depending on column 
cardinality, and store them with variable size. `longs` stores the value as-is 
with 8 bytes each.|`longs`|
 |jsonCompression|Compression format to use for nested column raw data. Options 
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
 
+##### Front coding
 
-#### String Dictionary Encoding
+By default, Druid stores values in STRING-typed columns as uncompressed UTF-8 
encoded bytes.
 
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8 
encoded bytes.
+Starting in version 25.0, Druid can store STRING columns using an incremental 
encoding strategy called front coding. This allows Druid to create smaller 
UTF-8 encoded segments with very little performance cost.
 
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"utf8"` .|n/a|
+If you enable front coding, Druid divides the column values into buckets, 
storing the first value in each bucket as it is. Druid stores subsequent values 
in the bucket using a number representing the length of the prefix and the 
remainder of the value. This technique prevents Druid from storing duplicated 
prefixes.
 
-##### Front Coding
-`STRING` columns can be stored using an incremental encoding strategy called 
front coding.
-In the Druid implementation of front coding, the column values are first 
divided into buckets,
-and the first value in each bucket is stored as is. The remaining values in 
the bucket are stored
-using a number representing a prefix length and the remaining suffix bytes.
-This technique allows the prefix portion of the values in each bucket from 
being duplicated.
-The values are still UTF-8 encoded, but front coding can often result in much 
smaller segments at very little
-performance cost. Segments created with this encoding are not compatible with 
Druid versions older than 25.0.0.
+If you set `bucketSize` to a larger number than the default, larger buckets 
allow columns with a high degree of overlap to produce smaller segments. This 
change causes a slight cost to read and search performance which scales with 
bucket size.
 
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"frontCoded"` .|n/a|
-|bucketSize|The number of values to place in a bucket to perform delta 
encoding, must be a power of 2, maximum is 128. Larger buckets allow columns 
with a high degree of overlap to produce smaller segments at a slight cost to 
read and search performance which scales with bucket size.|4|
+Example `indexSpec` snippet with front coding enabled:
+
+```plaintext
+"indexSpec": {
+    "bitmap": { "type": "roaring" },
+    "dimensionCompression": "lz4",
+    "metricCompression": "lz4",
+    "jsonCompression": "lz4",
+    "longEncoding": "auto",
+    "stringDictionaryEncoding": {
+      "type": "utf8"
+    }
+  }
+```
 
+> In most cases the default stringDictionaryEncoding setting `{"type":"utf8"}` 
is suitable. Enable front coding in a test system and fully test your use case 
before you change the default in production. Segments created with front coding 
enabled are not compatible with Druid versions older than 25.0.

Review Comment:
   
   ```suggestion
   > Front coding is new to Druid 25.0, so the current recommendation is to enable it in a staging environment and fully test your use case before using it in production. Segments created with front coding enabled are not compatible with Druid versions older than 25.0.
   ```
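For readers following along, the bucketed front-coding scheme the quoted docs text describes (the first value in each bucket stored as-is, later values stored as a prefix length plus the remaining suffix bytes) can be sketched roughly like this. This is an illustrative model only, not Druid's actual `frontCoded` implementation; all names here are invented, and the sketch computes prefixes against the bucket head for simplicity:

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared byte prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_code(values, bucket_size=4):
    """Encode a sorted list of strings into front-coded buckets.

    The first value in each bucket is kept whole; each later value is
    stored as (prefix_length, suffix_bytes) relative to the bucket head.
    """
    assert bucket_size & (bucket_size - 1) == 0, "bucketSize must be a power of 2"
    buckets = []
    for i in range(0, len(values), bucket_size):
        chunk = [v.encode("utf-8") for v in values[i:i + bucket_size]]
        head = chunk[0]
        rest = []
        for v in chunk[1:]:
            p = common_prefix_len(head, v)
            rest.append((p, v[p:]))
        buckets.append((head, rest))
    return buckets

def front_decode(buckets):
    """Reconstruct the original string list from the buckets."""
    out = []
    for head, rest in buckets:
        out.append(head.decode("utf-8"))
        for plen, suffix in rest:
            out.append((head[:plen] + suffix).decode("utf-8"))
    return out

values = sorted(["druid", "druidism", "drum", "drumming", "drupe", "dry"])
assert front_decode(front_code(values, bucket_size=4)) == values
```

Because only prefix lengths and suffixes are stored for non-head values, shared prefixes within a bucket are written once, which is where the segment size savings come from.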



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following 
properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property 
`compressRunOnSerialization` (defaults to true) controls whether or not 
run-length encoding will be used when it is determined to be more 
space-efficient.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value 
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value 
dictionaries. The default setting `utf8` suits most use cases.<br>Example to 
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` 
is the number of values to place in a bucket to perform delta encoding. Must be 
a power of 2, maximum is 128. Defaults to 4.<br>See [Front 
coding](#front-coding) for more information.|`{"type":"utf8"}`|
 |metricCompression|Compression format for primitive type metric columns. 
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more 
efficient than `uncompressed`, but not supported by older versions of 
Druid).|`lz4`|
 |longEncoding|Encoding format for long-typed columns. Applies regardless of 
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` 
encodes the values using offset or lookup table depending on column 
cardinality, and store them with variable size. `longs` stores the value as-is 
with 8 bytes each.|`longs`|
 |jsonCompression|Compression format to use for nested column raw data. Options 
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
 
+##### Front coding
 
-#### String Dictionary Encoding
+By default, Druid stores values in STRING-typed columns as uncompressed UTF-8 
encoded bytes.
 
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8 
encoded bytes.
+Starting in version 25.0, Druid can store STRING columns using an incremental 
encoding strategy called front coding. This allows Druid to create smaller 
UTF-8 encoded segments with very little performance cost.
 
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"utf8"` .|n/a|
+If you enable front coding, Druid divides the column values into buckets, 
storing the first value in each bucket as it is. Druid stores subsequent values 
in the bucket using a number representing the length of the prefix and the 
remainder of the value. This technique prevents Druid from storing duplicated 
prefixes.
 
-##### Front Coding
-`STRING` columns can be stored using an incremental encoding strategy called 
front coding.
-In the Druid implementation of front coding, the column values are first 
divided into buckets,
-and the first value in each bucket is stored as is. The remaining values in 
the bucket are stored
-using a number representing a prefix length and the remaining suffix bytes.
-This technique allows the prefix portion of the values in each bucket from 
being duplicated.
-The values are still UTF-8 encoded, but front coding can often result in much 
smaller segments at very little
-performance cost. Segments created with this encoding are not compatible with 
Druid versions older than 25.0.0.
+If you set `bucketSize` to a larger number than the default, larger buckets 
allow columns with a high degree of overlap to produce smaller segments. This 
change causes a slight cost to read and search performance which scales with 
bucket size.
 
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"frontCoded"` .|n/a|
-|bucketSize|The number of values to place in a bucket to perform delta 
encoding, must be a power of 2, maximum is 128. Larger buckets allow columns 
with a high degree of overlap to produce smaller segments at a slight cost to 
read and search performance which scales with bucket size.|4|
+Example `indexSpec` snippet with front coding enabled:
+
+```plaintext
+"indexSpec": {
+    "bitmap": { "type": "roaring" },
+    "dimensionCompression": "lz4",
+    "metricCompression": "lz4",
+    "jsonCompression": "lz4",
+    "longEncoding": "auto",
+    "stringDictionaryEncoding": {
+      "type": "utf8"
+    }
+  }

Review Comment:
   This indexSpec example doesn't actually have `frontCoded` set as the type. What do you think about moving the example indexSpec to just be a general example for `indexSpec` and locating it near the table that defines all of the options?

   If you think we really need an example in this section, I suggest using a shorter form
   ```
         "indexSpec":{
           "stringDictionaryEncoding": {
             "type": "frontCoded", "bucketSize":4
           }
         }
   ```
   
   since all options on indexSpec are optional and will be filled in with 
defaults if not present.
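As an aside on the `bucketSize` constraint the table states (power of 2, maximum 128), the check is simple enough to express as a tiny sketch. The helper below is hypothetical and not part of Druid's codebase:

```python
def valid_bucket_size(n: int) -> bool:
    """Documented constraint: bucketSize must be a power of 2, at most 128."""
    return 1 <= n <= 128 and (n & (n - 1)) == 0

# The short-form spec from the snippet above, as a Python dict:
spec = {"stringDictionaryEncoding": {"type": "frontCoded", "bucketSize": 4}}
assert valid_bucket_size(spec["stringDictionaryEncoding"]["bucketSize"])
```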



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following 
properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property 
`compressRunOnSerialization` (defaults to true) controls whether or not 
run-length encoding will be used when it is determined to be more 
space-efficient.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value 
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value 
dictionaries. The default setting `utf8` suits most use cases.<br>Example to 
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` 
is the number of values to place in a bucket to perform delta encoding. Must be 
a power of 2, maximum is 128. Defaults to 4.<br>See [Front 
coding](#front-coding) for more information.|`{"type":"utf8"}`|

Review Comment:
   Suggest clarifying that this applies to `STRING` and `COMPLEX<json>` columns
   ```suggestion
   |stringDictionaryEncoding|Encoding format for STRING value dictionaries used 
by `STRING` and `COMPLEX<json>` columns. <br>Example to enable front coding: 
`{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` is the number of 
values to place in a bucket to perform delta encoding. Must be a power of 2, 
maximum is 128. Defaults to 4.<br>See [Front coding](#front-coding) for more 
information.|`{"type":"utf8"}`|
   ```



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following 
properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property 
`compressRunOnSerialization` (defaults to true) controls whether or not 
run-length encoding will be used when it is determined to be more 
space-efficient.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value 
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value 
dictionaries. The default setting `utf8` suits most use cases.<br>Example to 
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` 
is the number of values to place in a bucket to perform delta encoding. Must be 
a power of 2, maximum is 128. Defaults to 4.<br>See [Front 
coding](#front-coding) for more information.|`{"type":"utf8"}`|

Review Comment:
   Hmm, thinking a bit more about this, I wonder if the additional details here would be enough if we also added the bit about being new to Druid 25.0, and instead dropped all of the extra sections dedicated to `stringDictionaryEncoding`.
   
   My reasoning is that these `indexSpec` settings provide direct control over low-level segment storage details. They are primarily intended for experimentation by _the Druid developers_ to assist in determining the best default values for these settings, though I imagine advanced cluster operators also use them when trying to maximally optimize a cluster.
   
   We have to have all of this text to describe implementation details of the 
`frontCoded` option so that the operator can begin to understand how to 
correctly set the `bucketSize` parameter, and the implications of different 
sizes also require a decent bit of understanding of both the data and the way 
Druid works internally.
   
   Did we add enough information to make an informed decision? I'm not sure I'm a good judge of that 😅 If it's actually enough to understand reasonably well, I can be convinced to leave all of these details.
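As a rough illustration of the size/performance trade-off under discussion: larger buckets amortize the fully-stored bucket heads over more delta-encoded entries, while a point lookup has to scan more suffixes from the bucket head. The model below is a simplified byte-count estimate, not Druid's actual storage format:

```python
def prefix_len(a: bytes, b: bytes) -> int:
    """Length of the shared byte prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def encoded_size(values, bucket_size):
    """Approximate bytes stored: bucket heads kept whole, other values as a
    1-byte prefix length plus the suffix relative to the bucket head."""
    total = 0
    for i in range(0, len(values), bucket_size):
        chunk = [v.encode("utf-8") for v in values[i:i + bucket_size]]
        head = chunk[0]
        total += len(head)
        for v in chunk[1:]:
            total += 1 + (len(v) - prefix_len(head, v))
    return total

# High-overlap dictionary: URL-like values sharing a long common prefix.
values = sorted(f"https://example.com/api/v2/items/{i:06d}" for i in range(1024))
for bs in (4, 16, 64, 128):
    print(bs, encoded_size(values, bs))
# Size shrinks as bucket_size grows, but a lookup must scan up to
# bucket_size - 1 suffixes within a bucket, so reads get slower.
```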
   
   



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -477,35 +477,37 @@ The `indexSpec` object can include the following 
properties:
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`. For type `roaring`, the boolean property 
`compressRunOnSerialization` (defaults to true) controls whether or not 
run-length encoding will be used when it is determined to be more 
space-efficient.|`{"type": "roaring"}`|
 |dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value 
dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING-typed column value 
dictionaries. The default setting `utf8` suits most use cases.<br>Example to 
enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` 
is the number of values to place in a bucket to perform delta encoding. Must be 
a power of 2, maximum is 128. Defaults to 4.<br>See [Front 
coding](#front-coding) for more information.|`{"type":"utf8"}`|
 |metricCompression|Compression format for primitive type metric columns. 
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more 
efficient than `uncompressed`, but not supported by older versions of 
Druid).|`lz4`|
 |longEncoding|Encoding format for long-typed columns. Applies regardless of 
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` 
encodes the values using offset or lookup table depending on column 
cardinality, and store them with variable size. `longs` stores the value as-is 
with 8 bytes each.|`longs`|
 |jsonCompression|Compression format to use for nested column raw data. Options 
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
 
+##### Front coding
 
-#### String Dictionary Encoding
+By default, Druid stores values in STRING-typed columns as uncompressed UTF-8 
encoded bytes.
 
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8 
encoded bytes.
+Starting in version 25.0, Druid can store STRING columns using an incremental 
encoding strategy called front coding. This allows Druid to create smaller 
UTF-8 encoded segments with very little performance cost.

Review Comment:
   This option applies not only to regular `STRING`-typed columns but also to the nested `STRING` columns of `COMPLEX<json>` columns, since both have string value dictionaries internally. I'm not sure of the best way to clearly describe this here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

