This is an automated email from the ASF dual-hosted git repository.
cwylie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new c2210c4e09 Update ingestion spec doc (#13329)
c2210c4e09 is described below
commit c2210c4e098a90999678ed05807f7cd59f149362
Author: Jill Osborne <[email protected]>
AuthorDate: Thu Nov 10 10:54:35 2022 +0000
Update ingestion spec doc (#13329)
* Update ingestion spec doc
* Updated
* Updated
* Update docs/ingestion/ingestion-spec.md
Co-authored-by: Clint Wylie <[email protected]>
* Updated
* Updated
Co-authored-by: Clint Wylie <[email protected]>
---
docs/ingestion/ingestion-spec.md | 27 +++++----------------------
1 file changed, 5 insertions(+), 22 deletions(-)
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index 058858292f..56f7630b23 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -477,35 +477,18 @@ The `indexSpec` object can include the following properties:
|-----|-----------|-------|
|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "roaring"}`|
|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX<json> columns. <br>Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br>See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
|jsonCompression|Compression format to use for nested column raw data. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
+##### Front coding
-#### String Dictionary Encoding
+Starting in version 25.0, Druid can store STRING and [COMPLEX<json>](../querying/nested-columns.md) columns using an incremental encoding strategy called front coding. This allows Druid to create smaller UTF-8 encoded segments with very little performance cost.
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8 encoded bytes.
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"utf8"` .|n/a|
-
-##### Front Coding
-`STRING` columns can be stored using an incremental encoding strategy called front coding.
-In the Druid implementation of front coding, the column values are first divided into buckets,
-and the first value in each bucket is stored as is. The remaining values in the bucket are stored
-using a number representing a prefix length and the remaining suffix bytes.
-This technique allows the prefix portion of the values in each bucket from being duplicated.
-The values are still UTF-8 encoded, but front coding can often result in much smaller segments at very little
-performance cost. Segments created with this encoding are not compatible with Druid versions older than 25.0.0.
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"frontCoded"` .|n/a|
-|bucketSize|The number of values to place in a bucket to perform delta encoding, must be a power of 2, maximum is 128. Larger buckets allow columns with a high degree of overlap to produce smaller segments at a slight cost to read and search performance which scales with bucket size.|4|
+To enable front coding with SQL-based ingestion, define an `indexSpec` in a query context. See [SQL-based ingestion reference](../multi-stage-query/reference.md#context-parameters) for more information.
+> Front coding is new to Druid 25.0, so the current recommendation is to enable it in a staging environment and fully test your use case before using it in production. Segments created with front coding enabled are not compatible with Druid versions older than 25.0.
Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each [ingestion method](./index.md#ingestion-methods) for details.
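Editor's note: combining the properties from the table above, an `indexSpec` that enables front coding might look like the following sketch (all values other than `stringDictionaryEncoding` are the documented defaults):

```json
{
  "bitmap": { "type": "roaring" },
  "dimensionCompression": "lz4",
  "stringDictionaryEncoding": { "type": "frontCoded", "bucketSize": 4 },
  "metricCompression": "lz4",
  "longEncoding": "longs",
  "jsonCompression": "lz4"
}
```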
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]