This is an automated email from the ASF dual-hosted git repository.
cwylie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new c2210c4e09 Update ingestion spec doc (#13329)
c2210c4e09 is described below
commit c2210c4e098a90999678ed05807f7cd59f149362
Author: Jill Osborne <[email protected]>
AuthorDate: Thu Nov 10 10:54:35 2022 +0000
Update ingestion spec doc (#13329)
* Update ingestion spec doc
* Updated
* Updated
* Update docs/ingestion/ingestion-spec.md
Co-authored-by: Clint Wylie <[email protected]>
* Updated
* Updated
Co-authored-by: Clint Wylie <[email protected]>
---
docs/ingestion/ingestion-spec.md | 27 +++++----------------------
1 file changed, 5 insertions(+), 22 deletions(-)
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index 058858292f..56f7630b23 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -477,35 +477,18 @@ The `indexSpec` object can include the following properties:
|-----|-----------|-------|
|bitmap|Compression format for bitmap indexes. Should be a JSON object with `type` set to `roaring` or `concise`. For type `roaring`, the boolean property `compressRunOnSerialization` (defaults to true) controls whether or not run-length encoding will be used when it is determined to be more space-efficient.|`{"type": "roaring"}`|
|dimensionCompression|Compression format for dimension columns. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for string typed column value dictionaries.|`{"type":"utf8"}`|
+|stringDictionaryEncoding|Encoding format for STRING value dictionaries used by STRING and COMPLEX<json> columns. <br>Example to enable front coding: `{"type":"frontCoded", "bucketSize": 4}`<br>`bucketSize` is the number of values to place in a bucket to perform delta encoding. Must be a power of 2, maximum is 128. Defaults to 4.<br>See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
|metricCompression|Compression format for primitive type metric columns. Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more efficient than `uncompressed`, but not supported by older versions of Druid).|`lz4`|
|longEncoding|Encoding format for long-typed columns. Applies regardless of whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as-is with 8 bytes each.|`longs`|
|jsonCompression|Compression format to use for nested column raw data. Options are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
+##### Front coding
-#### String Dictionary Encoding
+Starting in version 25.0, Druid can store STRING and [COMPLEX<json>](../querying/nested-columns.md) columns using an incremental encoding strategy called front coding. This allows Druid to create smaller UTF-8 encoded segments with very little performance cost.
-##### UTF8
-By default, `STRING` typed column store the values as uncompressed UTF8 encoded bytes.
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"utf8"` .|n/a|
-
-##### Front Coding
-`STRING` columns can be stored using an incremental encoding strategy called front coding.
-In the Druid implementation of front coding, the column values are first divided into buckets,
-and the first value in each bucket is stored as is. The remaining values in the bucket are stored
-using a number representing a prefix length and the remaining suffix bytes.
-This technique allows the prefix portion of the values in each bucket from being duplicated.
-The values are still UTF-8 encoded, but front coding can often result in much smaller segments at very little
-performance cost. Segments created with this encoding are not compatible with Druid versions older than 25.0.0.
-
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Must be `"frontCoded"` .|n/a|
-|bucketSize|The number of values to place in a bucket to perform delta encoding, must be a power of 2, maximum is 128. Larger buckets allow columns with a high degree of overlap to produce smaller segments at a slight cost to read and search performance which scales with bucket size.|4|
+To enable front coding with SQL-based ingestion, define an `indexSpec` in a query context. See [SQL-based ingestion reference](../multi-stage-query/reference.md#context-parameters) for more information.
+> Front coding is new to Druid 25.0, so the current recommendation is to enable it in a staging environment and fully test your use case before using it in production. Segments created with front coding enabled are not compatible with Druid versions older than 25.0.
Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each [ingestion method](./index.md#ingestion-methods) for details.
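Editor's note: combining the properties from the table above, an `indexSpec` that enables front coding might look like the following sketch (all values other than `stringDictionaryEncoding` are the documented defaults):

```json
{
  "bitmap": { "type": "roaring" },
  "dimensionCompression": "lz4",
  "stringDictionaryEncoding": { "type": "frontCoded", "bucketSize": 4 },
  "metricCompression": "lz4",
  "longEncoding": "longs",
  "jsonCompression": "lz4"
}
```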
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]