kfaraz commented on code in PR #15218:
URL: https://github.com/apache/druid/pull/15218#discussion_r1371336412
##########
docs/ingestion/ingestion-spec.md:
##########

````diff
@@ -530,3 +530,11 @@ You can enable front coding with all types of ingestion. For information on defi
 
 Beyond these properties, each ingestion method has its own specific tuning properties. See the documentation for each [ingestion method](./index.md#ingestion-methods) for details.
+
+## Context
````

Review Comment:
We should skip adding this section for now, as this is an experimental feature. If we are adding a task context section to this doc, it would first need to talk about other, more important parameters.

##########
docs/development/extensions-core/kafka-supervisor-reference.md:
##########

````diff
@@ -258,4 +258,12 @@ The following table outlines the configuration options for `indexSpec`:
 |`bitmap`|Object|Compression format for bitmap indexes. Druid supports roaring and concise bitmap types.|No|Roaring|
 |`dimensionCompression`|String|Compression format for dimension columns. Choose from `LZ4`, `LZF`, `ZSTD` or `uncompressed`.|No|`LZ4`|
 |`metricCompression`|String|Compression format for primitive type metric columns. Choose from `LZ4`, `LZF`, `ZSTD`, `uncompressed` or `none`.|No|`LZ4`|
-|`longEncoding`|String|Encoding format for metric and dimension columns with type long. Choose from `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as is with 8 bytes each.|No|`longs`|
\ No newline at end of file
+|`longEncoding`|String|Encoding format for metric and dimension columns with type long. Choose from `auto` or `longs`. `auto` encodes the values using offset or lookup table depending on column cardinality, and store them with variable size. `longs` stores the value as is with 8 bytes each.|No|`longs`|
+
+## Context
````

Review Comment:
We should not add a separate section for this right now. We can do this later when the feature is more well-baked.
##########
docs/data-management/automatic-compaction.md:
##########

````diff
@@ -203,6 +203,85 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
 }
 ```
+## Concurrent append and replace
+
+:::info
+Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
+:::
+
+If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
````

Review Comment:
```suggestion
This feature allows you to safely replace the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications of this is appending new data (using, say, streaming ingestion) to an interval while compaction of that interval is already in progress.
```

##########
docs/data-management/automatic-compaction.md:
##########

````diff
@@ -203,6 +203,85 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
 }
 ```
+## Concurrent append and replace
+
+:::info
+Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
+:::
+
+If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
+
+Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
````

Review Comment:
This is not exactly correct. It doesn't make a lot of sense to "update a datasource" unless you mean adding data to a datasource. Moreover, we shouldn't even look at this as a two-step process, but rather as an opt-in behaviour. Any ingestion job that wants to run concurrently with other ingestion jobs needs to use the correct lock types. Please see the other suggestion.
##########
docs/data-management/compaction.md:
##########

````diff
@@ -43,18 +44,20 @@ By default, compaction does not modify the underlying data of the segments. Howe
 
 Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
 
-## Types of compaction
+## Choose your compaction type
````

Review Comment:
I don't think this heading aligns with the rest of the headings. Also, the type of compaction is not really much of a choice in the way that, say, the partitioning type is a choice (range, hashed, or dynamic, where we are choosing three different paths that give three different results). We should just call this `Ways to run compaction` or something in a similar vein.

##########
docs/data-management/automatic-compaction.md:
##########

````diff
@@ -203,6 +203,85 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
 }
 ```
+## Concurrent append and replace
+
+:::info
+Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
+:::
+
+If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
+
+Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
+
+Using concurrent append and replace in the following scenarios can be beneficial:
+
+- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
+- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
````
Review Comment:
```suggestion
You can enable concurrent append and replace by ensuring the following:

- The append task (with `appendToExisting` set to `true`) has `taskLockType` set to `APPEND` in the task context.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.
```

##########
docs/data-management/automatic-compaction.md:
##########

````diff
@@ -203,6 +203,85 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
 }
 ```
+## Concurrent append and replace
+
+:::info
+Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
+:::
+
+If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
+
+Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
+
+Using concurrent append and replace in the following scenarios can be beneficial:
+
+- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
+- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
+
+We do not recommend using concurrent append and replace when the job with an `APPEND` task has a coarser granularity than the job with a `REPLACE` task. For example, if the `APPEND` job has a yearly granularity and the `REPLACE` job has a monthly granularity. The job that finishes second will fail.
````

Review Comment:
This point should be in a note or warning block.
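To make the suggested settings concrete, a sketch of an append task that opts in with the `APPEND` lock type might look like the following. Only `appendToExisting` and the `taskLockType` context key come from the suggestion itself; the `index_parallel` wrapper is abbreviated for illustration, and a complete spec would also need `dataSchema` and `tuningConfig`.

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": true
    }
  },
  "context": {
    "taskLockType": "APPEND"
  }
}
```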
Two more points to call out:

```
At any point in time:
- There can only be a single task that holds a `REPLACE` lock on a given interval of a datasource.
- There may be multiple tasks that hold `APPEND` locks on a given interval of a datasource and append data to that interval simultaneously.
```
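As a companion sketch for the replace side, a manual compaction task could opt in with the `REPLACE` lock type in its context. The datasource name and interval below are placeholders, not values from this PR, and the `ioConfig` is abbreviated:

```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2023-01-01/2023-02-01"
    }
  },
  "context": {
    "taskLockType": "REPLACE"
  }
}
```

Per the locking points above, only one such task at a time could hold the `REPLACE` lock on that interval, while multiple `APPEND` tasks could write to it concurrently.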
