[GitHub] [druid] techdocsmith commented on a diff in pull request #12983: msq: add multi-stage-query docs

GitBox Wed, 31 Aug 2022 16:01:57 -0700


techdocsmith commented on code in PR #12983:
URL: https://github.com/apache/druid/pull/12983#discussion_r960085431



##########
docs/multi-stage-query/msq-known-issues.md:
##########
@@ -0,0 +1,99 @@
+---
+id: known-issues
+title: SQL-based ingestion known issues
+sidebar_label: Known issues
+---
+
+> SQL-based ingestion using the multi-stage query task engine is our 
recommended solution starting in Druid 24.0. Alternative ingestion solutions, 
such as native batch and Hadoop-based ingestion systems, will still be 
supported. We recommend you read all [known issues](./msq-known-issues.md) and 
test the feature in a development environment before rolling it out in 
production. Using the multi-stage query task engine with `SELECT` statements 
that do not write to a datasource is experimental.
+
+## General query execution
+
+- There's no fault tolerance. If any task fails, the entire query fails. 
+
+- Only one local file system per server is used for stage output data during 
multi-stage query
+  execution. If your servers have multiple local file systems, this causes 
queries to exhaust
+  available disk space earlier than expected. 
+
+- When `msqMaxNumTasks` is higher than the total
+  capacity of the cluster, more tasks may be launched than can run at once. 
This leads to a
+  [TaskStartTimeout](./msq-reference.md#context-parameters) error code, as 
there is never enough capacity to run the query.
+  To avoid this, set `msqMaxNumTasks` to a number of tasks that can run 
simultaneously on your cluster.
+
+- When `msqTaskAssignment` is set to `auto`, the system generates one task per 
input file for certain splittable
+  input sources where file sizes are not known ahead of time. This includes 
the `http` input source, where the system
+  generates one task per URI.
+
+## Memory usage
+
+- INSERT queries can consume excessive memory when using complex types due to 
inaccurate footprint
+  estimation. This can appear as an OutOfMemoryError during the 
SegmentGenerator stage when using
+  sketches. If you run into this issue, try manually lowering the value of the
+  [`msqRowsInMemory`](./msq-reference.md#context-parameters) parameter.
+
+- EXTERN loads an entire row group into memory at once when reading from 
Parquet files. Row groups
+  can be up to 1 GB in size, which can lead to excessive heap usage when 
reading many files in
+  parallel. This can appear as an OutOfMemoryError during stages that read 
Parquet input files. If
+  you run into this issue, try using a smaller number of worker tasks or 
increasing the heap
+  size of your Indexers or of your Middle Manager-launched indexing tasks.
+
+- Ingesting a very long row may consume excessive memory. If a row requires 
more memory than is available, 
+  the service might throw an OutOfMemoryError. If you run into this
+  issue, allocate enough memory to the indexer to store the largest row.
+
+## SELECT queries
+
+- SELECT query results do not include real-time data until it has been 
published.
+
+- TIMESTAMP types are formatted as numbers rather than ISO8601 timestamp
+  strings, which differs from Druid's standard result format. 
+
+- BOOLEAN types are formatted as numbers like `1` and `0` rather
+  than `true` or `false`, which differs from Druid's standard result
+  format. 
+
+- TopN is not implemented. The context parameter
+  `useApproximateTopN` is ignored and always treated as if it
+  were `false`. Therefore, topN-shaped queries will
+  always run using the groupBy engine. There is no loss of
+  functionality, but there may be a performance impact, since
+  these queries will run using an exact algorithm instead of an
+  approximate one.
+- GROUPING SETS is not implemented. Queries that use GROUPING SETS
+  will fail.
+- The numeric flavors of the EARLIEST and LATEST aggregators do not work 
properly. Attempting to use the numeric flavors of these aggregators will lead 
to an error like `java.lang.ClassCastException: class java.lang.Double cannot 
be cast to class org.apache.druid.collections.SerializablePair`. The string 
flavors, however, do work properly.
+
+## INSERT queries
+
+- The [schemaless 
dimensions](../ingestion/ingestion-spec.md#inclusions-and-exclusions)
+feature is not available. All columns and their types must be specified 
explicitly.
+
+- [Segment metadata queries](../querying/segmentmetadataquery.md)
+  on datasources ingested with the Multi-Stage Query Engine will return values 
for `timestampSpec` that are not usable
+  for introspection.
+
+- When INSERT with GROUP BY matches the criteria mentioned in [GROUP 
BY](./index.md#group-by), the multi-stage engine generates segments that 
Druid's compaction
+  functionality is not able to further roll up. This applies to automatic 
compaction as well as manually
+  issued `compact` tasks. Individual queries executed with the multi-stage 
engine always guarantee
+  perfect rollup for their output, so this only matters if you are performing 
a sequence of INSERT
+  queries that each append data to the same time chunk. If necessary, you can 
compact such data
+  using another SQL query instead of a `compact` task.
+
+- When using INSERT with GROUP BY, splitting of large partitions is not 
currently
+  implemented. If a single partition key appears in a
+  very large number of rows, an oversized segment will be created.
+  You can mitigate this by adding additional columns to your
+  partition key. Note that partition splitting _does_ work properly
+  when performing INSERT without GROUP BY.
+
+- INSERT with column lists, like
+  `INSERT INTO tbl (a, b, c) SELECT ...`, is not implemented.
+
+## EXTERN queries
+
+- EXTERN does not accept `druid` input sources.
+
+## Missing guardrails
+
+- Maximum number of input files. Since there's no 

Review Comment:
   ```suggestion
   - Maximum number of input files. Since there's no limit, the controller can 
potentially run out of memory tracking all input files.
   ```
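
To illustrate why this guardrail matters, here is a minimal client-side sketch. It is not part of Druid or of this PR: the function name `check_input_file_count` and the `MAX_INPUT_FILES` value are hypothetical, and the idea is only that a caller could refuse to submit an ingestion when the input list is larger than a chosen cap.

```python
# Hypothetical client-side guard. Per the doc, Druid itself enforces no limit
# on the number of input files, so this check lives entirely in the caller.
MAX_INPUT_FILES = 10_000  # assumed cap; choose one that suits your controller


def check_input_file_count(uris, limit=MAX_INPUT_FILES):
    """Return the URIs as a list, or raise if there are more than `limit`."""
    uris = list(uris)
    if len(uris) > limit:
        raise ValueError(
            f"{len(uris)} input files exceeds the limit of {limit}; "
            "split the ingestion into smaller batches"
        )
    return uris
```

A caller would run this over the list of files before building the EXTERN input source, and split the work into several queries if it raises.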


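For the `msqMaxNumTasks` item in the diff above, the advice to pick a value the cluster can actually run at once could be sketched as follows. This is an illustration under assumptions: `build_msq_payload` and `available_task_slots` are hypothetical names, the caller is expected to know the cluster's free task capacity, and the request body is assumed to carry `query` and `context` keys.

```python
def build_msq_payload(sql, requested_tasks, available_task_slots):
    """Build a request body whose task count never exceeds cluster capacity.

    `available_task_slots` is supplied by the caller (hypothetical input);
    this sketch does not query the cluster for it.
    """
    # Cap the requested task count so the query is not left waiting for
    # capacity that never frees up (the TaskStartTimeout case in the doc).
    capped = min(requested_tasks, available_task_slots)
    return {
        "query": sql,
        "context": {"msqMaxNumTasks": capped},  # parameter name as used in this doc
    }
```

For example, requesting 10 tasks on a cluster with 4 free slots yields a context with `msqMaxNumTasks` set to 4.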

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]