abhishekrb19 commented on code in PR #18231: URL: https://github.com/apache/druid/pull/18231#discussion_r2220966986
########## docs/release-info/release-notes.md: ########## @@ -57,6 +57,125 @@ For tips about how to write a good release note, see [Release notes](https://git This section contains important information about new and existing features. +### Hadoop-based ingestion + +Hadoop-based ingestion has been deprecated since Druid 32.0. You must now opt-in to using the deprecated `index_hadoop` task type. If you don't do this, your Hadoop-based ingestion tasks will fail. + +To opt-in, set `druid.indexer.task.allowHadoopTaskExecution` to `true` in your `common.runtime.properties` file. + +[#18239](https://github.com/apache/druid/pull/18239) + +### Use SET statements for query context parameters + +You can now use SET statements to define query context parameters for a query through the [Druid console](#set-statements-in-the-druid-console) or the [API](#set-statements-with-the-api). + +[#17894](https://github.com/apache/druid/pull/17894) [#17974](https://github.com/apache/druid/pull/17974) + +#### SET statements in the Druid console + +The web console now supports using SET statements to specify query context parameters. For example, if you include `SET timeout = 20000;` in your query, the timeout query context parameter is set: + +```sql +SET timeout = 20000; +SELECT "channel", "page", sum("added") from "wikipedia" GROUP BY 1, 2 +``` + +[#17966](https://github.com/apache/druid/pull/17966) + +#### SET statements with the API + +SQL queries issued to `/druid/v2/sql` can now include multiple SET statements to build up context for the final statement. 
For example, the following SQL query sets the `timeout`, `useCache`, `populateCache`, `vectorize`, and `engine` query context parameters: + +```sql +SET timeout = 20000; +SET useCache = false; +SET populateCache = false; +SET vectorize = 'force'; +SET engine = 'msq-dart'; +SELECT "channel", "page", sum("added") from "wikipedia" GROUP BY 1, 2 +``` + +A similar API call looks like the following: + +```bash +curl --location 'http://HOST:PORT/druid/v2/sql' \ +--header 'Content-Type: application/json' \ +--data '{ + "query": "SET timeout=20000; SET useCache=false; SET populateCache=false; SET engine='\''msq-dart'\'';SELECT user, commentLength,COUNT(*) AS \"COUNT\" FROM wikipedia GROUP BY 1, 2 ORDER BY 2 DESC", + "resultFormat": "array", + "header": true, + "typesHeader": true, + "sqlTypesHeader": true +}' +``` + + +This improvement also works for INSERT and REPLACE queries using the MSQ task engine. Note that JDBC isn't supported. + +#### Improved HTTP endpoints + +You can now use raw SQL in the HTTP body for `/druid/v2/sql` endpoints. You can set `Content-Type` to `text/plain` instead of `application/json`, so you can provide raw text that isn't escaped. + + [#17937](https://github.com/apache/druid/pull/17937) + +### Cloning Historicals + +You can now configure clones for Historicals using the dynamic Coordinator configuration `cloneServers`. Cloned Historicals are useful for situations such as rolling updates where you want to launch a new Historical as a replacement for an existing one. + +Set the config to a map from the target Historical server to the source Historical: + +``` + "cloneServers": {"historicalClone":"historicalOriginal"} +``` + +The clone doesn't participate in regular segment assignment or balancing. Instead, the Coordinator mirrors any segment assignment made to the original Historical onto the clone, so that the clone becomes an exact copy of the source.
Segments on the clone Historical do not count towards replica counts either. If the original Historical disappears, the clone remains in the last known state of the source server until removed from the `cloneServers` config. + +When you query your data using the native query engine, you can prefer (`preferClones`), exclude (`excludeClones`), or include (`includeClones`) clones by setting the query context parameter `cloneQueryMode`. By default, clones are excluded. + +As part of this change, new Coordinator APIs are available. For more information, see [Coordinator APIs for clones](#coordinator-apis-for-clones). + +[#17863](https://github.com/apache/druid/pull/17863) [#17899](https://github.com/apache/druid/pull/17899) [#17956](https://github.com/apache/druid/pull/17956) +### Embedded kill tasks on the Overlord (Experimental) + +You can now run kill tasks directly on the Overlord itself. Embedded kill tasks provide several benefits; they: + +- Kill segments as soon as they're eligible +- Don't take up task slots +- Finish faster since they use optimized metadata queries and don't launch a new JVM +- Kill a small number of segments per task, ensuring locks on an interval aren't held for too long +- Skip locked intervals to avoid head-of-line blocking +- Require minimal configuration +- Can keep up with a large number of unused segments in the cluster + +This feature is controlled by the following configs: + +- `druid.manager.segments.killUnused.enabled` - Whether the feature is enabled or not +- `druid.manager.segments.killUnused.bufferPeriod` - The amount of time that a segment must be unused before it can be permanently removed from metadata and deep storage. This can serve as a buffer period to prevent data loss if data ends up being needed after being marked unused. + +T use embedded kill tasks, you need to have segment metadata cache enabled. Review Comment: ```suggestion To use embedded kill tasks, you need to have segment metadata cache enabled.
``` ########## docs/release-info/release-notes.md: ########## @@ -57,6 +57,125 @@ +This feature is controlled by the following configs: + +- `druid.manager.segments.killUnused.enabled` - Whether the feature is enabled or not +- `druid.manager.segments.killUnused.bufferPeriod` - The amount of time that a segment must be unused before it can be permanently removed from metadata and deep storage. This can serve as a buffer period to prevent data loss if data ends up being needed after being marked unused. Review Comment: Do you think it's worth mentioning what the defaults for these configs are?
(false and `P30D`) ########## docs/release-info/release-notes.md: ########## @@ -65,55 +184,234 @@ This section contains detailed release notes separated by areas. #### Other web console improvements +- You can now assign tiered replications to tiers that aren't currently online [#18050](https://github.com/apache/druid/pull/18050) +- You can now filter tasks by the error in the Task view [#18057](https://github.com/apache/druid/pull/18057) +- Improved SQL autocomplete and added JSON autocomplete [#18126](https://github.com/apache/druid/pull/18126) +- Updated the web console to use the Overlord APIs instead of Coordinator APIs when managing segments, such as marking them as unused [#18172](https://github.com/apache/druid/pull/18172) + ### Ingestion +- Improved concurrency for batch and streaming ingestion tasks [#17828](https://github.com/apache/druid/pull/17828) +- Removed the `useMaxMemoryEstimates` config. When set to false, Druid used a much more accurate memory estimate that was introduced in Druid 0.23.0. That more accurate method is the only available method now. The config has defaulted to false for several releases [#17936](https://github.com/apache/druid/pull/17936) + #### SQL-based ingestion ##### Other SQL-based ingestion improvements #### Streaming ingestion +##### Multi-stream supervisors (experimental) + +You can now use more than one supervisor to ingest data into the same datasource. Use the `id` field to distinguish between supervisors ingesting into the same datasource (identified by `spec.dataSchema.dataSource` for streaming supervisors). + +When using this feature, make sure you set `useConcurrentLocks` to `true` for the `context` field in the supervisor spec. 
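As a sketch only, a trimmed Kafka supervisor spec combining these two settings could look like the following. The `id` value and datasource name are illustrative, a real spec also needs `ioConfig` and `tuningConfig`, and the placement of `context` follows the wording of the note above rather than a verified spec:

```json
{
  "type": "kafka",
  "id": "wikipedia-ingest-a",
  "spec": {
    "dataSchema": { "dataSource": "wikipedia" },
    "context": { "useConcurrentLocks": true }
  }
}
```

A second supervisor for the same datasource would reuse `spec.dataSchema.dataSource` but carry a different `id`.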
+ +[#18149](https://github.com/apache/druid/pull/18149) [#18082](https://github.com/apache/druid/pull/18082) + +##### Supervisors and the underlying input stream + +Seekable stream supervisors (Kafka, Kinesis, and Rabbit) can no longer be updated to ingest from a different input stream (such as a topic for Kafka). Because the underlying systems don't fully support such a change, these requests now result in a 400 error. + +[#17955](https://github.com/apache/druid/pull/17955) [#17975](https://github.com/apache/druid/pull/17975) + ##### Other streaming ingestion improvements + +- Improved streaming ingestion so that it automatically determines the maximum number of columns to merge [#17917](https://github.com/apache/druid/pull/17917) + + ### Querying +#### Metadata query for segments + +You can use a segment metadata query to find the list of projections attached to a segment. + +[#18119](https://github.com/apache/druid/pull/18119) + + #### Other querying improvements + +- You can now perform big decimal aggregations using the MSQ task engine [#18164](https://github.com/apache/druid/pull/18164) +- The `MV_OVERLAP` and `MV_CONTAINS` functions now align more closely with the native `inType` filter [#18084](https://github.com/apache/druid/pull/18084) +- Improved query handling when some segments are missing on Historicals. Druid no longer incorrectly returns partial results [#18025](https://github.com/apache/druid/pull/18025) Review Comment: Reworded since this is a bit more specific: ```suggestion - Improved query handling when segments are temporarily missing on Historicals but not detected by Brokers. Druid doesn't return partial results incorrectly in such cases. [#18025](https://github.com/apache/druid/pull/18025) ``` ########## docs/release-info/release-notes.md: ########## @@ -65,55 +184,234 @@
#### Metadata query for segments + +You can use a segment metadata query to find the list of projections attached to a segment. + +[#18119](https://github.com/apache/druid/pull/18119) + Review Comment: Looks like a release note for https://github.com/apache/druid/pull/17983 is missing ########## docs/release-info/release-notes.md: ########## @@ -57,6 +57,125 @@ +As part of this feature, [new metrics](#overlord-kill-task-metrics) have been added. + +[#18028](https://github.com/apache/druid/pull/18028) Review Comment: ```suggestion [#18028](https://github.com/apache/druid/pull/18028) [#18124](https://github.com/apache/druid/pull/18124) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
