gianm opened a new pull request, #12918: URL: https://github.com/apache/druid/pull/12918
### Functionality

This patch adds the final pieces needed to run SQL SELECT, INSERT, and REPLACE statements as multi-stage queries using indexing service tasks. This is enough to provide an initial implementation of #12262.

The functionality in this patch is provided as an extension. When the extension is loaded:

- Broker endpoint `/druid/v2/sql/task` is added, and accepts SQL SELECT, INSERT, and REPLACE statements. The EXTERN function is usable through this endpoint.
- Task types `query_controller` and `query_worker` are added. These provide the runtime for SQL queries submitted to the new endpoint.

### Notes for reviewers

Developer notes are at: https://github.com/gianm/druid/blob/multi-stage-query-tasks/extensions-core/multi-stage-query/README.md. Please refer to this doc for a description of the key classes and concepts.

### Try this at home

You can try this out yourself. First, build Druid from this branch and run `bin/start-micro-quickstart`. Then, save the following content to `insert.json`. (Human-readable version: https://gist.github.com/gianm/9183b1fdecdb3e9894a344c40bcb7d4a.)
```json
{
  "query": "REPLACE INTO \"kttm_simple\"\nOVERWRITE ALL\nWITH kttm_data AS (\nSELECT * FROM TABLE(\n EXTERN(\n '{\"type\":\"http\",\"uris\":[\"https://static.imply.io/data/kttm/kttm-v2-2019-08-25.json.gz\"]}',\n '{\"type\":\"json\"}',\n '[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"agent_category\",\"type\":\"string\"},{\"name\":\"agent_type\",\"type\":\"string\"},{\"name\":\"browser\",\"type\":\"string\"},{\"name\":\"browser_version\",\"type\":\"string\"},{\"name\":\"city\",\"type\":\"string\"},{\"name\":\"continent\",\"type\":\"string\"},{\"name\":\"country\",\"type\":\"string\"},{\"name\":\"version\",\"type\":\"string\"},{\"name\":\"event_type\",\"type\":\"string\"},{\"name\":\"event_subtype\",\"type\":\"string\"},{\"name\":\"loaded_image\",\"type\":\"string\"},{\"name\":\"adblock_list\",\"type\":\"string\"},{\"name\":\"forwarded_for\",\"type\":\"string\"},{\"name\":\"language\",\"type\":\"string\"},{\"name\":\"number\",\"type\":\"long\"},{\"name\":\"os\",\"type\":\"string\"},{\"name\":\"path\",\"type\":\"string\"},{\"name\":\"platform\",\"type\":\"string\"},{\"name\":\"referrer\",\"type\":\"string\"},{\"name\":\"referrer_host\",\"type\":\"string\"},{\"name\":\"region\",\"type\":\"string\"},{\"name\":\"remote_address\",\"type\":\"string\"},{\"name\":\"screen\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"long\"},{\"name\":\"timezone\",\"type\":\"string\"},{\"name\":\"timezone_offset\",\"type\":\"long\"},{\"name\":\"window\",\"type\":\"string\"}]'\n )\n))\n\nSELECT\n FLOOR(TIME_PARSE(\"timestamp\") TO MINUTE) AS __time,\n session,\n agent_category,\n agent_type,\n browser,\n browser_version,\n MV_TO_ARRAY(\"language\") AS \"language\",\n os,\n city,\n country,\n forwarded_for AS ip_address\nFROM kttm_data\nWHERE os = 'iOS'\nPARTITIONED BY HOUR\nCLUSTERED BY browser, session"
}
```

Run the curl command:

```
curl -H'Content-Type: application/json' --data-binary @insert.json \
  http://localhost:8888/druid/v2/sql/task
```

You'll get a task ID back. When the task finishes, the datasource `kttm_simple` will have been created.

### Current state of the extension

Future PRs (by other people 🙂) will add a web console UI and documentation, rounding out the feature. Once this patch, the web console changes, and the docs changes are in place, I believe SQL INSERT and REPLACE functionality will be ready for prime time, with one caveat: fault tolerance isn't yet implemented, so if one worker task fails, the entire query fails. I think this is OK for many batch ingest use cases, but of course it isn't OK for all of them. Fault tolerance is one of the areas we'll want to work on next.

SQL SELECT functionality has a couple of additional caveats, beyond the lack of fault tolerance. One is that results are stored in the task report, which isn't partitioned, so it becomes a bottleneck for very large result sets. Another is that there is no caching for Druid segments, so they must be downloaded anew on each query. But the SELECT functionality _does work_! Try it; it's fun. I like doing COUNT DISTINCT queries with `useApproximateCountDistinct: false` and topN-style queries with `useApproximateTopN: false` on high-cardinality columns. These queries generate a ton of intermediate data, and are a good exercise for the multi-stage system.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@druid.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
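As a small addendum to the "Try this at home" steps: once the curl command returns a task ID, you can watch for completion programmatically using Druid's standard Overlord task-status API (`GET /druid/indexer/v1/task/{taskId}/status`), which the micro-quickstart router on port 8888 proxies. This is a hedged sketch, not part of the patch; the helper names are illustrative, and the exact status-field key in the response has varied across Druid versions.

```python
# Sketch: poll a Druid task until it finishes, via the standard Overlord
# status API proxied through the router. Helper names are illustrative;
# they are not from the Druid codebase or this patch.
import json
import time
import urllib.request

ROUTER = "http://localhost:8888"  # micro-quickstart router


def parse_task_status(body: str) -> str:
    """Pull the status code (e.g. RUNNING, SUCCESS, FAILED) out of a
    /druid/indexer/v1/task/{taskId}/status response body."""
    status = json.loads(body)["status"]
    # Recent Druid versions report "statusCode"; older responses used "status".
    return status.get("statusCode") or status["status"]


def wait_for_task(task_id: str, poll_seconds: float = 5.0) -> str:
    """Block until the task leaves RUNNING, returning its final status."""
    url = f"{ROUTER}/druid/indexer/v1/task/{task_id}/status"
    while True:
        with urllib.request.urlopen(url) as resp:
            code = parse_task_status(resp.read().decode())
        if code != "RUNNING":
            return code
        time.sleep(poll_seconds)
```

When `wait_for_task` returns `SUCCESS` for the REPLACE task above, the `kttm_simple` datasource should be queryable (after segment handoff completes).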
