gianm opened a new pull request, #12918:
URL: https://github.com/apache/druid/pull/12918

   ### Functionality
   
   This patch adds the final pieces to be able to run SQL SELECT, INSERT, and 
REPLACE statements as multi-stage queries using indexing service tasks. This is 
enough to provide an initial implementation of #12262. 
   
   The functionality in this patch is provided as an extension. When the 
extension is loaded:
   
   - Broker endpoint `/druid/v2/sql/task` is added, and accepts SQL SELECT, 
INSERT, and REPLACE statements. The EXTERN function is usable through this 
endpoint.
   - Task types `query_controller` and `query_worker` are added. These provide 
the runtime for SQL queries submitted to the new endpoint.
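
As a sketch of the request shape: the new endpoint accepts the same JSON body as the existing `/druid/v2/sql` endpoint, with the SQL text under `query`. The table and query below are hypothetical examples, not taken from this patch:

```python
import json

# Hypothetical SELECT payload for the new Broker endpoint. The body has the
# same shape as a /druid/v2/sql request: the SQL text under "query".
request_body = {
    "query": "SELECT browser, COUNT(*) AS cnt FROM kttm_simple GROUP BY browser"
}

payload = json.dumps(request_body)
print(payload)

# Submit it with, for example:
#   curl -H'Content-Type: application/json' --data-binary @select.json \
#     http://localhost:8888/druid/v2/sql/task
```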
   
   ### Notes for reviewers
   
   Developer notes are at: 
https://github.com/gianm/druid/blob/multi-stage-query-tasks/extensions-core/multi-stage-query/README.md.
 Please refer to this doc for a description of the key classes and concepts.
   
   ### Try this at home
   
   You can try this out yourself. First, build Druid from this branch and run 
`bin/start-micro-quickstart`. Then, save the following content to 
`insert.json`. (Human-readable version: 
https://gist.github.com/gianm/9183b1fdecdb3e9894a344c40bcb7d4a).
   
   ```json
   {
     "query": "REPLACE INTO \"kttm_simple\"\nOVERWRITE ALL\nWITH kttm_data AS 
(\nSELECT * FROM TABLE(\n  EXTERN(\n    
'{\"type\":\"http\",\"uris\":[\"https://static.imply.io/data/kttm/kttm-v2-2019-08-25.json.gz\"]}',\n
    '{\"type\":\"json\"}',\n    
'[{\"name\":\"timestamp\",\"type\":\"string\"},{\"name\":\"agent_category\",\"type\":\"string\"},{\"name\":\"agent_type\",\"type\":\"string\"},{\"name\":\"browser\",\"type\":\"string\"},{\"name\":\"browser_version\",\"type\":\"string\"},{\"name\":\"city\",\"type\":\"string\"},{\"name\":\"continent\",\"type\":\"string\"},{\"name\":\"country\",\"type\":\"string\"},{\"name\":\"version\",\"type\":\"string\"},{\"name\":\"event_type\",\"type\":\"string\"},{\"name\":\"event_subtype\",\"type\":\"string\"},{\"name\":\"loaded_image\",\"type\":\"string\"},{\"name\":\"adblock_list\",\"type\":\"string\"},{\"name\":\"forwarded_for\",\"type\":\"string\"},{\"name\":\"language\",\"type\":\"string\"},{\"name\":\"number\",\"type\":\"long\"},{\"name\":\"os\"
 
,\"type\":\"string\"},{\"name\":\"path\",\"type\":\"string\"},{\"name\":\"platform\",\"type\":\"string\"},{\"name\":\"referrer\",\"type\":\"string\"},{\"name\":\"referrer_host\",\"type\":\"string\"},{\"name\":\"region\",\"type\":\"string\"},{\"name\":\"remote_address\",\"type\":\"string\"},{\"name\":\"screen\",\"type\":\"string\"},{\"name\":\"session\",\"type\":\"string\"},{\"name\":\"session_length\",\"type\":\"long\"},{\"name\":\"timezone\",\"type\":\"string\"},{\"name\":\"timezone_offset\",\"type\":\"long\"},{\"name\":\"window\",\"type\":\"string\"}]'\n
  )\n))\n\nSELECT\n  FLOOR(TIME_PARSE(\"timestamp\") TO MINUTE) AS __time,\n  
session,\n  agent_category,\n  agent_type,\n  browser,\n  browser_version,\n  
MV_TO_ARRAY(\"language\") AS \"language\",\n  os,\n  city,\n  country,\n  
forwarded_for AS ip_address\nFROM kttm_data\nWHERE os = 'iOS'\nPARTITIONED BY 
HOUR\nCLUSTERED BY browser, session"
   }
   ```
   
   Run the curl command:
   
   ```
   curl -H'Content-Type: application/json' --data-binary @insert.json \
     http://localhost:8888/druid/v2/sql/task
   ```
   
   You'll get a task ID back. When the task finishes, the datasource 
`kttm_simple` will have been created.
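
If you want to watch the task run, its status is available from the standard Overlord task status API, reachable through the Router. A minimal sketch (the task ID below is a placeholder; use the one returned by the submit call):

```python
# Build the status URL for a submitted task. Task status is served by the
# Overlord's task API, here reached via the Router on port 8888.
# "query-abc123" is a hypothetical placeholder task ID.
task_id = "query-abc123"
status_url = f"http://localhost:8888/druid/indexer/v1/task/{task_id}/status"
print(status_url)
# Poll with: curl "<status_url>" until the status reaches SUCCESS or FAILED.
```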
   
   ### Current state of the extension
   
   Future PRs (by other people 🙂) will add a web console UI and documentation, 
rounding out the feature.
   
   Once this patch and the web console and docs changes are in place, I believe 
SQL INSERT and REPLACE functionality would be ready for prime time with one 
caveat. Fault tolerance isn't yet implemented, so if one worker task fails, the 
entire query fails. I think this is OK for many batch ingest use cases, but of 
course isn't OK for all of them. Fault tolerance is one of the areas we'll want 
to work on next.
   
   SQL SELECT functionality has a couple of additional caveats, beyond the lack 
of fault tolerance. One is that results are stored in the task report, which
isn't partitioned, so it becomes a bottleneck for very large result sets.
Another is that there is no caching for Druid segments, so they must be
downloaded anew on each query. But the SELECT functionality _does work_! Try
it; it's fun. I like doing COUNT DISTINCT queries with 
`useApproximateCountDistinct: false` and topN-style queries with 
`useApproximateTopN: false` on high-cardinality columns. These queries generate 
a ton of intermediate data, and are a good exercise for the multi-stage system.
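
For example, an exact distinct count can be requested by setting `useApproximateCountDistinct` to false in the query context of the request body. The datasource and column come from the ingestion example above; the rest is a hypothetical sketch:

```python
import json

# Hypothetical exact-count-distinct query against the kttm_simple datasource
# created by the REPLACE example. useApproximateCountDistinct is a standard
# Druid SQL context parameter; false forces an exact (grouping-based) plan,
# which generates lots of intermediate data for high-cardinality columns.
request_body = {
    "query": "SELECT COUNT(DISTINCT session) FROM kttm_simple",
    "context": {"useApproximateCountDistinct": False}
}

print(json.dumps(request_body))
# POST this to http://localhost:8888/druid/v2/sql/task, as with insert.json.
```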


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

