[GitHub] [druid] clintropolis opened a new pull request, #13653: allow using nested column indexer for schema discovery

GitBox Tue, 10 Jan 2023 00:01:27 -0800


clintropolis opened a new pull request, #13653:
URL: https://github.com/apache/druid/pull/13653


   ### Description
   This PR introduces a new experimental mode for schema discovery, powered by the 'nested' column indexer. To accompany this, the nested column selectors change behavior when a column consists of a single typed 'root' literal (so no nested data): in that case the nested column mimics the column type of the root literal.
   
   The result is a schema discovery mode that can produce columns of the correct type, rather than typing every column as `STRING` as the current schemaless behavior does. Like existing schemaless ingestion, the `timestampSpec` must still be defined; perhaps future enhancements could add automatic time column selection. Also, in this PR all discovered columns are written out as full nested columns; a future PR will add optimizations to the nested column serializer to only store what is necessary.
   
   I think the most compelling use case for this is with streaming ingestion, 
since it allows for effortless support of schema evolution.
   
   #### Example
   For example, imagine I have a Kafka topic, `schemafree`. With the changes in this PR, we can define a very minimal ingestion spec:
   
   ```json
   {
     "type": "kafka",
     "spec": {
       "ioConfig": {
         "type": "kafka",
         "consumerProperties": {
           "bootstrap.servers": "localhost:9092"
         },
         "topic": "schemafree",
         "inputFormat": {
           "type": "json"
         }
       },
       "tuningConfig": {
         "type": "kafka",
         "appendableIndexSpec": {
           "type": "onheap",
           "useNestedColumnIndexerForSchemaDiscovery": true
         }
       },
       "dataSchema": {
         "dataSource": "schemafree",
         "timestampSpec": {
           "column": "time",
           "format": "iso"
         },
         "dimensionsSpec":{}
       }
     }
   }
   ```
   
   If we send a first batch of events to our topic:
   
   ```json
   {"time":"2023-01-07T00:00:00Z", "some_long":1234, "some_double":1.23, "some_string":"a", "some_variant":"a"}
   {"time":"2023-01-07T01:00:00Z", "some_long":5678, "some_double":4.56, "some_string":"b", "some_variant":1}
   {"time":"2023-01-07T01:10:00Z", "some_long":1111, "some_string":"c", "some_variant":2.2}
   {"time":"2023-01-07T01:20:00Z", "some_double":11.11, "some_variant":1}
   ```
   
   `useNestedColumnIndexerForSchemaDiscovery` set on `appendableIndexSpec` of 
the `tuningConfig` tells the `IncrementalIndex` to use a `NestedColumnIndexer` 
for any discovered dimensions instead of a `StringDimensionIndexer`. The new 
mimic behavior of nested column selectors then allows queries to see these 
discovered columns as their correct type:
   
   <img width="1215" alt="Screen Shot 2023-01-09 at 10 46 21 PM" src="https://user-images.githubusercontent.com/1577461/211491200-8414de07-aed6-4cf2-b8ea-046d9da3a5fd.png">
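
As a rough illustration of the "mimic" idea (not Druid code), the Python sketch below reads the first batch of events and reports a single literal type for each column whose values all share one type. The function names and the `NESTED` label are hypothetical. How the PR actually reports a mixed-type column such as `some_variant` is not stated here, so the sketch simply gives it a placeholder label.

```python
import json

# First batch of events from the example above.
FIRST_BATCH = [
    '{"time":"2023-01-07T00:00:00Z", "some_long":1234, "some_double":1.23, "some_string":"a", "some_variant":"a"}',
    '{"time":"2023-01-07T01:00:00Z", "some_long":5678, "some_double":4.56, "some_string":"b", "some_variant":1}',
    '{"time":"2023-01-07T01:10:00Z", "some_long":1111, "some_string":"c", "some_variant":2.2}',
    '{"time":"2023-01-07T01:20:00Z", "some_double":11.11, "some_variant":1}',
]


def literal_type(value):
    """Map a JSON value to a literal type name, or None if it is not a simple literal."""
    if isinstance(value, bool):
        return None  # booleans are not handled in this sketch
    if isinstance(value, int):
        return "LONG"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "STRING"
    return None  # objects, arrays, null


def discover_types(lines, time_column="time"):
    """Collect the literal types seen per column and pick one reported type."""
    seen = {}
    for line in lines:
        for name, value in json.loads(line).items():
            if name != time_column:
                seen.setdefault(name, set()).add(literal_type(value))
    reported = {}
    for name, types in seen.items():
        if len(types) == 1 and None not in types:
            # Single typed root literal: mimic that type.
            reported[name] = next(iter(types))
        else:
            # Placeholder label (an assumption of this sketch) for anything else.
            reported[name] = "NESTED"
    return reported


print(discover_types(FIRST_BATCH))
```

With the four events above, this reports `LONG`, `DOUBLE`, and `STRING` for `some_long`, `some_double`, and `some_string`, and the placeholder `NESTED` for `some_variant`, which mixes strings, longs, and doubles.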
   
   Adding additional events:
   
   ```json
   {"time":"2023-01-07T00:00:00Z", "other_long": 1111, "other_double":2.22, "other_string": "zz"}
   {"time":"2023-01-07T00:00:00Z", "other_long": 2222, "other_double":3.33, "other_string": "yy"}
   {"time":"2023-01-07T00:00:00Z", "other_long": 3333, "other_double":4.44, "other_string": "xx"}
   {"time":"2023-01-07T00:00:00Z", "other_long": 4444, "other_double":5.55, "other_string": "ww"}
   ```
   
   these are picked up as well:
   
   <img width="1215" alt="Screen Shot 2023-01-09 at 10 47 19 PM" src="https://user-images.githubusercontent.com/1577461/211492154-fcf3950c-1a28-4c10-8345-cf2e30879114.png">
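
The schema evolution described here can be pictured with a small Python sketch that tracks which columns are new in each batch. `ColumnTracker` is a hypothetical stand-in for dimension discovery, not a Druid class, and the batches are abbreviated versions of the events above.

```python
import json


class ColumnTracker:
    """Hypothetical stand-in for dimension discovery; not Druid code."""

    def __init__(self, time_column="time"):
        self.time_column = time_column
        self.columns = []  # columns in the order they were first seen

    def ingest(self, lines):
        """Process a batch of JSON events and return the columns it introduced."""
        new_columns = []
        for line in lines:
            for name in json.loads(line):
                if name != self.time_column and name not in self.columns:
                    self.columns.append(name)
                    new_columns.append(name)
        return new_columns


tracker = ColumnTracker()
first = tracker.ingest([
    '{"time":"2023-01-07T00:00:00Z", "some_long":1234, "some_double":1.23, "some_string":"a", "some_variant":"a"}',
    '{"time":"2023-01-07T01:20:00Z", "some_double":11.11, "some_variant":1}',
])
second = tracker.ingest([
    '{"time":"2023-01-07T00:00:00Z", "other_long": 1111, "other_double":2.22, "other_string": "zz"}',
])
print(first)
print(second)
print(tracker.columns)
```

The second batch only adds the `other_*` columns; the columns discovered from the first batch stay in place, which is why no spec change is needed when new fields show up on the topic.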
   
   The nested column selectors are the same nested literal column selectors used by the nested virtual columns that back SQL functions like `JSON_VALUE`, so the performance is approximately the same as if these were regular literal columns. We can also query them as if they were regular columns, grouping, aggregating, and so on:
   
   <img width="1217" alt="Screen Shot 2023-01-09 at 11 29 48 PM" src="https://user-images.githubusercontent.com/1577461/211492438-7a763a2c-ee65-4933-bfc9-bd584276bf07.png">
   
   #### Follow-up work
   The most important pieces to follow are:
   * improve the nested column serializer to optimize the root literal case
   * refactor the flattener machinery which provides column discovery to include 'nested' columns. Ironically, right now these are filtered out, so columns with actual nested data will not be automatically ingested even though the nested column indexer was literally built for this
   * refactor the column merging code to use something more purpose-built than `ColumnCapabilities` for segment merging, picking column handlers, etc., something like `ColumnFormat` or `ColumnShape`. For now I have added the concept of 'handler' capabilities as a crutch to allow merging to choose the nested column merger even though the column capabilities report `STRING` or `LONG` or whatever for nested columns, but going forward I think something nicer can be built; more on this later
   * I'm sure I'm forgetting other things 🙃
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   

