clintropolis opened a new pull request, #13672:
URL: https://github.com/apache/druid/pull/13672

   ### Description
   Following up on #13653, this PR improves the flattener machinery to allow 
discovering nested columns when using Druid schemaless ingestion powered by the 
nested column indexer for discovered columns.
   
   Effectively, whenever 
   ```
    "tuningConfig": {
   ...
         "appendableIndexSpec": {
           "type": "onheap",
   ...
           "useNestedColumnIndexerForSchemaDiscovery": true
         }
   ...
       }
   ```
   is set, this value is pushed down to the `FlattenerMaker` implementations 
that power column discovery. `InputRowSchema` has a new property, 
`discoverNestedColumns`, which is set to true whenever the tuning config 
`useNestedColumnIndexerForSchemaDiscovery` is set to true.
   
   `InputEntityReader` implementations can then feed this `InputRowSchema` 
value into `FlattenerMaker.create`, and the `FlattenerMaker` interface itself 
has been updated to accept this flag when discovering the columns:
   ```java
       Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields);
   ```
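   To illustrate the push-down described above, here is a minimal sketch of 
how a reader might forward the schema flag on each discovery call. 
`FlattenerMakerSketch`, `SchemaSketch`, `shouldDiscoverNestedColumns`, 
`discover`, and `toyMaker` are hypothetical stand-ins written for this example; 
they are not the actual Druid classes, and only the two-argument 
`discoverRootFields` shape is taken from this PR.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class DiscoveryFlagPlumbingSketch
{
  /** Simplified stand-in for the extension point; only the method changed by this PR is shown. */
  interface FlattenerMakerSketch<T>
  {
    Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields);
  }

  /** Hypothetical stand-in for InputRowSchema, reduced to the one new property. */
  static final class SchemaSketch
  {
    private final boolean discoverNestedColumns;

    SchemaSketch(boolean discoverNestedColumns)
    {
      this.discoverNestedColumns = discoverNestedColumns;
    }

    // Accessor name is an assumption made for this sketch.
    boolean shouldDiscoverNestedColumns()
    {
      return discoverNestedColumns;
    }
  }

  /** What a reader might do: forward the schema's flag on every discovery call. */
  static <T> Iterable<String> discover(SchemaSketch schema, FlattenerMakerSketch<T> maker, T row)
  {
    return maker.discoverRootFields(row, schema.shouldDiscoverNestedColumns());
  }

  /** Toy maker that only reports which mode it was asked to run in. */
  static FlattenerMakerSketch<Map<String, Object>> toyMaker()
  {
    return (row, nested) -> Collections.singletonList(nested ? "nested-discovery" : "flat-discovery");
  }

  public static void main(String[] args)
  {
    Map<String, Object> row = new HashMap<>();
    System.out.println(discover(new SchemaSketch(false), toyMaker(), row));
    System.out.println(discover(new SchemaSketch(true), toyMaker(), row));
  }
}
```

   The point of the sketch is only that the flag travels with the schema, so 
individual readers do not need their own configuration for it.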
   
   Tagging with release notes since it makes changes to the `FlattenerMaker` 
which is marked with `@ExtensionPoint`.
   
   This PR also adds a set of integration tests that run schemaless ingestion 
with `useNestedColumnIndexerForSchemaDiscovery` set to true across a variety of 
input formats. These tests do not actually exercise the changes in this PR, 
since the batch test data contains no nested data, but they do at least cover 
strings and numbers. I plan to add streaming integration tests in the future 
once the streaming tests are moved over to the new integration framework. Since 
that data is generated, I should be able to add some nested structure and 
provide integration test coverage for the full set of functionality.
   
   #### Release note
   Extension point `FlattenerMaker` method `discoverRootFields` has had its 
signature changed from
   
   ```java
       Iterable<String> discoverRootFields(T obj);
   ```
    to 
   ```java
       Iterable<String> discoverRootFields(T obj, boolean discoverNestedFields);
   ```
   Any extension implementing this interface will not be compatible with Druid 
26.0.0+ until it is updated to implement the new signature. Implementors should 
treat `discoverNestedFields` set to true as requesting the entire set of 
top-level column names from an input, instead of the traditional behavior, which 
expects this set to be filtered to simple flat literal types.
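   For extension authors, the following is a rough sketch of how an 
implementation over `Map`-based rows might honor the flag. It is not code from 
this PR: the class `RootFieldDiscoverySketch`, the helper `isFlatLiteral`, and 
`sampleRow` are hypothetical, and treating strings, numbers, and booleans as 
the "simple flat literal types" is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a FlattenerMaker over Map-based rows; not Druid code. */
public class RootFieldDiscoverySketch
{
  /** Mirrors the new two-argument shape of discoverRootFields. */
  static Iterable<String> discoverRootFields(Map<String, Object> obj, boolean discoverNestedFields)
  {
    List<String> fields = new ArrayList<>();
    for (Map.Entry<String, Object> entry : obj.entrySet()) {
      // With the flag set, every top-level name is reported, nested or not.
      // Without it, only flat literal values are kept (the traditional behavior).
      if (discoverNestedFields || isFlatLiteral(entry.getValue())) {
        fields.add(entry.getKey());
      }
    }
    return fields;
  }

  /** Assumed definition of a simple flat literal: a string, number, or boolean. */
  static boolean isFlatLiteral(Object value)
  {
    return value instanceof String || value instanceof Number || value instanceof Boolean;
  }

  /** Sample row with two flat fields and one nested object, in insertion order. */
  static Map<String, Object> sampleRow()
  {
    Map<String, Object> nested = new LinkedHashMap<>();
    nested.put("env", "prod");

    Map<String, Object> row = new LinkedHashMap<>();
    row.put("host", "a");
    row.put("latency", 12);
    row.put("tags", nested);
    return row;
  }

  public static void main(String[] args)
  {
    System.out.println(discoverRootFields(sampleRow(), false)); // [host, latency]
    System.out.println(discoverRootFields(sampleRow(), true));  // [host, latency, tags]
  }
}
```

   A real implementation would apply the same branch to its own row type (for 
example a JSON node or an Avro record) rather than to a `Map`.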
   
   
   <hr>
   
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   

