jasonk000 opened a new pull request #12177:
URL: https://github.com/apache/druid/pull/12177


   ### Overview and benefits
   
   This introduces a `tree` type to the `flattenSpec`, which allows faster JSON 
parsing than the `jq` and `path` types when only a simple hierarchical lookup 
is required.
   
   For one of our high-volume ingestion tasks, this reduces middle manager CPU 
load by 25%. This task presently has 3 `root` fields and 8 additional `path` 
fields of the form 
   
   ```
   {
     "type": "path",
     "name": "task_detail",
     "expr": "$.metadata.detail"
   },
   ```
   
   Each of these `path` fields can be replaced with a hierarchical lookup as follows:
   ```
   {
     "type": "tree",
     "name": "task_detail",
     "exprs": ["metadata", "detail"]
   },
   ```
   
   In this sample, the switch to the new `tree` spec is submitted at 15.54, and 
after a moment or two total CPU utilization drops notably, from 3 cores to 
2.3 cores.
   
![image](https://user-images.githubusercontent.com/3196528/150242320-61fedc3d-3c1f-419d-a6b6-273cc47658a8.png)
   
   
   ### Implementation decision
   
   The goal is to introduce (1) the ability to provide a list of expressions 
instead of a single expression, and (2) a new `tree` type.
   ```
   {
     "type": "tree",                     <-- new type
     "name": "task_detail",
     "exprs": ["metadata", "detail"]     <-- names of JSON keys to traverse, as a List<String>
   },
   ```
   
   The crux of the change is this tree extractor for loop (below) and the 
requisite parser configuration to ensure it is called.
   ```
   public Function<JsonNode, Object> makeJsonTreeExtractor(final List<String> exprs)
   {
     String[] exprsArray = exprs.toArray(new String[0]);
     return node -> {
       JsonNode target = node;
       for (String expr : exprsArray) {
         target = target.get(expr);
         if (target == null) {
           break;
         }
       }
       return valueConversionFunction(target);
     };
   }
   ```
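
The traversal and its handling of missing keys can be tried without Jackson on the classpath. The following is a minimal sketch, not the PR's code: nested `Map`s stand in for `JsonNode`, and the class name `TreeExtractorSketch` and method `makeTreeExtractor` are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for the Jackson-based extractor: nested Maps play the role of JsonNode.
public class TreeExtractorSketch {

  // Builds a function that walks the given keys from the root, one level per key.
  public static Function<Map<String, Object>, Object> makeTreeExtractor(final List<String> exprs) {
    // Copy the keys into an array once, outside the per-record function.
    final String[] keys = exprs.toArray(new String[0]);
    return root -> {
      Object target = root;
      for (String key : keys) {
        if (!(target instanceof Map)) {
          // A missing key or a scalar where an object was expected: nothing to extract.
          return null;
        }
        target = ((Map<?, ?>) target).get(key);
      }
      return target;
    };
  }

  public static void main(String[] args) {
    Map<String, Object> metadata = new HashMap<>();
    metadata.put("detail", "compaction");
    Map<String, Object> record = new HashMap<>();
    record.put("metadata", metadata);

    Function<Map<String, Object>, Object> extractor =
        makeTreeExtractor(Arrays.asList("metadata", "detail"));
    System.out.println(extractor.apply(record));          // value stored under metadata.detail
    System.out.println(extractor.apply(new HashMap<>())); // "metadata" is absent, so null
  }
}
```

As in the loop above, a key that is absent at any level ends the walk early and yields `null` rather than throwing.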
   
   A key consideration was to avoid any backwards-compatibility issues with 
existing deployments, so I opted for a separate `exprs` parameter rather than 
overloading the `expr` parameter.
   
   ### Changes required
   
   The change adds enum entries, the field spec required to parse `List<String> 
exprs` from configuration, and validation that `exprs` is only provided for the 
`tree` type.
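
The validation rule can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual Druid field spec: the class `FieldSpecSketch`, its `Type` enum, and its constructor shape are hypothetical, and the real class also carries Jackson annotations that are omitted here.

```java
import java.util.List;

// Hypothetical, simplified field spec; names and structure are assumptions for illustration.
public class FieldSpecSketch {
  public enum Type { ROOT, PATH, JQ, TREE }

  private final Type type;
  private final String name;
  private final String expr;        // single expression, as used by `path` and `jq`
  private final List<String> exprs; // list of keys, meaningful only for `tree`

  public FieldSpecSketch(Type type, String name, String expr, List<String> exprs) {
    // The rule described above: `exprs` may only accompany the `tree` type.
    if (exprs != null && type != Type.TREE) {
      throw new IllegalArgumentException(
          "Cannot have 'exprs' for field '" + name + "' of type " + type);
    }
    this.type = type;
    this.name = name;
    this.expr = expr;
    this.exprs = exprs;
  }

  public Type getType() { return type; }
  public String getName() { return name; }
  public String getExpr() { return expr; }
  public List<String> getExprs() { return exprs; }
}
```

Keeping `expr` and `exprs` as separate fields means existing `path` and `jq` specs pass through the constructor unchanged.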
   
   It then ties everything into the `ObjectFlatteners` and `JSONFlattenerMaker` 
with a new `makeJsonTreeExtractor` that walks the keys in order, treating each 
one as a level of the hierarchy in the dataset.
   
   #### Supporting changes
   
   - Update many tests to explicitly verify that the parameters are loaded.
   - Update `docs/ingestion-data-formats.md` to include a description of the 
`tree` type and the `exprs` parameter.
   - Update Avro, Orc, and Parquet flatteners to acknowledge `tree`, but only 
with a depth of 1 (guidance here would be welcome).
   - Add benchmarks to `FlattenJSONBenchmark` to confirm improvements in the 
flattener layer.
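
For the depth-1 restriction mentioned in the Avro, Orc, and Parquet item above, one possible shape is sketched below. This is an assumption-heavy illustration: the class `DepthOneTreeSketch` is hypothetical, a flat `Map` stands in for the format-specific record type, and rejecting longer key lists with an exception is an assumed behavior that the description above does not specify.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a `tree` extractor limited to a single level.
public class DepthOneTreeSketch {

  public static Function<Map<String, Object>, Object> makeTreeExtractor(final List<String> exprs) {
    // Assumption: anything other than exactly one key is rejected up front.
    if (exprs.size() != 1) {
      throw new UnsupportedOperationException(
          "tree supports only a single key for this format, got " + exprs.size());
    }
    final String key = exprs.get(0);
    // With one key, the lookup is a plain top-level field access.
    return record -> record.get(key);
  }
}
```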
   
   ### Additional supporting data
   
   On flamegraphs, `Flattener`-related code makes up ~49% of the 
`task-runner-xx` thread's utilisation before the change; after the change, only 
~16% of the `task-runner-xx` thread is spent running `Flattener`-related code.
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] N/A added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] N/A added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


