clintropolis opened a new pull request, #13568:
URL: https://github.com/apache/druid/pull/13568

   ### Description
   This PR updates the nested column storage format to store nested fields in 
files named by their position in the fields list, rather than by a stringified 
'jq' syntax version of the path itself. This fixes an issue where nested data 
containing newlines or commas in a path breaks the `meta.smoosh` CSV file, 
which stores the offsets of all "files" contained within a smoosh. The format 
version is bumped to 4; a "v3" version of the column has been left in place to 
read segments that have already been written.
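
   A minimal sketch of the failure mode, assuming a simplified reader that 
splits each `meta.smoosh` line on commas. `SmooshLineSketch`, `metaLine`, and 
`parseLine` are hypothetical names, and the old-style file name below is only 
illustrative, not the exact string Druid produced.

```java
public class SmooshLineSketch {
    // Assumption: each meta.smoosh line has the shape name,fileNum,start,end.
    static String metaLine(String name, int fileNum, long start, long end) {
        return name + "," + fileNum + "," + start + "," + end;
    }

    // Hypothetical, simplified reader: split the line on commas.
    static String[] parseLine(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // Position-based name: always yields exactly four columns.
        String ok = metaLine("agent.__field_0", 0, 16298575L, 16814042L);
        // Illustrative path-based name with a comma inside the path.
        String broken = metaLine("agent..\"a,b\"", 0, 16298575L, 16814042L);

        System.out.println(parseLine(ok).length);     // prints 4
        System.out.println(parseLine(broken).length); // prints 5, offsets are misaligned
    }
}
```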
   
   After the changes in this patch, the `meta.smoosh` looks something like this 
for a nested column:
   ```
   ...
   agent,0,12702347,12702715
   agent.__doubleDictionary,0,12709430,12709436
   agent.__field_0,0,16298575,16814042
   agent.__field_1,0,16814042,17816369
   agent.__field_2,0,17816369,18124530
   agent.__field_3,0,18124530,18804682
   agent.__field_4,0,18804682,19361484
   agent.__field_5,0,19361484,19635903
   agent.__longDictionary,0,12709424,12709430
   agent.__raw,0,16298570,16298575
   agent.__raw_compressed,0,14674079,16298570
   agent.__raw_offsets,0,12709436,14674079
   agent.__stringDictionary,0,12702715,12709424
   ...
   ```
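
   The naming scheme in the listing above can be sketched as follows. This is 
an illustration only: `FieldFileNames`, `fieldFileName`, and `fileNamesFor` are 
hypothetical helpers, not the classes touched by this patch, and the example 
paths are made up.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldFileNames {
    // Builds a name such as "agent.__field_3" from the column name and the
    // field's position in the fields list.
    static String fieldFileName(String columnName, int position) {
        return columnName + ".__field_" + position;
    }

    // Maps every field path to a position-based file name. The path text
    // itself never appears in the resulting name.
    static List<String> fileNamesFor(String columnName, List<String> fieldPaths) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < fieldPaths.size(); i++) {
            names.add(fieldFileName(columnName, i));
        }
        return names;
    }

    public static void main(String[] args) {
        // Made-up paths, two of which contain characters that are unsafe in a CSV line.
        List<String> paths = List.of("$.a", "$.\"x,y\"", "$.\"line\nbreak\"");
        for (String name : fileNamesFor("agent", paths)) {
            System.out.println(name);
        }
    }
}
```

   Because the name depends only on the column name and an integer, a comma or 
newline in a path cannot leak into the `meta.smoosh` entry.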
   
   There is no user-facing aspect of this change, other than no longer writing 
out broken segments when a comma or newline is present in any of the paths.
   
   I have also updated `StructuredDataProcessor` to no longer directly create a 
stringified form of the field path, instead using `NestedPathPart` to be 
consistent with all of the other places that abstractly handle nested data. 
While I was here, I also sorted the output of `JSON_PATHS` so that it produces 
a stable result, making testing a bit easier.
   
   I'm not sure of the best way to write tests to ensure that we can still read 
the 'v3' format. I suppose I could construct one by hand to make sure the 
reader can still find nested files, but I haven't done this yet.
   
   <hr>
   
   This PR has:
   
   - [ ] been self-reviewed.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

