clintropolis opened a new pull request, #13568: URL: https://github.com/apache/druid/pull/13568
### Description

This PR updates the nested column storage format to store nested fields in files named by their position in the fields list, rather than by a stringified 'jq' syntax version of the path itself. This fixes an issue with nested data whose paths contain newlines or commas: such paths break the `meta.smoosh` CSV file, which stores the offsets of all "files" contained within a smoosh.

This bumps the format version to 4; a "v3" version of the column has been left in place to read segments that have already been written.

After the changes in this patch, the `meta.smoosh` looks something like this for a nested column:

```
...
agent,0,12702347,12702715
agent.__doubleDictionary,0,12709430,12709436
agent.__field_0,0,16298575,16814042
agent.__field_1,0,16814042,17816369
agent.__field_2,0,17816369,18124530
agent.__field_3,0,18124530,18804682
agent.__field_4,0,18804682,19361484
agent.__field_5,0,19361484,19635903
agent.__longDictionary,0,12709424,12709430
agent.__raw,0,16298570,16298575
agent.__raw_compressed,0,14674079,16298570
agent.__raw_offsets,0,12709436,14674079
agent.__stringDictionary,0,12702715,12709424
...
```

There is no user-facing aspect of this change, other than no longer writing out broken segments whenever a comma or newline is present in any of the paths.

I have also updated `StructuredDataProcessor` to no longer directly create a stringified form of the field path, instead using `NestedPathPart` to be consistent with all of the other places that abstractly handle nested data. While I was here, I also sorted the output of `JSON_PATHS` so that it produces a stable result, which makes testing a bit easier.

I'm not sure of the best way to write tests that ensure we can still read the 'v3' format. I suppose I could construct a segment by hand to make sure the reader can still find the nested files, but I haven't done this yet.

<hr>

This PR has:

- [ ] been self-reviewed.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [x] been tested in a test Druid cluster.
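
For readers unfamiliar with the `meta.smoosh` problem described above, here is a minimal sketch of why position-based names avoid it. This is not the code in this patch: the class `SmooshNamingSketch`, its helper methods, the naive comma-splitting reader, and the simplified path-based naming are all hypothetical and exist only for illustration. The `name,fileNum,start,end` line shape and the `__field_N` suffix are taken from the excerpt in the description.

```java
// Illustrative only: hypothetical helpers, not Druid's actual serializer or smoosh reader.
class SmooshNamingSketch {

    // Old-style idea (simplified assumption): the internal file name embeds the stringified path.
    static String pathBasedName(String column, String stringifiedPath) {
        return column + "." + stringifiedPath;
    }

    // New-style idea: the internal file name embeds the field's position in the fields list.
    static String positionBasedName(String column, int fieldIndex) {
        return column + ".__field_" + fieldIndex;
    }

    // Builds one entry shaped like the meta.smoosh excerpt: name,fileNum,start,end
    static String metaLine(String name, int fileNum, long start, long end) {
        return name + "," + fileNum + "," + start + "," + end;
    }

    // A naive reader (assumption): one entry per line, exactly four comma-separated columns.
    static boolean parsesCleanly(String line) {
        return !line.contains("\n") && line.split(",").length == 4;
    }

    public static void main(String[] args) {
        // A path containing a comma adds an extra column to the entry.
        String withComma = metaLine(pathBasedName("agent", "x,y"), 0, 100L, 200L);
        // A path containing a newline splits the entry across two lines.
        String withNewline = metaLine(pathBasedName("agent", "a\nb"), 0, 200L, 300L);
        // Position-based names contain only the column name, a fixed prefix, and digits.
        String byPosition = metaLine(positionBasedName("agent", 0), 0, 100L, 200L);

        System.out.println("comma path parses cleanly:   " + parsesCleanly(withComma));
        System.out.println("newline path parses cleanly: " + parsesCleanly(withNewline));
        System.out.println("position name parses cleanly: " + parsesCleanly(byPosition));
    }
}
```

In this sketch the field path never reaches the CSV entry, so whatever characters the path contains cannot affect how the entry is split.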
