clintropolis commented on PR #12753: URL: https://github.com/apache/druid/pull/12753#issuecomment-1182535667
> Is it intentionally undocumented in this PR? Do you plan to add documentation?

I was planning to add documentation in a follow-up PR, since I thought this one was already big enough 😅

> Are there any impediments to maintaining forwards compatibility of the storage format, such that new versions of Druid will always be able to read JSON columns written by older versions? Do you foresee any reason we might want to break compatibility?

I modeled the column after existing Druid columns, so most things are decorated with a version byte (a generic sketch of this pattern appears further down in this comment), which should allow us to make changes in the future while still being able to read the existing data. The specific list of what is versioned:

* `NestedDataColumnSerializer` for the complex column itself (currently on v3, actually; I removed the reader code for older versions from prototyping to get rid of dead code)
* `GlobalDictionaryEncodedFieldColumnWriter`, which writes the nested columns and currently re-uses `DictionaryEncodedColumnPartSerde.VERSION` (I should probably decouple this at some point in the future...)
* `FixedIndexed` (a building block used to store the local-to-global dictionary mapping and the long and double value dictionaries)
* `CompressedVariableSizedBlobColumnSerializer` (used to compress the raw data)
* `CompressedBlockSerializer` (used internally by `CompressedVariableSizedBlobColumnSerializer`)

In the "Future work" section of #12695 I mention the storage format as an area we can iterate on in the future. The biggest things I have in mind right now are storing arrays of literal values as array-typed columns, instead of broken out as they currently are, along with customization such as skipping building indexes on certain nested fields, or skipping storing them altogether. Nothing about the current code should block this AFAIK, nor should those future enhancements interfere with our ability to read data stored with the current versions, so long as we practice good version hygiene whenever we make changes.

> Would you recommend we present this feature in its current state as experimental or production-ready, & why?

This is a hard one to answer. I am hesitant to call it production-ready right from the start, and I think the answer might vary a bit per use case. The surface area here is huge, since it essentially provides all of the normal Druid column functionality within these `COMPLEX<json>` columns, and I definitely won't claim this to be bug free. That said, quite a lot of internal testing has been done at this point, even at scale and with complicated nested schemas, which has allowed this codebase to be iterated on to get it to the place it currently is. There are some rough spots which I'm looking to improve in the near future, such as ingest-time memory footprint and better array handling, but I think if we get the documentation into good enough shape and list out the limitations, it could be used today. The use cases I would feel most comfortable with are replacements for what can currently be done via flattening, meaning not heavily centered on nested arrays.
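Circling back to the version-byte point from the compatibility question above, here is the generic sketch I mentioned. This is a hypothetical, simplified illustration (the names do not correspond to Druid's actual classes), just to show why old data stays readable as a format evolves:

```java
import java.nio.ByteBuffer;

// Minimal, hypothetical sketch of the version-byte pattern; these names do
// not correspond to Druid's actual classes. Writers stamp a format version
// first, and readers branch on it, so newer code can keep reading columns
// written by older versions.
public class VersionedColumnFormat
{
  private static final byte VERSION_3 = 0x03;

  public static void writeHeader(ByteBuffer buffer)
  {
    // The first byte of the serialized column identifies the format version ...
    buffer.put(VERSION_3);
    // ... followed by the version-specific payload.
  }

  public static String readHeader(ByteBuffer buffer)
  {
    // Readers dispatch on the version byte, so old layouts stay readable
    // even after the format changes, as long as the old branches are kept.
    final byte version = buffer.get();
    switch (version) {
      case VERSION_3:
        return "v3 nested column layout";
      default:
        throw new IllegalArgumentException("unknown column version: " + version);
    }
  }
}
```

The "good version hygiene" I mentioned is essentially just never removing a `case` that shipped in a release, and bumping the version whenever the payload layout changes.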
Back on arrays: I do have ideas for how to better support nested arrays, and my goal is to allow arrays extracted from nested columns to be exposed as Druid `ARRAY` types, but I am not there yet. So I'm not sure I would recommend most array use cases, unless the arrays are more like vectors with expected lengths and known/meaningful positions (such that most queries would extract specific array positions rather than entire arrays).

There is also the matter of the different performance characteristics of these columns at both ingest and query time. Ingestion-time segment merge is pretty heavy right now because the global value dictionary is stored on heap. Query performance can vary a fair bit with nested columns compared to flat columns, especially with numbers, due to the existence of indexes on these numeric columns; currently this at least sometimes results in dramatically faster, but also sometimes slower, query performance. I'm still exploring this quite a bit; besides the documentation follow-up, I have also been working on some benchmarking to see where things currently stand, and I plan on sharing those results relatively soon.

So, long story short: due to the unknowns, I think the answer for right now is that operators should experiment with `COMPLEX<json>` columns to see if they work well for their use case, use them in production if so, and otherwise provide feedback so that we can continue to make improvements and expand the use cases this is good for.
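As an aside, for anyone curious about the dictionary layout mentioned in the list above, here is a rough conceptual sketch of the local-to-global mapping that `FixedIndexed` stores for each nested field column. The class and method names are hypothetical, not Druid's actual code:

```java
import java.util.Arrays;
import java.util.List;

// Rough conceptual sketch (hypothetical names, not Druid's real code) of the
// local-to-global dictionary mapping stored for each nested field column.
public class LocalToGlobalDictionary
{
  // Global dictionary: every distinct value across all nested fields in the
  // column, shared so that equal values are stored only once.
  private final List<String> globalDictionary;

  // Per-field mapping from small, dense local ids to global dictionary ids;
  // this is the piece that FixedIndexed-style storage holds per nested field.
  private final int[] localToGlobal;

  public LocalToGlobalDictionary(List<String> globalDictionary, int[] localToGlobal)
  {
    this.globalDictionary = globalDictionary;
    this.localToGlobal = localToGlobal;
  }

  // Rows in a nested field column store local ids; resolving a value goes
  // local id -> global id -> value.
  public String lookup(int localId)
  {
    return globalDictionary.get(localToGlobal[localId]);
  }

  public static void main(String[] args)
  {
    List<String> global = Arrays.asList("bar", "baz", "foo");
    // This field only ever contains "baz" and "foo" (global ids 1 and 2).
    LocalToGlobalDictionary field = new LocalToGlobalDictionary(global, new int[]{1, 2});
    System.out.println(field.lookup(0)); // prints "baz"
  }
}
```

The heap cost of holding that shared global dictionary during segment merge is the ingest-time rough spot mentioned above.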
