clintropolis commented on PR #12753: URL: https://github.com/apache/druid/pull/12753#issuecomment-1182535667
> Is it intentionally undocumented in this PR? Do you plan to add documentation?

I was planning to add documentation in a follow-up PR, since I thought this one was already big enough 😅

> Are there any impediments to maintaining forwards compatibility of the storage format, such that new versions of Druid will always be able to read JSON columns written by older versions? Do you foresee any reason we might want to break compatibility?

I modeled the column after existing Druid columns, so most things are decorated with a version byte (a generic sketch of this pattern appears further down in this comment), which should allow us to make changes in the future while still being able to read the existing data. The specific list of what is versioned:

* `NestedDataColumnSerializer` for the complex column itself (currently on v3, actually; I removed the reader code for older versions from prototyping to get rid of dead code)
* `GlobalDictionaryEncodedFieldColumnWriter`, which writes the nested columns and currently re-uses `DictionaryEncodedColumnPartSerde.VERSION` (I should probably decouple this at some point in the future...)
* `FixedIndexed` (a building block used to store the local-to-global dictionary mapping and the long and double value dictionaries)
* `CompressedVariableSizedBlobColumnSerializer` (used to compress the raw data)
* `CompressedBlockSerializer` (used internally by `CompressedVariableSizedBlobColumnSerializer`)

In the "Future work" section of #12695 I mention the storage format as an area we can iterate on in the future. The biggest things I have in mind right now are storing arrays of literal values as array-typed columns, instead of broken out as they currently are, along with customization such as skipping building indexes on certain nested fields, or skipping storing them altogether. Nothing about the current code should block this AFAIK, nor should those future enhancements interfere with our ability to read data stored with the current versions, so long as we practice good version hygiene whenever we make changes.

> Would you recommend we present this feature in its current state as experimental or production-ready, & why?

This is a hard one to answer. I am hesitant to call it production-ready right from the start, and I think the answer might vary a bit per use case. The surface area here is huge, since it essentially provides all of the normal Druid column functionality within these `COMPLEX<json>` columns, and I definitely won't claim this to be bug free. That said, quite a lot of internal testing has been done at this point, even at scale and with complicated nested schemas, which has allowed this codebase to be iterated on to get it to the place it currently is. There are some rough spots which I'm looking to improve in the near future, such as ingest-time memory footprint and better array handling, but I think if we get the documentation into good enough shape and list out the limitations, it could be used today. The use cases I would feel most comfortable with are replacements for what can currently be done via flattening, meaning not heavily centered on nested arrays.
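Circling back to the version-byte point from the compatibility question above, here is the generic sketch I mentioned. This is a hypothetical, simplified illustration (the names do not correspond to Druid's actual classes), just to show why old data stays readable as a format evolves:

```java
import java.nio.ByteBuffer;

// Minimal, hypothetical sketch of the version-byte pattern; these names do
// not correspond to Druid's actual classes. Writers stamp a format version
// first, and readers branch on it, so newer code can keep reading columns
// written by older versions.
public class VersionedColumnFormat
{
  private static final byte VERSION_3 = 0x03;

  public static void writeHeader(ByteBuffer buffer)
  {
    // The first byte of the serialized column identifies the format version ...
    buffer.put(VERSION_3);
    // ... followed by the version-specific payload.
  }

  public static String readHeader(ByteBuffer buffer)
  {
    // Readers dispatch on the version byte, so old layouts stay readable
    // even after the format changes, as long as the old branches are kept.
    final byte version = buffer.get();
    switch (version) {
      case VERSION_3:
        return "v3 nested column layout";
      default:
        throw new IllegalArgumentException("unknown column version: " + version);
    }
  }
}
```

The "good version hygiene" I mentioned is essentially just never removing a `case` that shipped in a release, and bumping the version whenever the payload layout changes.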
Back on arrays: I do have ideas for how to better support nested arrays, and my goal is to allow arrays extracted from nested columns to be exposed as Druid `ARRAY` types, but I am not there yet. So I'm not sure I would recommend most array use cases, unless the arrays are more like vectors with expected lengths and known/meaningful positions (such that most queries would extract specific array positions rather than entire arrays).

There is also the matter of the different performance characteristics of these columns at both ingest and query time. Ingestion-time segment merge is pretty heavy right now because the global value dictionary is stored on heap. Query performance can vary a fair bit with nested columns compared to flat columns, especially with numbers, due to the existence of indexes on these numeric columns; currently this at least sometimes results in dramatically faster, but also sometimes slower, query performance. I'm still exploring this quite a bit; besides the documentation follow-up, I have also been working on some benchmarking to see where things currently stand, and I plan on sharing those results relatively soon.

So, long story short: due to the unknowns, I think the answer for right now is that operators should experiment with `COMPLEX<json>` columns to see if they work well for their use case, use them in production if so, and otherwise provide feedback so that we can continue to make improvements and expand the use cases this is good for.
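As an aside, for anyone curious about the dictionary layout mentioned in the list above, here is a rough conceptual sketch of the local-to-global mapping that `FixedIndexed` stores for each nested field column. The class and method names are hypothetical, not Druid's actual code:

```java
import java.util.Arrays;
import java.util.List;

// Rough conceptual sketch (hypothetical names, not Druid's real code) of the
// local-to-global dictionary mapping stored for each nested field column.
public class LocalToGlobalDictionary
{
  // Global dictionary: every distinct value across all nested fields in the
  // column, shared so that equal values are stored only once.
  private final List<String> globalDictionary;

  // Per-field mapping from small, dense local ids to global dictionary ids;
  // this is the piece that FixedIndexed-style storage holds per nested field.
  private final int[] localToGlobal;

  public LocalToGlobalDictionary(List<String> globalDictionary, int[] localToGlobal)
  {
    this.globalDictionary = globalDictionary;
    this.localToGlobal = localToGlobal;
  }

  // Rows in a nested field column store local ids; resolving a value goes
  // local id -> global id -> value.
  public String lookup(int localId)
  {
    return globalDictionary.get(localToGlobal[localId]);
  }

  public static void main(String[] args)
  {
    List<String> global = Arrays.asList("bar", "baz", "foo");
    // This field only ever contains "baz" and "foo" (global ids 1 and 2).
    LocalToGlobalDictionary field = new LocalToGlobalDictionary(global, new int[]{1, 2});
    System.out.println(field.lookup(0)); // prints "baz"
  }
}
```

The heap cost of holding that shared global dictionary during segment merge is the ingest-time rough spot mentioned above.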
