Can we have a hybrid approach between #2 and #3? That is, if schema_inference_fn(includeInMemoryData: bool) is called with true, we take #2; otherwise, we take #3. A rough sketch of what I mean is below.
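This is just a minimal sketch of the dispatch, assuming hypothetical
names (InferredSchema, aggregateFlushedSchemas,
inferFromMemoryComponents) - none of these are existing AsterixDB APIs,
only an illustration of the control flow:

    // Hypothetical sketch only - not actual AsterixDB code.
    interface InferredSchema {
        // Merges another inferred schema into this one and returns the
        // combined result.
        InferredSchema merge(InferredSchema other);
    }

    class SchemaInferenceFn {
        InferredSchema infer(boolean includeInMemoryData) {
            // Always start from the schemas persisted with the flushed
            // components (#3); no flush is forced, so the call stays
            // side-effect free, as Mike prefers.
            InferredSchema result = aggregateFlushedSchemas();
            if (includeInMemoryData) {
                // Opting in to #2: also scan the in-memory components
                // and merge their inferred schema into the result.
                result = result.merge(inferFromMemoryComponents());
            }
            return result;
        }

        // Placeholders for the real aggregation/inference machinery.
        InferredSchema aggregateFlushedSchemas() {
            throw new UnsupportedOperationException();
        }

        InferredSchema inferFromMemoryComponents() {
            throw new UnsupportedOperationException();
        }
    }

Callers that just want to populate a UI could pass false and accept a
slightly stale result; callers that need the freshest view pass true
and pay for the in-memory scan.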
Best,
Taewoo

On Thu, Apr 3, 2025 at 2:59 PM Mike Carey <dtab...@gmail.com> wrote:

> Without having an accompanying "difficulty score", I'd much prefer
> option 2 over option 1 - flushes create immutable components, so it
> would seem kind of too bad to end up creating unnecessary components
> as a side effect of calling the schema inference function. Also, it
> seems like it would be much cleaner for function calls to be
> side-effect free. (Even though this would be an invisible side effect
> from a user's view.)
>
> There is a third approach that it would be interesting to get folks'
> take on - namely, we could simply document the fact that the schema
> function returns the schema for almost all of the data, but not
> including the most recent objects that are so new that they are still
> in memory. To me this is not necessarily unacceptable - hey, 10 msec
> after the schema function returns its result, a new object could come
> in that didn't get considered in its schema computation, making the
> just-returned result outdated anyway. Note that the main use of
> inferred schema information will be to guide query writers by
> allowing the UI to request and show that information. For collections
> whose type structures are largely stable, it seems unlikely that this
> more relaxed approach would miss much or be unacceptable. Others'
> thoughts on that?
>
> Note that the storage layer periodically flushes components that have
> been sitting around for a while (does anyone know the period?) to
> avoid issues where a component isn't quite full enough to justify a
> flush and could linger unflushed for "too long" otherwise. This
> should limit the degree to which things can become outdated.
>
> Cheers,
>
> Mike
>
> On 4/3/25 12:39 PM, Calvin Dani wrote:
> > Hi,
> >
> > I'm exploring schema inference from columnar storage, where tuple
> > compaction infers the schema and stores it in the LSM Tree. I've
> > found a way to aggregate the inferred schemas from all LSM Trees
> > across each NC and data partition.
> >
> > The concern now is handling unflushed data. There seem to be two
> > possible approaches:
> >
> > 1. Force a flush and then aggregate all inferred schemas.
> >
> > 2. Infer the schema from unflushed data and aggregate it with the
> > existing schema.
> >
> > Would this be the right direction, or is there a better
> > alternative? Also, for option 2, is there a mechanism to
> > efficiently read only unflushed records?
> >
> > Looking forward to your thoughts.
> >
> > Best regards,
> >
> > Calvin Dani
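P.S. For the aggregation step Calvin mentioned (and the merge in
option 2), here is a rough illustration of what a field-wise schema
merge could look like. All of the names (FieldSchema, fieldTypes, the
"union(...)" widening) are made up for this sketch and are not the
actual tuple-compaction types:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical illustration only: maps field name -> type tag and
    // widens conflicting tags to a "union(...)" marker.
    final class FieldSchema {
        final Map<String, String> fieldTypes = new HashMap<>();

        // Merge another component's (or partition's) inferred schema
        // into this one: new fields are added, conflicts are widened.
        void mergeWith(FieldSchema other) {
            other.fieldTypes.forEach((field, type) ->
                fieldTypes.merge(field, type,
                    (a, b) -> a.equals(b) ? a : "union(" + a + "," + b + ")"));
        }

        // Fold the per-LSM-tree schemas gathered from every NC and
        // data partition into one collection-level schema.
        static FieldSchema aggregate(List<FieldSchema> perComponent) {
            FieldSchema result = new FieldSchema();
            perComponent.forEach(result::mergeWith);
            return result;
        }
    }

Under option 2 (or the hybrid with includeInMemoryData = true), the
schema inferred from the unflushed in-memory component would simply be
one more entry in that per-component list.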