Can we have a hybrid approach between #2 and #3? That is, if schema_inference_fn(includeInMemoryData: bool) is called with true, we take #2; otherwise, we take #3. A rough sketch of what I mean is below.
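This is just a minimal sketch of the dispatch, assuming hypothetical
names (InferredSchema, aggregateFlushedSchemas,
inferFromMemoryComponents) - none of these are existing AsterixDB APIs,
only an illustration of the control flow:

    // Hypothetical sketch only - not actual AsterixDB code.
    interface InferredSchema {
        // Merges another inferred schema into this one and returns the
        // combined result.
        InferredSchema merge(InferredSchema other);
    }

    class SchemaInferenceFn {
        InferredSchema infer(boolean includeInMemoryData) {
            // Always start from the schemas persisted with the flushed
            // components (#3); no flush is forced, so the call stays
            // side-effect free, as Mike prefers.
            InferredSchema result = aggregateFlushedSchemas();
            if (includeInMemoryData) {
                // Opting in to #2: also scan the in-memory components
                // and merge their inferred schema into the result.
                result = result.merge(inferFromMemoryComponents());
            }
            return result;
        }

        // Placeholders for the real aggregation/inference machinery.
        InferredSchema aggregateFlushedSchemas() {
            throw new UnsupportedOperationException();
        }

        InferredSchema inferFromMemoryComponents() {
            throw new UnsupportedOperationException();
        }
    }

Callers that just want to populate a UI could pass false and accept a
slightly stale result; callers that need the freshest view pass true
and pay for the in-memory scan.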
Best,
Taewoo

On Thu, Apr 3, 2025 at 2:59 PM Mike Carey <dtab...@gmail.com> wrote:

> Without having an accompanying "difficulty score", I'd much prefer
> option 2 over option 1 - flushes create immutable components, so it
> would seem kind of too bad to end up creating unnecessary components
> as a side effect of calling the schema inference function. Also, it
> seems like it would be much cleaner for function calls to be
> side-effect free. (Even though this would be an invisible side effect
> from a user's view.)
>
> There is a third approach that it would be interesting to get folks'
> take on - namely, we could simply document the fact that the schema
> function returns the schema for almost all of the data, but not
> including the most recent objects that are so new that they are still
> in memory. To me this is not necessarily unacceptable - hey, 10 msec
> after the schema function returns its result, a new object could come
> in that didn't get considered in its schema computation, making the
> just-returned result outdated anyway. Note that the main use of
> inferred schema information will be to guide query writers by
> allowing the UI to request and show that information. For collections
> whose type structures are largely stable, it seems unlikely that this
> more relaxed approach would miss much or be unacceptable. Others'
> thoughts on that?
>
> Note that the storage layer periodically flushes components that have
> been sitting around for a while (does anyone know the period?) to
> avoid issues where a component isn't quite full enough to justify a
> flush and could linger unflushed for "too long" otherwise. This
> should limit the degree to which things can become outdated.
>
> Cheers,
>
> Mike
>
> On 4/3/25 12:39 PM, Calvin Dani wrote:
> > Hi,
> >
> > I'm exploring schema inference from columnar storage, where tuple
> > compaction infers the schema and stores it in the LSM Tree. I've
> > found a way to aggregate the inferred schemas from all LSM Trees
> > across each NC and data partition.
> >
> > The concern now is handling unflushed data. There seem to be two
> > possible approaches:
> >
> > 1. Force a flush and then aggregate all inferred schemas.
> >
> > 2. Infer the schema from unflushed data and aggregate it with the
> > existing schema.
> >
> > Would this be the right direction, or is there a better
> > alternative? Also, for option 2, is there a mechanism to
> > efficiently read only unflushed records?
> >
> > Looking forward to your thoughts.
> >
> > Best regards,
> >
> > Calvin Dani
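P.S. For the aggregation step Calvin mentioned (and the merge in
option 2), here is a rough illustration of what a field-wise schema
merge could look like. All of the names (FieldSchema, fieldTypes, the
"union(...)" widening) are made up for this sketch and are not the
actual tuple-compaction types:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical illustration only: maps field name -> type tag and
    // widens conflicting tags to a "union(...)" marker.
    final class FieldSchema {
        final Map<String, String> fieldTypes = new HashMap<>();

        // Merge another component's (or partition's) inferred schema
        // into this one: new fields are added, conflicts are widened.
        void mergeWith(FieldSchema other) {
            other.fieldTypes.forEach((field, type) ->
                fieldTypes.merge(field, type,
                    (a, b) -> a.equals(b) ? a : "union(" + a + "," + b + ")"));
        }

        // Fold the per-LSM-tree schemas gathered from every NC and
        // data partition into one collection-level schema.
        static FieldSchema aggregate(List<FieldSchema> perComponent) {
            FieldSchema result = new FieldSchema();
            perComponent.forEach(result::mergeWith);
            return result;
        }
    }

Under option 2 (or the hybrid with includeInMemoryData = true), the
schema inferred from the unflushed in-memory component would simply be
one more entry in that per-component list.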