That sounds reasonable.  I was about to argue strongly for just doing #3, because I think most steady-state uses of the function will be fine with that - because of the periodic flushing of quiet components, the on-disk schema info won't be more than an hour or so out of date.  HOWEVER: an important use case for more current information is the getting-started case, where a user loads some collections with small-ish data and then wants to know their schema - waiting an hour before asking for that would not be a good user experience.  (That use case just occurred to me.)  SO: I vote for further investigation into what it would take to get the schema info including the in-memory component(s), but WITHOUT flushing them...

Cheers,

Mike

On 4/4/25 8:58 AM, Taewoo Kim wrote:
Can we have a hybrid approach between #2 and #3? I mean, if
schema_inference_fn(includeInMemoryData:bool) is called with true, we take
#2. If not, we take #3.
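A minimal sketch of what that hybrid flag could look like (everything here - the function signature, the helper, and the flat `{field: set_of_types}` schema representation - is a hypothetical illustration, not an actual AsterixDB API):

```python
# Hypothetical sketch of the hybrid approach. A schema is modeled here as
# {field_name: set_of_type_names}; the real columnar schema is richer.

def merge_schemas(a, b):
    """Union two inferred schemas field-by-field."""
    merged = {field: set(types) for field, types in a.items()}
    for field, types in b.items():
        merged.setdefault(field, set()).update(types)
    return merged

def schema_inference_fn(disk_schema, memory_component_schemas,
                        include_in_memory_data=False):
    """Return the on-disk schema; if include_in_memory_data is true,
    also fold in schemas inferred from the unflushed in-memory
    components (approach #2), without ever forcing a flush.
    Otherwise return the flushed-only view (approach #3)."""
    result = dict(disk_schema)
    if include_in_memory_data:
        for mem_schema in memory_component_schemas:
            result = merge_schemas(result, mem_schema)
    return result
```

With the flag false the call stays cheap and side-effect free; with it true the caller pays for walking the in-memory components but still creates no extra immutable components.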

Best,
Taewoo


On Thu, Apr 3, 2025 at 2:59 PM Mike Carey<dtab...@gmail.com> wrote:

Without having an accompanying "difficulty score", I'd much prefer
option 2 over option 1 - flushes create immutable components, so it
would seem kind of too bad to end up creating unnecessary components as
a side effect of calling the schema inference function.  Also, it seems
like it would be much cleaner for function calls to be side-effect
free.  (Even though this would be an invisible side effect from a user's
view.)

There is a third approach that it would be interesting to get folks'
take on - namely, we could simply document the fact that the schema
function returns the schema for almost all of the data, but not
including the most recent objects that are so new that they are
still in memory. To me this is not necessarily unacceptable - hey, 10
msec after the schema function returns its result a new object could
come in that didn't get considered in its schema computation, making the
just-returned result outdated anyway.  Note that the main use of
inferred schema information will be to guide query writers by allowing
the UI to request and show that information.  For collections whose type
structures are largely stable, it seems unlikely that this more relaxed
approach would miss much or be unacceptable.  Others' thoughts on that?

Note that the storage layer periodically flushes components that have
been sitting around for a while (does anyone know the period?) to avoid
issues where a component isn't quite full enough to justify a flush and
could otherwise linger unflushed for "too long".  This
should limit the degree to which things can become outdated.

Cheers,

Mike

On 4/3/25 12:39 PM, Calvin Dani wrote:
Hi,

I’m exploring schema inference from columnar storage, where tuple
compaction infers the schema and stores it in the LSM Tree. I’ve found a
way to aggregate the inferred schemas from all LSM Trees across each NC
and data partition.
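To make that aggregation step concrete, here is a toy sketch of folding per-NC/per-partition inferred schemas into one global schema (purely illustrative: it models a schema as nested dicts with sets of type names at the leaves, which is much simpler than the real columnar schema with its union and array types):

```python
from functools import reduce

def merge_schemas(a, b):
    """Recursively union two inferred schemas: nested objects are
    merged field-by-field; leaf fields collect all observed types."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = {}
        for field in a.keys() | b.keys():
            if field in a and field in b:
                merged[field] = merge_schemas(a[field], b[field])
            else:
                merged[field] = a.get(field, b.get(field))
        return merged
    # Leaf level: union the sets of observed scalar types.
    return set(a) | set(b)

def aggregate(partition_schemas):
    """Fold the per-partition schemas into one global inferred schema."""
    return reduce(merge_schemas, partition_schemas)
```

The same merge would apply whether the inputs come from flushed components only or also include schemas inferred from in-memory data.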

The concern now is handling unflushed data. There seem to be two possible
approaches:

1. Force a flush and then aggregate all inferred schemas.

2. Infer schema from unflushed data and aggregate it with the existing
schema.

Would this be the right direction, or is there a better alternative? Also,
for option 2, is there a mechanism to efficiently read only unflushed
records?

Looking forward to your thoughts.

Best regards,

Calvin Dani
