Without having an accompanying "difficulty score", I'd much prefer option 2 over option 1 - flushes create immutable components, so it would seem kind of too bad to end up creating unnecessary components as a side effect of calling the schema inference function.  Also, it seems like it would be much cleaner for function calls to be side-effect free.  (Even though this would be an invisible side effect from a user's view.)

There is a third approach that it would be interesting to get folks' take on - namely, we could simply document the fact that the schema function returns the schema for almost all of the data, but not including the most recent objects that are so new that they are still in memory. To me this is not necessarily unacceptable - hey, 10 msec after the schema function returns its result, a new object could come in that didn't get considered in its schema computation, making the just-returned result outdated anyway.  Note that the main use of inferred schema information will be to guide query writers by allowing the UI to request and show that information.  For collections whose type structures are largely stable, it seems unlikely that this more relaxed approach would miss much or be unacceptable.  Others' thoughts on that?

Note that the storage layer periodically flushes components that have been sitting around for awhile (does anyone know the period?) to avoid issues where a component isn't quite full enough to justify a flush and could otherwise linger unflushed for "too long".  This should limit the degree to which things can become outdated.
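For concreteness, the relaxed approach might look something like the sketch below. This is just an illustration - the dict-based schema representation, the "union" widening rule, and the merge_schemas helper are all hypothetical stand-ins, not the actual inferred-schema format that the tuple compactor stores in the LSM components:

```python
# Hypothetical sketch of the "relaxed" aggregation: merge the inferred
# schemas of flushed (on-disk) components only, skipping any in-memory
# component whose data hasn't been flushed (and thus inferred) yet.

def merge_schemas(schemas):
    """Aggregate per-component inferred schemas into one collection schema."""
    merged = {}
    for schema in schemas:
        for field, ftype in schema.items():
            if field not in merged:
                merged[field] = ftype
            elif merged[field] != ftype:
                merged[field] = "union"  # conflicting types widen to a union
    return merged

# Two flushed components (schemas inferred at flush time) ...
flushed = [
    {"id": "int", "name": "string"},
    {"id": "int", "name": "string", "age": "int"},
]
# ... and one in-memory component the relaxed approach simply ignores.
in_memory = {"id": "string"}

print(merge_schemas(flushed))
# The result reflects everything on disk; the in-memory component's
# contribution shows up only after its eventual flush.
```

The point of the sketch is just that the aggregation never has to touch unflushed data at all, which is what makes the staleness window bounded by the periodic-flush interval mentioned above.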

Cheers,

Mike

On 4/3/25 12:39 PM, Calvin Dani wrote:
Hi,

I’m exploring schema inference from columnar storage, where tuple
compaction infers the schema and stores it in the LSM Tree. I’ve found a
way to aggregate the inferred schemas from all LSM Trees across each NC and
data partition.

The concern now is handling unflushed data. There seem to be two possible
approaches:

1. Force a flush and then aggregate all inferred schemas.

2. Infer the schema from unflushed data and aggregate it with the existing schema.

Would this be the right direction, or is there a better alternative? Also,
for option 2, is there a mechanism to efficiently read only unflushed
records?

Looking forward to your thoughts.

Best regards,

Calvin Dani
