Without having an accompanying "difficulty score", I'd much prefer
option 2 over option 1 - flushes create immutable components, so it
seems unfortunate to create unnecessary components as a side effect of
calling the schema inference function. Also, it seems
like it would be much cleaner for function calls to be side-effect
free. (Even though this would be an invisible side effect from a user's
view.)
There is a third approach that it would be interesting to get folks'
take on - namely, we could simply document the fact that the schema
function returns the schema for almost all of the data, excluding only
the most recent objects that are so new that they are
still in memory. To me this is not necessarily unacceptable - hey, 10
msec after the schema function returns its result a new object could
come in that didn't get considered in its schema computation, making the
just-returned result outdated anyway. Note that the main use of
inferred schema information will be to guide query writers by allowing
the UI to request and show that information. For collections whose type
structures are largely stable, it seems unlikely that this more relaxed
approach would miss much or be unacceptable. Others' thoughts on that?
Note that the storage layer periodically flushes components that have
been sitting around for a while (does anyone know the period?) to avoid
cases where a component never quite fills enough to justify a flush and
would otherwise linger unflushed for "too long". This should limit the
degree to which things can become outdated.
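To make option 2 a bit more concrete, here is a rough sketch of the
aggregation step: merging the schema gathered from flushed (immutable)
components with a schema inferred from the current in-memory component,
without forcing a flush. The class, field names, and string-based type
labels here are all hypothetical placeholders - AsterixDB's actual
columnar schema representation is richer than a field-name-to-type map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of option 2: merge the schema aggregated from
// flushed components with one inferred from the in-memory component.
// Conflicting types for the same field widen to a "union" type.
public class SchemaMergeSketch {
    static Map<String, String> merge(Map<String, String> flushed,
                                     Map<String, String> inMemory) {
        Map<String, String> merged = new LinkedHashMap<>(flushed);
        inMemory.forEach((field, type) ->
            merged.merge(field, type,
                (a, b) -> a.equals(b) ? a : "union(" + a + "," + b + ")"));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> flushed = new LinkedHashMap<>();
        flushed.put("id", "int64");
        flushed.put("name", "string");

        Map<String, String> inMemory = new LinkedHashMap<>();
        inMemory.put("id", "int64");
        inMemory.put("ts", "datetime");

        // Prints {id=int64, name=string, ts=datetime}
        System.out.println(merge(flushed, inMemory));
    }
}
```

The same pairwise merge could then be applied again when aggregating
the per-partition results across NCs.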
Cheers,
Mike
On 4/3/25 12:39 PM, Calvin Dani wrote:
Hi,
I’m exploring schema inference from columnar storage, where tuple
compaction infers the schema and stores it in the LSM Tree. I’ve found a
way to aggregate the inferred schemas from all LSM Trees across each NC and
data partition.
The concern now is handling unflushed data. There seem to be two possible
approaches:
1. Force a flush and then aggregate all inferred schemas.
2. Infer the schema from unflushed data and aggregate it with the existing schema.
Would this be the right direction, or is there a better alternative? Also,
for option 2, is there a mechanism to efficiently read only unflushed
records?
Looking forward to your thoughts.
Best regards,
Calvin Dani