Without having an accompanying "difficulty score", I'd much prefer
option 2 over option 1 - flushes create immutable components, so it
seems unfortunate to create unnecessary components as a side effect of
calling the schema inference function. Also, it seems
like it would be much cleaner for function calls to be side-effect
free. (Even though this would be an invisible side effect from a user's
view.)
There is a third approach that it would be interesting to get folks'
take on - namely, we could simply document the fact that the schema
function returns the schema for almost all of the data, excluding only
the most recent objects that are so new that they are
still in memory. To me this is not necessarily unacceptable - hey, 10
msec after the schema function returns its result a new object could
come in that didn't get considered in its schema computation, making the
just-returned result outdated anyway. Note that the main use of
inferred schema information will be to guide query writers by allowing
the UI to request and show that information. For collections whose type
structures are largely stable, it seems unlikely that this more relaxed
approach would miss much or be unacceptable. Others' thoughts on that?
Note that the storage layer periodically flushes components that have
been sitting around for a while (does anyone know the period?) to avoid
cases where a component never quite fills enough to justify a flush and
would otherwise linger unflushed for "too long". This should limit the
degree to which things can become outdated.
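To make option 2 a bit more concrete, here is a rough sketch of the
aggregation step: merging the schema gathered from flushed (immutable)
components with a schema inferred from the current in-memory component,
without forcing a flush. The class, field names, and string-based type
labels here are all hypothetical placeholders - AsterixDB's actual
columnar schema representation is richer than a field-name-to-type map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of option 2: merge the schema aggregated from
// flushed components with one inferred from the in-memory component.
// Conflicting types for the same field widen to a "union" type.
public class SchemaMergeSketch {
    static Map<String, String> merge(Map<String, String> flushed,
                                     Map<String, String> inMemory) {
        Map<String, String> merged = new LinkedHashMap<>(flushed);
        inMemory.forEach((field, type) ->
            merged.merge(field, type,
                (a, b) -> a.equals(b) ? a : "union(" + a + "," + b + ")"));
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> flushed = new LinkedHashMap<>();
        flushed.put("id", "int64");
        flushed.put("name", "string");

        Map<String, String> inMemory = new LinkedHashMap<>();
        inMemory.put("id", "int64");
        inMemory.put("ts", "datetime");

        // Prints {id=int64, name=string, ts=datetime}
        System.out.println(merge(flushed, inMemory));
    }
}
```

The same pairwise merge could then be applied again when aggregating
the per-partition results across NCs.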
Cheers,
Mike
On 4/3/25 12:39 PM, Calvin Dani wrote:
Hi,
I’m exploring schema inference from columnar storage, where tuple
compaction infers the schema and stores it in the LSM Tree. I’ve found a
way to aggregate the inferred schemas from all LSM Trees across each NC and
data partition.
The concern now is handling unflushed data. There seem to be two possible
approaches:
1. Force a flush and then aggregate all inferred schemas.
2. Infer the schema from unflushed data and aggregate it with the existing schema.
Would this be the right direction, or is there a better alternative? Also,
for option 2, is there a mechanism to efficiently read only unflushed
records?
Looking forward to your thoughts.
Best regards,
Calvin Dani