I think this may need some more discussion.

To me, a "serialized IR" is another form of a "dialect". In this case, this
dialect will be mostly specific to Iceberg, and compute engines will still
support reading views in their native SQL. There are some data points on
this from the Trino community in a previous discussion [1]. In addition to
not being directly consumable by engines, a serialized IR will also be hard
for humans to read.

From that perspective, even if Iceberg adopts some form of a serialized IR,
we will again end up doing translation: from that IR to the engine's
dialect at view read time, and from the engine's dialect to that IR at
view write time. So a serialized IR cannot eliminate translation.

I think it is better not to adopt the serialized IR path quickly, before it
is proven to work and there is sufficient tooling and support around it;
otherwise it will end up being a constraint.

For Coral vs SQLGlot (Disclaimer: I maintain Coral): There are some
fundamental differences between their approaches, mainly around the
intermediate representation abstraction. Coral models both the AST and the
logical plan of a query, making it able to capture the query semantics more
accurately and hence perform precise transformations. On the flip side,
SQLGlot's abstraction is at the AST level only. For example, data type
inference, which is very important for successful translation, would be a
major gap in any solution that does not capture the logical plan. This is
backed up by some experiments we performed on actual queries and their
translation results (from Spark to Trino, comparing the results of Coral
and SQLGlot).
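
To make the type inference point concrete, here is a toy sketch. It is not
Coral or SQLGlot code; the node classes, the to_trino function, and the
column names are all made up for illustration. It relies on one known
difference: in Spark SQL, `/` on two integer columns returns a double, while
in Trino it performs integer division. A translator that only sees the
syntax cannot tell when a cast is needed; one that knows the operand types
can.

```python
from dataclasses import dataclass

# Hypothetical toy nodes, not taken from Coral or SQLGlot.
@dataclass
class Column:
    name: str

@dataclass
class Divide:
    left: Column
    right: Column

INTEGER_TYPES = {"tinyint", "smallint", "int", "bigint"}

def to_trino(expr, schema=None):
    """Render a Spark-style division as Trino SQL.

    schema maps column name -> type name. When it is None, operand
    types are unknown, as in a purely syntactic (AST-only) rewrite.
    """
    left, right = expr.left.name, expr.right.name
    if schema is None:
        # No type information: the expression is emitted unchanged.
        return f"{left} / {right}"
    if schema[left] in INTEGER_TYPES and schema[right] in INTEGER_TYPES:
        # Cast one operand so Trino does not fall back to integer division.
        return f"CAST({left} AS DOUBLE) / {right}"
    return f"{left} / {right}"

expr = Divide(Column("clicks"), Column("views"))
ast_only = to_trino(expr)
typed = to_trino(expr, {"clicks": "bigint", "views": "bigint"})
print(ast_only)
print(typed)
```

The first output would silently change the result in Trino when both
columns are integers; the second preserves the Spark semantics.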

For the IR: Any translation solution (including Coral) must rely on an IR,
and it has to be decoupled from any of the input and output dialects. This
is true in the Coral case today. Such an IR is the way to represent both
the intermediate AST and the logical plans. Therefore, I do not think we can
necessarily split projects into "IR projects" and others, since all
solutions must use an IR. With that said, IR serialization is a matter of
project staging/milestones. A serialized IR is next on Coral's
roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make
Iceberg interoperable with a Coral-based serialized IR. This would make
compatibility with engines that adopt Coral (like Trino) much more robust
and straightforward.

[1] https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002

Thanks,
Walaa.
