Hi, I have no experience with Substrait, but I agree it looks like the tool for the job. Or, as I proposed earlier, we define our own Iceberg IR for the views.
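For concreteness, here is a minimal Python sketch of how such an IR could sit next to plain SQL in a view's list of representations, following the shape of the Iceberg view spec's SQL representations. The "iceberg-ir" dialect name and the serialized payload are hypothetical, just to illustrate the idea:

```python
# Sketch: an Iceberg view version carries a list of representations.
# Today these are SQL strings tagged with a dialect; a serialized IR
# could ride along as just another entry under a reserved dialect name.
# The "iceberg-ir" name and its payload below are hypothetical.

representations = [
    {"type": "sql", "dialect": "spark", "sql": "SELECT id, ts FROM db.events"},
    {"type": "sql", "dialect": "iceberg-ir", "sql": '{"plan": "...serialized IR..."}'},
]

def pick_representation(reps, engine_dialect):
    """Prefer the engine's native dialect; fall back to the shared IR."""
    for rep in reps:
        if rep["dialect"] == engine_dialect:
            return rep
    for rep in reps:
        if rep["dialect"] == "iceberg-ir":
            return rep  # the engine would decode the IR instead of parsing SQL
    raise ValueError(f"no usable representation for {engine_dialect}")

print(pick_representation(representations, "trino")["dialect"])  # falls back to the IR
```

The point of the fallback is exactly the "more views would just work" property: an engine with no native-SQL entry can still consume the view via the shared IR.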
We can experiment with serialized IR being stored as a String with a new dialect name, and this is how we should get this started. It's probably a good end solution as well, but the important value-add is if we manage to converge towards one shared IR that's "native to Iceberg". This would be best for the users -- more views would just work. And best for the long-term evolution of the project -- a standardized IR would help things like incremental refreshes (for materialized views).

Best
Piotr

On Mon, 28 Oct 2024 at 18:30, Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:

> Hi Fokko,
>
> We can implement Python/Rust/Go clients to interop with the serialized Coral IR. I am not sure it makes sense to have all front-end and back-end implementations (e.g., Spark to Coral IR, or Coral IR to Trino, etc.) be reimplemented in those languages. Such implementations actually depend on reusing the native parsers of those dialects, which are typically in Java (this also speaks to your point about language coverage -- reusing native parsers is a principle that Coral follows to stay compliant with the source dialect). I think making Python/Rust/Go interop with and handle the IR (i.e., convert the serialized IR to the in-memory IR and the other way around) would be a good start. Python-specific backend and front-end implementations, for example, can follow from that.
>
> Thanks,
> Walaa.
>
> On Mon, Oct 28, 2024 at 6:05 AM Fokko Driesprong <fo...@apache.org> wrote:
>
>> Hey everyone,
>>
>> Views in PyIceberg are not yet as mature as in Java, mostly because tooling in Python tends to work with data frames rather than SQL. I do think it would be valuable to extend support there.
>>
>> I have a bit of experience in turning SQL into ASTs and extending grammars, and I'm confident in saying that it is nearly impossible to cover all the grammar of a specific dialect. My main question is: what will SQLGlot do when we try to translate a dialect that it doesn't fully understand?
>> Will it error out, or will it produce faulty SQL? Simple examples range from functions that are not supported in other engines up to recursive CTEs. In this case, not failing upfront would cause correctness issues.
>>
>> Regarding Substrait: within PyIceberg there was also successful experimentation with taking a DuckDB query, sending it to PyIceberg to do the Iceberg query planning, and returning a physical plan to DuckDB for the actual execution. This was still at an early stage and required a lot of work around credentials and field-IDs, but it was quite promising. Using Substrait for views looks easier to me, and would also translate to a dataframe-based world. Walaa, do you have any outlook on Coral Python/Rust/Go support?
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, Oct 25, 2024 at 10:16 PM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote:
>>
>>> I think this may need some more discussion.
>>>
>>> To me, a "serialized IR" is another form of a "dialect". In this case, this dialect will be mostly specific to Iceberg, and compute engines will still support reading views in their native SQL. There are some data points on this from the Trino community in a previous discussion [1]. In addition to not being directly consumable by engines, a serialized IR will be hard for humans to consume too.
>>>
>>> From that perspective, even if Iceberg adopts some form of serialized IR, we will end up doing translation again: from that IR to the engine's dialect at view read time, and from the engine's dialect to that IR at view write time. So a serialized IR cannot eliminate translation.
>>>
>>> I think it is better not to adopt the serialized IR path quickly until it is proven to work and there is sufficient tooling and support around it; otherwise it will end up being a constraint.
>>> For Coral vs. SQLGlot (disclaimer: I maintain Coral): there are some fundamental differences between their approaches, mainly around the intermediate representation abstraction. Coral models both the AST and the logical plan of a query, making it able to capture the query semantics more accurately and hence perform precise transformations. On the flip side, SQLGlot's abstraction is at the AST level only. Data type inference, for example, would be a major gap in any solution that does not capture the logical plan, yet it is very important for successful translation. This is backed up by some experiments we performed on actual queries and their translation results (from Spark to Trino, comparing the results of Coral and SQLGlot).
>>>
>>> For the IR: any translation solution (including Coral) must rely on an IR, and it has to be decoupled from all of the input and output dialects. This is true in the Coral case today. Such an IR is the way to represent both the intermediate AST and logical plans. Therefore, I do not think we can necessarily split projects into "IR projects" vs. not, since all solutions must use an IR. With that said, IR serialization is a matter of staging/milestones of the project. Serialized IR is next on Coral's roadmap. If Iceberg ends up adopting an IR, it might be a good idea to make Iceberg interoperable with a Coral-based serialized IR. This would make compatibility with engines that adopt Coral (like Trino) much more robust and straightforward.
>>>
>>> [1] https://github.com/trinodb/trino/pull/19818#issuecomment-1925894002
>>>
>>> Thanks,
>>> Walaa.
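Walaa's point above about type inference can be illustrated with a toy example. This is not Coral or SQLGlot code, just a sketch under the (to my knowledge, correct) assumption that Spark's `/` on two integers yields a double (1/2 = 0.5) while Trino's `/` truncates integers (1/2 = 0); the `column_types` map stands in for the schema information a logical plan would carry:

```python
# Toy illustration of why AST-only translation falls short: a faithful
# Spark -> Trino rewrite of `a / b` must know the operand types, which
# the parse tree alone does not provide. Hypothetical helper, not a
# real Coral/SQLGlot API.

def translate_division(lhs, rhs, column_types):
    """Rewrite Spark `lhs / rhs` for Trino, given a {column: type} map."""
    def type_of(expr):
        # Assume column references only; literal typing is out of scope here.
        return column_types.get(expr, "double")
    if type_of(lhs) in ("int", "bigint") and type_of(rhs) in ("int", "bigint"):
        # Preserve Spark's fractional semantics under Trino's integer division.
        return f"CAST({lhs} AS double) / {rhs}"
    return f"{lhs} / {rhs}"

schema = {"a": "int", "b": "int", "x": "double"}
print(translate_division("a", "b", schema))  # cast needed
print(translate_division("x", "b", schema))  # already fractional, left alone
```

An AST-level translator that copies `a / b` through unchanged would silently change query results here, which is exactly the "not failing upfront would cause correctness issues" concern raised earlier in the thread.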