+1 for this idea. We could visualize it as JSON, but given its nested and lengthy nature, it would be quite hard to show or actually publish it once we have a very complex query.
As Weston suggested, at the moment we have a lengthy Substrait test file mainly because most of the space is taken up by these verbose JSON plans. And if this becomes a successful feature, I guess Substrait could adopt it too.

On Thu, Nov 10, 2022 at 7:04 AM Weston Pace <weston.p...@gmail.com> wrote:

> > To be honest I find this YAML-based representation a bit confusing due to the unclear parameters of functions.
>
> To be fair, the YAML was my fault :). I don't think Sasha has shown us yet what the format for relation parameters looks like.
>
> > Obviously for someone knowledgeable about how Acero works, it is obvious that Join takes two inputs, but it's still a bit unclear which ones those two inputs are.
>
> I suppose it will depend on what the use case is. If this is a format for those expert in the tool to compactly write queries, then I think this is OK. For example, my biggest desire for a feature like this would be to simplify a lot of the Substrait tests that we have (e.g. serde_test.cc), which at the moment consist of a lot of hard-coded and unreadable JSON.
>
> > as that would more easily allow developers to construct pipelines at runtime in their own favorite language and then compile them to those. But at that point, aren't we reimplementing Substrait?
>
> Substrait does not have a text (i.e. human-readable and human-writable) serialization. If anything this is reimplementing SQL, which is what people seem to have been using so far to author Substrait, but which is fairly limited in its ability to represent Substrait (e.g. specifying desired corner-case behavior). If I understand correctly, this proposal is targeted at Acero, and not at creating a text representation of Substrait, but, if successful, I think it would represent a pretty good prototype / example that we could then adopt for Substrait. That being said, it might be interesting to get feedback from that community as well, to see if they have any requests or ideas.
>
> > Humans would probably design their pipeline starting from the data source and then applying transformations to it as they think of the next step.
>
> +1
>
> On Tue, Nov 8, 2022 at 6:06 AM Alessandro Molina <alessan...@voltrondata.com.invalid> wrote:
> >
> > To be honest I find this YAML-based representation a bit confusing due to the unclear parameters of functions.
> > In your specific example you have a JOIN taking two sources as its inputs. But how do I know that the two sources are meant to be inputs to the join, and not only that the last source is the input?
> > Obviously for someone knowledgeable about how Acero works, it is obvious that Join takes two inputs, but it's still a bit unclear which ones those two inputs are.
> >
> > I agree with the point of using an easily parsable/writable language like JSON or YAML, as that would more easily allow developers to construct pipelines at runtime in their own favorite language and then compile them to those. But at that point, aren't we reimplementing Substrait?
> >
> > Another thing that came to my mind: the pipeline is written in a way that fits the compiler more than a human. Humans would probably design their pipeline starting from the data source and then applying transformations to it as they think of the next step, while here you need to think backward. Obviously you can append to the top as you write your pipeline, but that's still a bit counterintuitive.
> >
> > Just my two cents.
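As a rough illustration of the serde_test.cc point above, here is a minimal sketch contrasting the kind of hard-coded JSON relation such tests embed with a purely hypothetical compact text form. The JSON shape is simplified from the Substrait field names quoted later in this thread (not copied from serde_test.cc), and the text syntax and constant names are invented for illustration only; no such parser exists in Acero today.

```cpp
// Rough sketch only. kRegionReadJson approximates the style of hard-coded
// Substrait JSON embedded in tests today (field names follow the Substrait
// protos quoted later in this thread; the exact shape is simplified).
// kRegionReadText is a made-up compact text form of the same relation --
// no such syntax or parser exists in Acero yet.
#include <iostream>

const char* const kRegionReadJson = R"({
  "read": {
    "base_schema": {
      "names": ["r_regionkey", "r_name", "r_comment"],
      "struct": {"types": [{"i32": {}}, {"string": {}}, {"string": {}}]}
    },
    "named_table": {"names": ["region"]}
  }
})";

const char* const kRegionReadText =
    "read: named_table=region, "
    "schema=(r_regionkey: i32?, r_name: string?, r_comment: string?)";

int main() {
  std::cout << "JSON form:\n" << kRegionReadJson << "\n\n"
            << "Hypothetical text form:\n" << kRegionReadText << "\n";
  return 0;
}
```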
> > On Thu, Nov 3, 2022 at 8:08 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > Indentation works well when you omit the other arguments (e.g. ...), but once you mix in the arguments for the nodes (especially if those arguments have their own indentation / structure) then it ends up becoming unreadable, I think. I prefer the idea of each node having its own block, with no indentation, and using indentation purely for argument structure. For example (using YAML), consider the query `SELECT n_nationkey, n_name, r_name FROM nation INNER JOIN region ON n_regionkey = r_regionkey`. Note, we don't have a serialization for datasets, so I'm using the Substrait serialization for reads.
> > >
> > > ```
> > > project:
> > >   expressions:
> > >   - "!0"
> > >   - "!1"
> > >   - "!2"
> > >   names:
> > >   - "n_nationkey"
> > >   - "n_name"
> > >   - "r_name"
> > >
> > > join:
> > >   left_keys:
> > >   - "!2"
> > >   right_keys:
> > >   - "!4"
> > >   type: JOIN_TYPE_INNER
> > >
> > > read:
> > >   base_schema:
> > >     names:
> > >     - "r_regionkey"
> > >     - "r_name"
> > >     - "r_comment"
> > >     struct:
> > >       types:
> > >       - i32?
> > >       - string?
> > >       - string?
> > >   named_table:
> > >     names:
> > >     - "region"
> > >
> > > read:
> > >   base_schema:
> > >     names:
> > >     - "n_nationkey"
> > >     - "n_name"
> > >     - "n_regionkey"
> > >     - "n_comment"
> > >     struct:
> > >       types:
> > >       - i32?
> > >       - string?
> > >       - i32?
> > >       - string?
> > >   named_table:
> > >     names:
> > >     - "nation"
> > > ```
> > >
> > > I feel the above is pretty reasonable once you get past the learning curve of prefix processing to build the tree.
> > >
> > > It's not clear that node-level indentation adds much:
> > >
> > > ```
> > > project:
> > >   expressions:
> > >   - "!0"
> > >   - "!1"
> > >   - "!2"
> > >   names:
> > >   - "n_nationkey"
> > >   - "n_name"
> > >   - "r_name"
> > >
> > >   join:
> > >     left_keys:
> > >     - "!2"
> > >     right_keys:
> > >     - "!4"
> > >     type: JOIN_TYPE_INNER
> > >
> > >     read:
> > >       base_schema:
> > >         names:
> > >         - "r_regionkey"
> > >         - "r_name"
> > >         - "r_comment"
> > >         struct:
> > >           types:
> > >           - i32?
> > >           - string?
> > >           - string?
> > >       named_table:
> > >         names:
> > >         - "region"
> > >
> > >     read:
> > >       base_schema:
> > >         names:
> > >         - "n_nationkey"
> > >         - "n_name"
> > >         - "n_regionkey"
> > >         - "n_comment"
> > >         struct:
> > >           types:
> > >           - i32?
> > >           - string?
> > >           - i32?
> > >           - string?
> > >       named_table:
> > >         names:
> > >         - "nation"
> > > ```
> > >
> > > And then I think adding parentheses doesn't make sense. I suppose you could change from YAML to something like Python's or JS's formats for array and dict literals, but I think it would be quite messy.
> > >
> > > On Thu, Nov 3, 2022 at 11:07 AM Percy Camilo Triveño Aucahuasi <percy.camilo...@gmail.com> wrote:
> > > >
> > > > Thanks Sasha!
> > > >
> > > > A nice advantage about parentheses is that most editors can track and highlight the sections between them. Also, those parentheses could be optional when we detect new lines (in case some users don't want to deal with many parentheses); in that case, we would just need to require indentation.
> > > >
> > > > Percy
> > > >
> > > > On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > > > >
> > > > > Hi Percy,
> > > > > Thanks for the input! New lines would be no problem at all, they’d just be treated the same as any other whitespace.
> > > > > One thing to point out about the function-call style when written that way is that it looks a lot like the list style; it's just that there are more parentheses to keep track of, though it does make it more obvious what delineates a subtree.
> > > > >
> > > > > Sasha
> > > > >
> > > > > On Nov 3, 2022, at 10:35, Percy Camilo Triveño Aucahuasi <percy.camilo...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Sasha,
> > > > > >
> > > > > > I like the function call-style variant. Quick question about the parser: do you think we can parse with new lines too? That way it would be even more similar to a JSON-like/declarative approach and could mitigate the nesting issue a bit (which would make it easier to read as well), for instance:
> > > > > >
> > > > > > sink(
> > > > > >   project(
> > > > > >     filter(
> > > > > >       source(
> > > > > >         …)
> > > > > >       …)
> > > > > >     …)
> > > > > >   …)
> > > > > >
> > > > > > Percy
> > > > > >
> > > > > > On Tue, Oct 18, 2022 at 5:54 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi everyone,
> > > > > > > We recently had some discussions about parsing expressions. I currently have a PR [1] up for that, taking into account the feedback. Next I wanted to tackle something for ExecPlans, as manually specifying one using code is currently cumbersome. I'm currently deciding between two variants:
> > > > > > >
> > > > > > > - Function call-style: This would be a similar syntax to the expressions, where we would have something along the lines of `sink(project(filter(source(…)…)…)…)`. The problem with this syntax is that it involves tons of nesting, which, although an improvement over handwriting the C++ code, is still cumbersome to write. On the other hand, this syntax is pretty intuitive and meshes well with the expression syntax. A minor modification could be to make the last argument rather than the first be the input to a node, which would at least keep a node's parameters together.
> > > > > > >
> > > > > > > - List style: This syntax completely eliminates nesting and would probably be easier to write, but has a steeper learning curve. Essentially, since we know how many inputs each type of node takes, we can implicitly reconstruct a tree from a list of node names (formally, we are converting from/to a pre-order traversal of the query tree). For example, it would look something like:
> > > > > > >
> > > > > > > ```
> > > > > > > sink
> > > > > > > project <list of names/expressions>
> > > > > > > filter <expression>
> > > > > > > source
> > > > > > > ```
> > > > > > >
> > > > > > > The key is that we know that a source takes no inputs, and so source nodes are leaf nodes.
> > > > > > > To take an example with a join, it could be something like:
> > > > > > >
> > > > > > > ```
> > > > > > > order_by_sink <sort key>
> > > > > > > hash_join <join arguments>
> > > > > > > filter <expression>
> > > > > > > source
> > > > > > > filter <expression>
> > > > > > > source
> > > > > > > ```
> > > > > > >
> > > > > > > Since we know that a join always takes two arguments, we interpret the first (filter, source) slice as the first argument and the second as the second argument. It should be noted that the current C++ code already resembles this kind of syntax; it just has much more clutter.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Sasha Krassovsky
> > > > > > >
> > > > > > > [1] https://github.com/apache/arrow/pull/14287
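To make the list-style idea concrete, here is a minimal standalone sketch (not Acero code) of the pre-order reconstruction described above: each node name has a known arity, so a flat list of names uniquely determines the tree. The node names and arities come from the examples in this thread; the data structures and function names are assumptions made purely for illustration.

```cpp
// Sketch of reconstructing a plan tree from a pre-order list of node names,
// given that each node kind has a known number of inputs (its arity).
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct PlanNode {
  std::string name;
  std::vector<std::unique_ptr<PlanNode>> inputs;
};

// Arity table: how many inputs each node kind consumes (illustrative values).
const std::map<std::string, int> kArity = {
    {"source", 0}, {"filter", 1}, {"project", 1},
    {"hash_join", 2}, {"order_by_sink", 1}, {"sink", 1}};

// Consume one node (and, recursively, its inputs) from the pre-order list.
std::unique_ptr<PlanNode> BuildTree(const std::vector<std::string>& names, size_t& pos) {
  auto node = std::make_unique<PlanNode>();
  node->name = names[pos++];
  const int arity = kArity.at(node->name);
  for (int i = 0; i < arity; ++i) {
    node->inputs.push_back(BuildTree(names, pos));
  }
  return node;
}

void Print(const PlanNode& node, int indent = 0) {
  std::cout << std::string(indent, ' ') << node.name << "\n";
  for (const auto& child : node.inputs) Print(*child, indent + 2);
}

int main() {
  // The join example from the thread, with each node's arguments omitted.
  std::vector<std::string> plan = {"order_by_sink", "hash_join",
                                   "filter", "source", "filter", "source"};
  size_t pos = 0;
  auto root = BuildTree(plan, pos);
  // The first (filter, source) slice becomes the join's first input,
  // the second slice its second input.
  Print(*root);
  return 0;
}
```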