To be honest I find this YAML based representation a bit confusing due to the unclear parameters of functions. In your specific example you have a JOIN taking two sources as their inputs. But how do I know that the two sources are meant to be inputs to the join? And not only that the last source is the input? Obviously for someone knowledgeable about how Acero works, it is obvious that Join takes two inputs, but it's still a bit unclear which one those two inputs are
I agree with the point of using a easily parsable/writable language like JSON or YAML, as that would more easily allow developers to construct pipelines at runtime in their own favorite language and then compiling them to those. But at that point, aren't we reimplementing Substrait? Another thing that came to my mind, the pipeline is written in a way that fits the compiler more than a human. Humans would probably design their pipeline starting from the data source and then applying transformations to it as they think of the next step. While here you need to think backward. Obviously you can append to the top as you write your pipeline ,but that's still a bit counterintuitive. Just my two cents. On Thu, Nov 3, 2022 at 8:08 PM Weston Pace <weston.p...@gmail.com> wrote: > Indentation works well when you omit the other arguments (e.g. ...) > but once you mix in the arguments for the nodes (especially if those > arguments have their own indentation / structure) then it ends up > becoming unreadable I think. I prefer the idea of each node having > it's own block, with no indentation, and using indentation purely for > argument structure. For example (using YAML), consider the query > `SELECT n_nationkey, n_name, r_name FROM nation INNER JOIN region ON > n_regionkey = r_regionkey`. Note, we don't have a serialization for > datasets so I'm using substrait serialization for reads. > > ``` > project: > expressions: > - "!0" > - "!1" > - "!2" > names: > - "n_nationkey" > - "n_name" > - "r_name" > > join: > left_keys: > - "!2" > right_keys: > - "!4" > type: JOIN_TYPE_INNER > > read: > base_schema: > names: > - "r_regionkey" > - "r_name" > - "r_comment" > struct: > types: > - i32? > - string? > - string? > named_table: > names: > - "region" > > read: > base_schema: > names: > - "n_nationkey" > - "n_name" > - "n_regionkey" > - "n_comment" > struct: > types: > - i32? > - string? > - i32? > - string? > named_table: > names: > - "nation" > ``` > > I feel the above is pretty reasonable once you get past the learning > curve of prefix processing to build the tree. > > It's not clear that node-level indentation adds much. > > ``` > project: > expressions: > - "!0" > - "!1" > - "!2" > names: > - "n_nationkey" > - "n_name" > - "r_name" > > join: > left_keys: > - "!2" > right_keys: > - "!4" > type: JOIN_TYPE_INNER > > read: > base_schema: > names: > - "r_regionkey" > - "r_name" > - "r_comment" > struct: > types: > - i32? > - string? > - string? > named_table: > names: > - "region" > > read: > base_schema: > names: > - "n_nationkey" > - "n_name" > - "n_regionkey" > - "n_comment" > struct: > types: > - i32? > - string? > - i32? > - string? > named_table: > names: > - "nation" > ``` > > And then I think adding parentheses doesn't make sense. I suppose you > could change from YAML to something like pythons or JS's formats for > array and dict literals but I think it would be quite messy. > > On Thu, Nov 3, 2022 at 11:07 AM Percy Camilo Triveño Aucahuasi > <percy.camilo...@gmail.com> wrote: > > > > Thanks Sasha! > > > > A nice advantage about parentheses is that most editors can track and > > highlight the sections between them. > > Also, those parentheses can be optional when we detect new lines (in the > > case some users don't want to deal with many parentheses); in that case, > we > > would just need to ask indentation. > > > > Percy > > > > > > On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky < > krassovskysa...@gmail.com> > > wrote: > > > > > Hi Percy, > > > Thanks for the input! New lines would be no problem at all, they’d > just be > > > treated the same as any other whitespace. One thing to point out about > the > > > function call style when written that way is that it looks a lot like > the > > > list style, it’s just that there are more parentheses to keep track of, > > > though it does make it more obvious what delineates a subtree. > > > > > > Sasha > > > > > > > > > > 3 нояб. 2022 г., в 10:35, Percy Camilo Triveño Aucahuasi < > > > percy.camilo...@gmail.com> написал(а): > > > > > > > > Hi Sasha, > > > > > > > > I like the function call-style variant. Quick question about the > parser: > > > > Do you think we can parse with new lines too? that way it would be > even > > > > more similar to a json-like/declarative approach and could mitigate > a bit > > > > the nesting issue (which would make it easier to read as well) for > > > instance: > > > > > > > > sink( > > > > project( > > > > filter( > > > > source( > > > > …) > > > > …) > > > > …) > > > > …) > > > > > > > > Percy > > > > > > > > > > > >> On Tue, Oct 18, 2022 at 5:54 PM Sasha Krassovsky < > > > krassovskysa...@gmail.com> > > > >> wrote: > > > >> > > > >> Hi everyone, > > > >> We recently had some discussions about parsing expressions. I > currently > > > >> have a PR [1] up for that taking into account the feedback. Next I > > > wanted > > > >> to tackle something for ExecPlans, as manually specifying one using > > > code is > > > >> currently cumbersome. I’m currently deciding between 2 variants: > > > >> > > > >> - Function call-style: This would be a similar syntax to the > > > expressions, > > > >> where we would have something along the lines of > > > >> `sink(project(filter(source(…)…)…)…)`. The problem with this syntax > is > > > that > > > >> it involves tons of nesting, which although an improvement over > > > handwriting > > > >> the C++ code, is still cumbersome to write. On the other hand, this > > > syntax > > > >> is pretty intuitive and meshes well with the expression syntax. A > minor > > > >> modification could be to make the last argument rather than the > first be > > > >> the input to a node, which would at least keep a node’s parameters > > > >> together. > > > >> > > > >> - List style: This syntax completely eliminates nesting and would > > > probably > > > >> be easier to write but has a steeper learning curve. Essentially, > since > > > we > > > >> know how many inputs each type of node takes, we can implicitly > > > reconstruct > > > >> a tree from a list of node names (formally, we are converting > from/to a > > > >> pre-order traversal of the query tree). For example, it would look > > > >> something like: > > > >> > > > >> ``` > > > >> sink > > > >> project <list of names/expressions> > > > >> filter <expression> > > > >> source > > > >> ``` > > > >> > > > >> The key is that we know that a source takes no inputs, and so source > > > nodes > > > >> are leaf nodes. To take an example with a join, it could be > something > > > like > > > >> > > > >> ``` > > > >> order_by_sink <sort key> > > > >> hash_join <join arguments> > > > >> filter <expression> > > > >> source > > > >> filter <expression> > > > >> source > > > >> ``` > > > >> > > > >> Since we know that a join always takes two arguments, we interpret > the > > > >> first (filter source) slice as the first argument and the second as > the > > > >> second argument. It should be noted that the current C++ code > already > > > >> resembles this kind of syntax, it just has much more clutter. > > > >> > > > >> Thanks! > > > >> Sasha Krassovsky > > > >> > > > >> [1] https://github.com/apache/arrow/pull/14287 > > > >