+1 for this idea. We could visualize the plan as JSON, but JSON is so
nested and lengthy that it would be hard to show, or actually publish, a
very complex query.
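
To give a rough sense of how quickly a nested JSON plan grows, here is a
minimal Python sketch. The dict layout (`node`, `args`, `inputs`) is a
made-up encoding for illustration only, not Acero's or Substrait's actual
JSON schema.

```python
import json

# Hypothetical nested encoding of a small join plan (illustration only;
# this is NOT the real Acero or Substrait JSON schema).
plan = {
    "node": "project",
    "args": {"expressions": ["!0", "!1", "!2"]},
    "inputs": [
        {
            "node": "hash_join",
            "args": {"left_keys": ["!2"], "right_keys": ["!4"]},
            "inputs": [
                {"node": "source", "args": {"table": "nation"}, "inputs": []},
                {"node": "source", "args": {"table": "region"}, "inputs": []},
            ],
        }
    ],
}


def count_nodes(node):
    """Number of plan nodes in the tree rooted at `node`."""
    return 1 + sum(count_nodes(child) for child in node["inputs"])


def max_depth(node, depth=1):
    """Number of nodes on the longest root-to-leaf path."""
    return max([depth] + [max_depth(child, depth + 1) for child in node["inputs"]])


# Pretty-printed JSON spends one line per key, list item and bracket.
json_lines = json.dumps(plan, indent=2).splitlines()
print(f"{count_nodes(plan)} nodes, depth {max_depth(plan)}, "
      f"{len(json_lines)} lines of pretty-printed JSON")
```

Even with the read schemas left out, this four-node plan needs many more
lines of JSON than it has nodes, and the gap widens with every extra level
of nesting.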

As Weston suggested, our Substrait test file is lengthy at the moment
mainly because so much of its space is taken up by these JSON plans.

And if this becomes a successful feature, I guess Substrait could adopt it
too.

On Thu, Nov 10, 2022 at 7:04 AM Weston Pace <weston.p...@gmail.com> wrote:

> > To be honest I find this YAML based representation a bit confusing due to
> > the unclear parameters of functions.
>
> To be fair the YAML was my fault :).  I don't think Sasha has shown us
> yet what the format for relation parameters looks like.
>
> > Obviously for someone knowledgeable about how Acero works, it is obvious
> > that Join takes two inputs, but it's still a bit unclear which ones those
> > two inputs are
>
> I suppose it will depend on what the use case is.  If this is a format
> for experts in the tool to compactly write queries then I think
> this is ok.  For example, my biggest desire for a feature like this
> would be to simplify a lot of the Substrait tests that we have (e.g.
> serde_test.cc) which, at the moment, consist of a lot of hard-coded
> and unreadable JSON.
>
> > as that would more easily allow developers to construct
> > pipelines at runtime in their own favorite language and then compile
> > them to those. But at that point, aren't we reimplementing Substrait?
>
> Substrait does not have a text (e.g. human readable and human
> writable) serialization.  If anything this is reimplementing SQL,
> which is what people seem to have been using so far to author
> Substrait, but which is fairly limited in its ability to represent
> Substrait (e.g. specifying desired corner case behavior).  If I
> understand correctly, this proposal targets Acero rather than creating
> a text representation of Substrait but, if successful, I think it would
> represent a pretty good prototype / example that we could then adopt
> for Substrait.  That being said, it might be interesting to get
> feedback from that community as well, to see if they have any requests
> or ideas.
>
> > Humans would probably design their pipeline starting from the data source
> > and then applying transformations to it as they think of the next step.
>
> +1
>
> On Tue, Nov 8, 2022 at 6:06 AM Alessandro Molina
> <alessan...@voltrondata.com.invalid> wrote:
> >
> > To be honest I find this YAML based representation a bit confusing due to
> > the unclear parameters of functions.
> > In your specific example you have a JOIN taking two sources as its
> > inputs.
> > But how do I know that the two sources are meant to be inputs to the
> > join, and not only that the last source is the input?
> > Obviously for someone knowledgeable about how Acero works, it is obvious
> > that Join takes two inputs, but it's still a bit unclear which ones those
> > two inputs are
> >
> > I agree with the point of using an easily parsable/writable language like
> > JSON or YAML, as that would more easily allow developers to construct
> > pipelines at runtime in their own favorite language and then compile
> > them to those. But at that point, aren't we reimplementing Substrait?
> >
> > Another thing that came to my mind, the pipeline is written in a way that
> > fits the compiler more than a human.
> > Humans would probably design their pipeline starting from the data source
> > and then applying transformations to it as they think of the next step.
> > While here you need to think backward. Obviously you can append to the
> > top as you write your pipeline, but that's still a bit counterintuitive.
> >
> > Just my two cents.
> >
> >
> >
> > On Thu, Nov 3, 2022 at 8:08 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > > Indentation works well when you omit the other arguments (e.g. ...)
> > > but once you mix in the arguments for the nodes (especially if those
> > > arguments have their own indentation / structure) then it ends up
> > > becoming unreadable I think.  I prefer the idea of each node having
> > > its own block, with no indentation, and using indentation purely for
> > > argument structure.  For example (using YAML), consider the query
> > > `SELECT n_nationkey, n_name, r_name FROM nation INNER JOIN region ON
> > > n_regionkey = r_regionkey`.  Note, we don't have a serialization for
> > > datasets so I'm using substrait serialization for reads.
> > >
> > > ```
> > > project:
> > >   expressions:
> > >    - "!0"
> > >    - "!1"
> > >    - "!2"
> > >   names:
> > >    - "n_nationkey"
> > >    - "n_name"
> > >    - "r_name"
> > >
> > > join:
> > >   left_keys:
> > >    - "!2"
> > >   right_keys:
> > >    - "!4"
> > >   type: JOIN_TYPE_INNER
> > >
> > > read:
> > >   base_schema:
> > >     names:
> > >      - "r_regionkey"
> > >      - "r_name"
> > >      - "r_comment"
> > >     struct:
> > >       types:
> > >        - i32?
> > >        - string?
> > >        - string?
> > >   named_table:
> > >     names:
> > >      - "region"
> > >
> > > read:
> > >   base_schema:
> > >     names:
> > >      - "n_nationkey"
> > >      - "n_name"
> > >      - "n_regionkey"
> > >      - "n_comment"
> > >     struct:
> > >       types:
> > >         - i32?
> > >         - string?
> > >         - i32?
> > >         - string?
> > >   named_table:
> > >     names:
> > >       - "nation"
> > > ```
> > >
> > > I feel the above is pretty reasonable once you get past the learning
> > > curve of prefix processing to build the tree.
> > >
> > > It's not clear that node-level indentation adds much.
> > >
> > > ```
> > > project:
> > >   expressions:
> > >    - "!0"
> > >    - "!1"
> > >    - "!2"
> > >   names:
> > >    - "n_nationkey"
> > >    - "n_name"
> > >    - "r_name"
> > >
> > >   join:
> > >     left_keys:
> > >      - "!2"
> > >     right_keys:
> > >      - "!4"
> > >     type: JOIN_TYPE_INNER
> > >
> > >     read:
> > >       base_schema:
> > >         names:
> > >          - "r_regionkey"
> > >          - "r_name"
> > >          - "r_comment"
> > >         struct:
> > >           types:
> > >            - i32?
> > >            - string?
> > >            - string?
> > >       named_table:
> > >         names:
> > >          - "region"
> > >
> > >     read:
> > >       base_schema:
> > >         names:
> > >          - "n_nationkey"
> > >          - "n_name"
> > >          - "n_regionkey"
> > >          - "n_comment"
> > >         struct:
> > >           types:
> > >             - i32?
> > >             - string?
> > >             - i32?
> > >             - string?
> > >       named_table:
> > >         names:
> > >           - "nation"
> > > ```
> > >
> > > And then I think adding parentheses doesn't make sense.  I suppose you
> > > could change from YAML to something like Python's or JS's formats for
> > > array and dict literals but I think it would be quite messy.
> > >
> > > On Thu, Nov 3, 2022 at 11:07 AM Percy Camilo Triveño Aucahuasi
> > > <percy.camilo...@gmail.com> wrote:
> > > >
> > > > Thanks Sasha!
> > > >
> > > > A nice advantage about parentheses is that most editors can track and
> > > > highlight the sections between them.
> > > > Also, those parentheses can be optional when we detect new lines (in
> > > > case some users don't want to deal with many parentheses); in that
> > > > case, we would just need to ask for indentation.
> > > >
> > > > Percy
> > > >
> > > >
> > > > On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky
> > > > <krassovskysa...@gmail.com> wrote:
> > > >
> > > > > Hi Percy,
> > > > > Thanks for the input! New lines would be no problem at all, they’d
> > > > > just be treated the same as any other whitespace. One thing to point
> > > > > out about the function call style when written that way is that it
> > > > > looks a lot like the list style, it’s just that there are more
> > > > > parentheses to keep track of, though it does make it more obvious
> > > > > what delineates a subtree.
> > > > >
> > > > > Sasha
> > > > >
> > > > >
> > > > > > On Nov 3, 2022, at 10:35, Percy Camilo Triveño Aucahuasi
> > > > > > <percy.camilo...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Sasha,
> > > > > >
> > > > > > I like the function call-style variant.  Quick question about the
> > > > > > parser: Do you think we can parse with new lines too? That way it
> > > > > > would be even more similar to a JSON-like/declarative approach and
> > > > > > could mitigate the nesting issue a bit (which would make it easier
> > > > > > to read as well), for instance:
> > > > > >
> > > > > > sink(
> > > > > >  project(
> > > > > >    filter(
> > > > > >      source(
> > > > > >        …)
> > > > > >    …)
> > > > > >  …)
> > > > > > …)
> > > > > >
> > > > > > Percy
> > > > > >
> > > > > >
> > > > > > >> On Tue, Oct 18, 2022 at 5:54 PM Sasha Krassovsky
> > > > > > >> <krassovskysa...@gmail.com> wrote:
> > > > > >>
> > > > > >> Hi everyone,
> > > > > > >> We recently had some discussions about parsing expressions. I
> > > > > > >> currently have a PR [1] up for that taking into account the
> > > > > > >> feedback. Next I wanted to tackle something for ExecPlans, as
> > > > > > >> manually specifying one using code is currently cumbersome. I’m
> > > > > > >> currently deciding between 2 variants:
> > > > > >>
> > > > > > >> - Function call-style: This would be a similar syntax to the
> > > > > > >> expressions, where we would have something along the lines of
> > > > > > >> `sink(project(filter(source(…)…)…)…)`. The problem with this
> > > > > > >> syntax is that it involves tons of nesting, which although an
> > > > > > >> improvement over handwriting the C++ code, is still cumbersome
> > > > > > >> to write. On the other hand, this syntax is pretty intuitive and
> > > > > > >> meshes well with the expression syntax. A minor modification
> > > > > > >> could be to make the last argument rather than the first be the
> > > > > > >> input to a node, which would at least keep a node’s parameters
> > > > > > >> together.
> > > > > >>
> > > > > > >> - List style: This syntax completely eliminates nesting and
> > > > > > >> would probably be easier to write but has a steeper learning
> > > > > > >> curve. Essentially, since we know how many inputs each type of
> > > > > > >> node takes, we can implicitly reconstruct a tree from a list of
> > > > > > >> node names (formally, we are converting from/to a pre-order
> > > > > > >> traversal of the query tree). For example, it would look
> > > > > > >> something like:
> > > > > >>
> > > > > >> ```
> > > > > >> sink
> > > > > >> project <list of names/expressions>
> > > > > >> filter <expression>
> > > > > >> source
> > > > > >> ```
> > > > > >>
> > > > > > >> The key is that we know that a source takes no inputs, and so
> > > > > > >> source nodes are leaf nodes. To take an example with a join, it
> > > > > > >> could be something like
> > > > > >>
> > > > > >> ```
> > > > > >> order_by_sink <sort key>
> > > > > >> hash_join <join arguments>
> > > > > >> filter <expression>
> > > > > >> source
> > > > > >> filter <expression>
> > > > > >> source
> > > > > >> ```
> > > > > >>
> > > > > > >> Since we know that a join always takes two arguments, we
> > > > > > >> interpret the first (filter source) slice as the first argument
> > > > > > >> and the second as the second argument. It should be noted that
> > > > > > >> the current C++ code already resembles this kind of syntax, it
> > > > > > >> just has much more clutter.
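
As an aside, the reconstruction described above can be sketched in a few
lines of Python. This is only an illustration, not the proposed parser: the
`ARITY` table, the `Node` class and both helper functions are hypothetical
names, and the input counts are just the ones implied by the examples in
this thread.

```python
from dataclasses import dataclass, field

# Assumed number of inputs per node type, inferred from the examples in
# this thread (not taken from Acero itself).
ARITY = {"source": 0, "filter": 1, "project": 1, "sink": 1,
         "order_by_sink": 1, "hash_join": 2}


@dataclass
class Node:
    name: str
    args: str = ""
    inputs: list = field(default_factory=list)


def parse_list_style(text):
    """Rebuild a plan tree from a pre-order list of 'name <args>' lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pos = 0

    def parse_node():
        nonlocal pos
        if pos >= len(lines):
            raise ValueError("plan ended while a node still expected an input")
        name, _, args = lines[pos].partition(" ")
        if name not in ARITY:
            raise ValueError(f"unknown node type: {name}")
        pos += 1
        node = Node(name, args.strip())
        # Each node consumes exactly as many following subtrees as it has inputs.
        for _ in range(ARITY[name]):
            node.inputs.append(parse_node())
        return node

    root = parse_node()
    if pos != len(lines):
        raise ValueError("trailing nodes after the plan was complete")
    return root


def to_call_style(node):
    """Render the tree in the function call-style variant (arguments omitted)."""
    return f"{node.name}({', '.join(to_call_style(c) for c in node.inputs)})"


plan = parse_list_style("""
order_by_sink <sort key>
hash_join <join arguments>
filter <expression>
source
filter <expression>
source
""")
print(to_call_style(plan))
```

Run on the join example, this prints
`order_by_sink(hash_join(filter(source()), filter(source())))`, so the two
variants describe the same tree and differ only in how the nesting is
written down.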
> > > > > >>
> > > > > >> Thanks!
> > > > > >> Sasha Krassovsky
> > > > > >>
> > > > > >> [1] https://github.com/apache/arrow/pull/14287
> > > > >
> > >
>
