Re: Parser for ExecPlans

Weston Pace Thu, 03 Nov 2022 12:08:12 -0700

Indentation works well when you omit the other arguments (e.g. ...)
but once you mix in the arguments for the nodes (especially if those
arguments have their own indentation / structure) then it ends up
becoming unreadable I think.  I prefer the idea of each node having
it's own block, with no indentation, and using indentation purely for
argument structure.  For example (using YAML), consider the query
`SELECT n_nationkey, n_name, r_name FROM nation INNER JOIN region ON
n_regionkey = r_regionkey`.  Note, we don't have a serialization for
datasets so I'm using substrait serialization for reads.


```
project:
  expressions:
   - "!0"
   - "!1"
   - "!2"
  names:
   - "n_nationkey"
   - "n_name"
   - "r_name"

join:
  left_keys:
   - "!2"
  right_keys:
   - "!4"
  type: JOIN_TYPE_INNER

read:
  base_schema:
    names:
     - "r_regionkey"
     - "r_name"
     - "r_comment"
   struct:
     types:
      - i32?
      - string?
      - string?
  named_table:
    names:
     - "region"

read:
  base_schema:
    names:
     - "n_nationkey"
     - "n_name"
     - "n_regionkey"
     - "n_comment"
    struct:
      types:
        - i32?
        - string?
        - i32?
        - string?
  named_table:
    names:
      - "nation"
```

I feel the above is pretty reasonable once you get past the learning
curve of prefix processing to build the tree.

It's not clear that node-level indentation adds much.

```
project:
  expressions:
   - "!0"
   - "!1"
   - "!2"
  names:
   - "n_nationkey"
   - "n_name"
   - "r_name"

  join:
    left_keys:
     - "!2"
    right_keys:
     - "!4"
    type: JOIN_TYPE_INNER

    read:
      base_schema:
        names:
         - "r_regionkey"
         - "r_name"
         - "r_comment"
       struct:
         types:
          - i32?
          - string?
          - string?
      named_table:
        names:
         - "region"

    read:
      base_schema:
        names:
         - "n_nationkey"
         - "n_name"
         - "n_regionkey"
         - "n_comment"
       struct:
          types:
            - i32?
            - string?
            - i32?
            - string?
      named_table:
        names:
          - "nation"
```

And then I think adding parentheses doesn't make sense.  I suppose you
could change from YAML to something like pythons or JS's formats for
array and dict literals but I think it would be quite messy.

On Thu, Nov 3, 2022 at 11:07 AM Percy Camilo Triveño Aucahuasi
<[email protected]> wrote:
>
> Thanks Sasha!
>
> A nice advantage about parentheses is that most editors can track and
> highlight the sections between them.
> Also, those parentheses can be optional when we detect new lines (in the
> case some users don't want to deal with many parentheses); in that case, we
> would just need to ask indentation.
>
> Percy
>
>
> On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky <[email protected]>
> wrote:
>
> > Hi Percy,
> > Thanks for the input! New lines would be no problem at all, they’d just be
> > treated the same as any other whitespace. One thing to point out about the
> > function call style when written that way is that it looks a lot like the
> > list style, it’s just that there are more parentheses to keep track of,
> > though it does make it more obvious what delineates a subtree.
> >
> > Sasha
> >
> >
> > > 3 нояб. 2022 г., в 10:35, Percy Camilo Triveño Aucahuasi <
> > [email protected]> написал(а):
> > >
> > > Hi Sasha,
> > >
> > > I like the function call-style variant.  Quick question about the parser:
> > > Do you think we can parse with new lines too? that way it would be even
> > > more similar to a json-like/declarative approach and could mitigate a bit
> > > the nesting issue (which would make it easier to read as well) for
> > instance:
> > >
> > > sink(
> > >  project(
> > >    filter(
> > >      source(
> > >        …)
> > >    …)
> > >  …)
> > > …)
> > >
> > > Percy
> > >
> > >
> > >> On Tue, Oct 18, 2022 at 5:54 PM Sasha Krassovsky <
> > [email protected]>
> > >> wrote:
> > >>
> > >> Hi everyone,
> > >> We recently had some discussions about parsing expressions. I currently
> > >> have a PR [1] up for that taking into account the feedback. Next I
> > wanted
> > >> to tackle something for ExecPlans, as manually specifying one using
> > code is
> > >> currently cumbersome. I’m currently deciding between 2 variants:
> > >>
> > >> - Function call-style: This would be a similar syntax to the
> > expressions,
> > >> where we would have something along the lines of
> > >> `sink(project(filter(source(…)…)…)…)`. The problem with this syntax is
> > that
> > >> it involves tons of nesting, which although an improvement over
> > handwriting
> > >> the C++ code, is still cumbersome to write. On the other hand, this
> > syntax
> > >> is pretty intuitive and meshes well with the expression syntax. A minor
> > >> modification could be to make the last argument rather than the first be
> > >> the input to a node, which would at least keep a node’s parameters
> > >> together.
> > >>
> > >> - List style: This syntax completely eliminates nesting and would
> > probably
> > >> be easier to write but has a steeper learning curve. Essentially, since
> > we
> > >> know how many inputs each type of node takes, we can implicitly
> > reconstruct
> > >> a tree from a list of node names (formally, we are converting from/to a
> > >> pre-order traversal of the query tree). For example, it would look
> > >> something like:
> > >>
> > >> ```
> > >> sink
> > >> project <list of names/expressions>
> > >> filter <expression>
> > >> source
> > >> ```
> > >>
> > >> The key is that we know that a source takes no inputs, and so source
> > nodes
> > >> are leaf nodes. To take an example with a join, it could be something
> > like
> > >>
> > >> ```
> > >> order_by_sink <sort key>
> > >> hash_join <join arguments>
> > >> filter <expression>
> > >> source
> > >> filter <expression>
> > >> source
> > >> ```
> > >>
> > >> Since we know that a join always takes two arguments, we interpret the
> > >> first (filter source) slice as the first argument and the second as the
> > >> second argument. It should be noted that the current C++ code already
> > >> resembles this kind of syntax, it just has much more clutter.
> > >>
> > >> Thanks!
> > >> Sasha Krassovsky
> > >>
> > >> [1] https://github.com/apache/arrow/pull/14287
> >

Re: Parser for ExecPlans

Reply via email to