Re: Parser for ExecPlans

Sasha Krassovsky Thu, 03 Nov 2022 18:02:15 -0700

Hi Julian,
Thanks for the feedback. I was going to reuse the syntax we use for literals in 
the expression parser PR (still subject to change, of course). It has the form 
$datatype:value, so short/long/float would be distinguished by writing 
$int16:0, $int64:0, and $float32:0.

I feel it strikes a good balance between being precise and not being overly 
verbose. I see Substrait’s JSON does it like “literal” : { “<datatype>” : <value> }, 
which conveys the same information but is much more verbose.
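
As a rough illustration of the $datatype:value form, here is a minimal Python 
sketch of how such a literal might be parsed. The function name parse_literal, 
the set of supported types, and the range checks are assumptions made for this 
example only; they are not taken from the expression parser PR.

```python
import re

# Hypothetical table of integer types and their valid ranges (illustration only).
_INT_RANGES = {
    "int8": (-2**7, 2**7 - 1),
    "int16": (-2**15, 2**15 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
}
_FLOAT_TYPES = {"float32", "float64"}

# A literal is '$', a type name, ':', then the value text.
_LITERAL_RE = re.compile(r"\$(\w+):(\S+)")


def parse_literal(text):
    """Split '$datatype:value' into a (datatype, value) pair."""
    match = _LITERAL_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a typed literal: {text!r}")
    datatype, raw = match.groups()
    if datatype in _INT_RANGES:
        value = int(raw)
        low, high = _INT_RANGES[datatype]
        if not low <= value <= high:
            raise ValueError(f"{value} does not fit in {datatype}")
        return datatype, value
    if datatype in _FLOAT_TYPES:
        return datatype, float(raw)
    raise ValueError(f"unsupported datatype: {datatype}")
```

With this sketch, $int16:0 and $float32:0 come back tagged with different 
types, so an overloaded operator can tell them apart even though both are 
written as 0.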

Sasha 

> On Nov 3, 2022, at 16:26, Julian Hyde <jhyde.apa...@gmail.com> wrote:
> 
> When people design a language to represent a data structure, they often do a 
> poor job with literals (i.e. the constant values for each data type). And 
> that causes problems with operator overloading. I recommend that you give 
> each data type its own literal format, so you can distinguish, say, a short 0 
> from an unsigned long 0 or a 64 bit floating point 0. Same goes for composite 
> literals (e.g. a constant of type array or struct or array-of-struct) and 
> floating point numbers.
> 
>> On Nov 3, 2022, at 11:06 AM, Percy Camilo Triveño Aucahuasi 
>> <percy.camilo...@gmail.com> wrote:
>> 
>> Thanks Sasha!
>> 
>> A nice advantage about parentheses is that most editors can track and
>> highlight the sections between them.
>> Also, those parentheses can be optional when we detect new lines (in
>> case some users don't want to deal with many parentheses); in that case, we
>> would just need to require indentation.
>> 
>> Percy
>> 
>> 
>>> On Thu, Nov 3, 2022 at 12:47 PM Sasha Krassovsky <krassovskysa...@gmail.com>
>>> wrote:
>>> 
>>> Hi Percy,
>>> Thanks for the input! New lines would be no problem at all, they’d just be
>>> treated the same as any other whitespace. One thing to point out about the
>>> function call style when written that way is that it looks a lot like the
>>> list style, it’s just that there are more parentheses to keep track of,
>>> though it does make it more obvious what delineates a subtree.
>>> 
>>> Sasha
>>> 
>>> 
>>>> On Nov 3, 2022, at 10:35, Percy Camilo Triveño Aucahuasi <
>>> percy.camilo...@gmail.com> wrote:
>>>> 
>>>> Hi Sasha,
>>>> 
>>>> I like the function call-style variant.  Quick question about the parser:
>>>> Do you think we can parse with new lines too? That way it would be even
>>>> more similar to a JSON-like/declarative approach and could mitigate the
>>>> nesting issue a bit (which would make it easier to read as well), for
>>>> instance:
>>>> 
>>>> sink(
>>>> project(
>>>>  filter(
>>>>    source(
>>>>      …)
>>>>  …)
>>>> …)
>>>> …)
>>>> 
>>>> Percy
>>>> 
>>>> 
>>>>> On Tue, Oct 18, 2022 at 5:54 PM Sasha Krassovsky <
>>> krassovskysa...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> Hi everyone,
>>>>> We recently had some discussions about parsing expressions. I currently
>>>>> have a PR [1] up for that taking into account the feedback. Next I wanted
>>>>> to tackle something for ExecPlans, as manually specifying one using code is
>>>>> currently cumbersome. I’m currently deciding between 2 variants:
>>>>> 
>>>>> - Function call-style: This would be a similar syntax to the expressions,
>>>>> where we would have something along the lines of
>>>>> `sink(project(filter(source(…)…)…)…)`. The problem with this syntax is that
>>>>> it involves tons of nesting, which although an improvement over handwriting
>>>>> the C++ code, is still cumbersome to write. On the other hand, this syntax
>>>>> is pretty intuitive and meshes well with the expression syntax. A minor
>>>>> modification could be to make the last argument rather than the first be
>>>>> the input to a node, which would at least keep a node’s parameters
>>>>> together.
>>>>> 
>>>>> - List style: This syntax completely eliminates nesting and would probably
>>>>> be easier to write but has a steeper learning curve. Essentially, since we
>>>>> know how many inputs each type of node takes, we can implicitly reconstruct
>>>>> a tree from a list of node names (formally, we are converting from/to a
>>>>> pre-order traversal of the query tree). For example, it would look
>>>>> something like:
>>>>> 
>>>>> ```
>>>>> sink
>>>>> project <list of names/expressions>
>>>>> filter <expression>
>>>>> source
>>>>> ```
>>>>> 
>>>>> The key is that we know that a source takes no inputs, and so source nodes
>>>>> are leaf nodes. To take an example with a join, it could be something like
>>>>> 
>>>>> ```
>>>>> order_by_sink <sort key>
>>>>> hash_join <join arguments>
>>>>> filter <expression>
>>>>> source
>>>>> filter <expression>
>>>>> source
>>>>> ```
>>>>> 
>>>>> Since we know that a join always takes two arguments, we interpret the
>>>>> first (filter source) slice as the first argument and the second as the
>>>>> second argument. It should be noted that the current C++ code already
>>>>> resembles this kind of syntax, it just has much more clutter.
>>>>> 
>>>>> Thanks!
>>>>> Sasha Krassovsky
>>>>> 
>>>>> [1] https://github.com/apache/arrow/pull/14287
>>> 
>
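
For reference, the list-style idea from the original proposal (rebuilding the 
tree from a pre-order list of nodes, using the known number of inputs per node 
type) can be sketched roughly as below. This is only an illustration in 
Python: the ARITY table, the Node class, and the functions parse_plan and 
to_call_style are hypothetical names invented for this sketch, and the input 
counts are inferred from the examples in the thread rather than from Arrow's 
code.

```python
from dataclasses import dataclass, field

# Hypothetical input counts per node type, inferred from the thread's examples.
ARITY = {
    "source": 0,
    "filter": 1,
    "project": 1,
    "sink": 1,
    "order_by_sink": 1,
    "hash_join": 2,
}


@dataclass
class Node:
    name: str
    args: str
    inputs: list = field(default_factory=list)


def parse_plan(text):
    """Rebuild a tree from one-node-per-line text given in pre-order."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    pos = 0

    def build():
        nonlocal pos
        if pos >= len(lines):
            raise ValueError("plan ended before all inputs were supplied")
        name, _, args = lines[pos].partition(" ")
        pos += 1
        if name not in ARITY:
            raise ValueError(f"unknown node type: {name}")
        node = Node(name, args.strip())
        # Consume exactly as many subtrees as this node type has inputs.
        for _ in range(ARITY[name]):
            node.inputs.append(build())
        return node

    root = build()
    if pos != len(lines):
        raise ValueError("extra nodes after the plan was complete")
    return root


def to_call_style(node):
    """Render the tree in function call style, omitting node arguments."""
    inner = ", ".join(to_call_style(child) for child in node.inputs)
    return f"{node.name}({inner})"
```

Feeding the join example from the proposal through parse_plan and then 
to_call_style gives the equivalent nested form, which shows that the two 
syntaxes carry the same tree structure.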
