From the visualization perspective I really loved the interactive runner demo where it shows the graph: https://www.youtube.com/watch?v=c5CjA1e3Cqw&t=27s
On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles <k...@google.com> wrote:

> Another thing to consider is that we might return something like a
> "SqlPCollection" that is the PCollection<Row> plus additional metadata that
> is useful to the shell / enumerable converter (such as whether the
> PCollection has a known finite size due to LIMIT, even if it is
> "unbounded", so the shell can return control to the user once it receives
> enough rows). After your proposed change this will be much more natural to
> do, so that's another point in favor of the refactor.
>
> Kenn
>
> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> One of my goals is to make the graph easier to read and map back to the
>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform`
>> vs `toPCollection`) does make a big difference in that graph. I also
>> think it is important to have a common function that does the apply with
>> consistent naming; that will greatly help with ease of understanding. It
>> sounds like what we really want is this in the BeamRelNode interface:
>>
>> PInput buildPInput(Pipeline pipeline);
>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>
>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>   return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>> }
>>
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <mingm...@gmail.com> wrote:
>>
>>> EXPLAIN shows the execution plan from the SQL perspective only. After
>>> converting to a Beam composite PTransform there are more steps
>>> underneath, and each runner reorganizes the Beam PTransforms again,
>>> which makes the final pipeline hard to read. In the SQL module itself, I
>>> don't see any difference between `toPTransform` and `toPCollection`. We
>>> could use easy-to-understand step names when converting RelNodes, but it
>>> is the runners that show the graph to developers.
>>>
>>> Mingmin
>>>
>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>>> That sounds correct. And because each rel node might have a different
>>>> input, there isn't a standard interface (like
>>>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform()).
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <k...@google.com> wrote:
>>>>
>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>> there are some criteria in this case that might apply in other cases:
>>>>>
>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>> how it computes its result from its inputs.
>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>> rel node!
>>>>>
>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:
>>>>>
>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>> the SQL use case?
>>>>>>
>>>>>> Going forward we probably want to solve this in a more general way.
>>>>>> We have at least three ways to represent the pipeline:
>>>>>> - how the runner executes it;
>>>>>> - what it looks like when constructed;
>>>>>> - what the user described in the DSL.
>>>>>> And there will probably be more if extra layers are built on top of
>>>>>> the DSLs.
>>>>>>
>>>>>> If possible, we should be able to map any level of abstraction to any
>>>>>> other, to better understand and debug the pipelines.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <k...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In other words, revert
>>>>>>> https://github.com/apache/beam/pull/4705/files, at least in spirit?
>>>>>>> I agree :-)
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <apill...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>>> recursively building a tree of nested PTransforms. This results in
>>>>>>>> a weird nested graph in the Dataflow UI, where each node contains
>>>>>>>> its inputs nested inside of it. I'm going to change the internal
>>>>>>>> data structure used for converting the tree from a PTransform to a
>>>>>>>> PCollection, which will result in a more accurate representation of
>>>>>>>> the tree structure being built and should simplify the code as
>>>>>>>> well. This will not change the public interface to SQL, which will
>>>>>>>> remain a PTransform. Any thoughts or objections?
>>>>>>>>
>>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>>> graph aside from the Dataflow runner UI. What other tools exist?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>
>>>
>>>
>>> --
>>> ----
>>> Mingmin
>>>
>>
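[Editor's note: the default-method pattern proposed above can be illustrated with a small, self-contained sketch. This is plain Java with stand-in types — `Node`, `buildInput`, `stageName`, and `List<String>` in place of Beam's `BeamRelNode`, `PInput`, `getStageName`, and `PCollection<Row>` — so it compiles without a Beam dependency; it is an analogy of the proposal, not Beam API.]

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RelNodeSketch {

    /** Stand-in for the proposed BeamRelNode shape (hypothetical names). */
    interface Node {
        // Analogous to buildPInput(pipeline): produce this node's input.
        List<String> buildInput();

        // Analogous to buildPTransform(): this node's own computation,
        // kept separate from how its input is obtained.
        Function<List<String>, List<String>> buildTransform();

        // Analogous to getStageName(): a consistent, human-readable name.
        String stageName();

        // The shared default method: every node is wired up the same way,
        // so naming and application are consistent across node types.
        default List<String> toCollection() {
            return buildTransform().apply(buildInput());
        }
    }

    /** Example node that upper-cases its (hard-coded) input rows. */
    static class UppercaseNode implements Node {
        @Override public List<String> buildInput() {
            return List.of("a", "b");
        }
        @Override public Function<List<String>, List<String>> buildTransform() {
            return rows -> rows.stream()
                               .map(String::toUpperCase)
                               .collect(Collectors.toList());
        }
        @Override public String stageName() {
            return "Uppercase";
        }
    }

    public static void main(String[] args) {
        System.out.println(new UppercaseNode().toCollection()); // prints [A, B]
    }
}
```

The point of the split is the one made in the thread: the transform produced by `buildTransform()` consumes only the node's actual inputs, so a graph built this way reflects the Rel tree instead of nesting each node's inputs inside it.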