From the visualization perspective I really loved the interactive runner demo where it shows the graph: https://www.youtube.com/watch?v=c5CjA1e3Cqw&t=27s
On Wed, Jun 13, 2018 at 4:36 PM Kenneth Knowles <k...@google.com> wrote:

> Another thing to consider is that we might return something like a
> "SqlPCollection" that is the PCollection<Row> plus additional metadata that
> is useful to the shell / enumerable converter (such as whether the
> PCollection has a known finite size due to LIMIT, even if it is
> "unbounded", so the shell can return control to the user once it receives
> enough rows). After your proposed change this will be much more natural to
> do, so that's another point in favor of the refactor.
>
> Kenn
>
> On Wed, Jun 13, 2018 at 10:22 AM Andrew Pilloud <apill...@google.com>
> wrote:
>
>> One of my goals is to make the graph easier to read and map back to the
>> SQL EXPLAIN output. The way the graph is currently built (`toPTransform`
>> vs `toPCollection`) does make a big difference in that graph. I also
>> think it is important to have a common function that does the apply with
>> consistent naming; that will greatly help with ease of understanding. It
>> sounds like what we really want is this in the BeamRelNode interface:
>>
>> PInput buildPInput(Pipeline pipeline);
>> PTransform<PInput, PCollection<Row>> buildPTransform();
>>
>> default PCollection<Row> toPCollection(Pipeline pipeline) {
>>   return buildPInput(pipeline).apply(getStageName(), buildPTransform());
>> }
>>
>> Andrew
>>
>> On Mon, Jun 11, 2018 at 2:27 PM Mingmin Xu <mingm...@gmail.com> wrote:
>>
>>> EXPLAIN shows the execution plan from the SQL perspective only. After
>>> converting to a Beam composite PTransform there are more steps
>>> underneath, and each runner reorganizes the Beam PTransforms again,
>>> which makes the final pipeline hard to read. In the SQL module itself, I
>>> don't see any difference between `toPTransform` and `toPCollection`. We
>>> could use easy-to-understand step names when converting RelNodes, but it
>>> is the runners that show the graph to developers.
>>>
>>> Mingmin
>>>
>>> On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud <apill...@google.com>
>>> wrote:
>>>
>>>> That sounds correct. And because each rel node might have a different
>>>> input, there isn't a standard interface (like
>>>> PTransform<PCollection<Row>, PCollection<Row>> toPTransform()).
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles <k...@google.com> wrote:
>>>>
>>>>> Agree with that. It will be kind of tricky to generalize. I think
>>>>> there are some criteria in this case that might apply in other cases:
>>>>>
>>>>> 1. Each rel node (or construct of a DSL) should have a PTransform for
>>>>> how it computes its result from its inputs.
>>>>> 2. The inputs to that PTransform should actually be the inputs to the
>>>>> rel node!
>>>>>
>>>>> So I tried to improve #1 but I probably made #2 worse.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin <ke...@google.com> wrote:
>>>>>
>>>>>> Not answering the original question, but doesn't "explain" satisfy
>>>>>> the SQL use case?
>>>>>>
>>>>>> Going forward we probably want to solve this in a more general way.
>>>>>> We have at least three ways to represent the pipeline:
>>>>>> - how the runner executes it;
>>>>>> - what it looks like when constructed;
>>>>>> - what the user described in the DSL.
>>>>>> And there will probably be more if extra layers are built on top of
>>>>>> the DSLs.
>>>>>>
>>>>>> If possible, we should be able to map any level of abstraction to any
>>>>>> other, to better understand and debug the pipelines.
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles <k...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> In other words, revert
>>>>>>> https://github.com/apache/beam/pull/4705/files, at least in spirit?
>>>>>>> I agree :-)
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud <apill...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We are currently converting the Calcite Rel tree to Beam by
>>>>>>>> recursively building a tree of nested PTransforms. This results in
>>>>>>>> a weird nested graph in the Dataflow UI, where each node contains
>>>>>>>> its inputs nested inside of it. I'm going to change the internal
>>>>>>>> data structure used for converting the tree from a PTransform to a
>>>>>>>> PCollection, which will result in a more accurate representation of
>>>>>>>> the tree structure being built and should simplify the code as
>>>>>>>> well. This will not change the public interface to SQL, which will
>>>>>>>> remain a PTransform. Any thoughts or objections?
>>>>>>>>
>>>>>>>> I was also wondering if there are tools for visualizing the Beam
>>>>>>>> graph aside from the Dataflow runner UI. What other tools exist?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>
>>>
>>>
>>> --
>>> ----
>>> Mingmin
>>>
>>
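[Editor's note: the default-method pattern proposed above can be illustrated with a small, self-contained sketch. This is plain Java with stand-in types — `Node`, `buildInput`, `stageName`, and `List<String>` in place of Beam's `BeamRelNode`, `PInput`, `getStageName`, and `PCollection<Row>` — so it compiles without a Beam dependency; it is an analogy of the proposal, not Beam API.]

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RelNodeSketch {

    /** Stand-in for the proposed BeamRelNode shape (hypothetical names). */
    interface Node {
        // Analogous to buildPInput(pipeline): produce this node's input.
        List<String> buildInput();

        // Analogous to buildPTransform(): this node's own computation,
        // kept separate from how its input is obtained.
        Function<List<String>, List<String>> buildTransform();

        // Analogous to getStageName(): a consistent, human-readable name.
        String stageName();

        // The shared default method: every node is wired up the same way,
        // so naming and application are consistent across node types.
        default List<String> toCollection() {
            return buildTransform().apply(buildInput());
        }
    }

    /** Example node that upper-cases its (hard-coded) input rows. */
    static class UppercaseNode implements Node {
        @Override public List<String> buildInput() {
            return List.of("a", "b");
        }
        @Override public Function<List<String>, List<String>> buildTransform() {
            return rows -> rows.stream()
                               .map(String::toUpperCase)
                               .collect(Collectors.toList());
        }
        @Override public String stageName() {
            return "Uppercase";
        }
    }

    public static void main(String[] args) {
        System.out.println(new UppercaseNode().toCollection()); // prints [A, B]
    }
}
```

The point of the split is the one made in the thread: the transform produced by `buildTransform()` consumes only the node's actual inputs, so a graph built this way reflects the Rel tree instead of nesting each node's inputs inside it.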