Yes please.

On Jul 2, 2014 10:21 AM, "Christian Tzolov" <christian.tzo...@gmail.com> wrote:
> cool :) What is the best way to continue? Open a new Jira ticket for it?
>
> On Tue, Jul 1, 2014 at 3:22 PM, Josh Wills <jwi...@cloudera.com> wrote:
>
> > +1-- very cool. :)
> >
> > On Tue, Jul 1, 2014 at 5:28 AM, Gabriel Reid <gabriel.r...@gmail.com> wrote:
> >
> > > Hey Christian,
> > >
> > > This looks awesome! There have been a bunch of times when I've been
> > > digging around in the planner and wanting to have something like this,
> > > so yes, I definitely think this is useful to have.
> > >
> > > - Gabriel
> > >
> > > On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov
> > > <christian.tzo...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > While exploring the Crunch MR execution flow I decided to augment the
> > > > excellent pipeline DOT diagram with a few additional visualizations of
> > > > some interesting (for me) internal/intermediate pipeline preparation
> > > > states, such as the output-pcollection-targets structure (used for the
> > > > pipeline planning), the graphs before and after the split-up of
> > > > dependent GBK nodes, and the RTNode hierarchy as persisted in the
> > > > Configuration before the execution of the pipeline.
> > > > For each diagram I've plotted some relevant internals like the PType
> > > > structures. The implementation hack includes 3 additional
> > > > DotfileWriters hooked inside MSCRPlanner#plan() to intercept the flow.
> > > >
> > > > An example of the diagrams generated from the
> > > > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked
> > > > below.
> > > >
> > > > Do we need such internals visualization? Something like a visualization
> > > > of the logical, mapping and physical (e.g. RTNodes) plans of the
> > > > pipeline preparation? What do you think?
> > > >
> > > > Cheers,
> > > > Christian
> > > >
> > > >
> > > > Diagrams generated from the
> > > > org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
> > > >
> > > > - Dotfile containing all graphs:
> > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot
> > > >
> > > > 1.
> > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
> > > > - is the existing diagram. It provides a very well balanced view of the
> > > > pipeline, showing how the functional blocks are mapped into execution
> > > > Map/Reduce components and the dependencies between them.
> > > >
> > > > 2.
> > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
> > > > - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs)
> > > > in the MSCRPlanner when the plan() operation is executed:
> > > > - Each data flow is depicted with a different color to indicate the
> > > > overlapping execution paths.
> > > > - The PCollection name, class and PTypes are shown.
> > > >
> > > > 3.
> > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
> > > > - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method.
> > > > It draws the Vertices with their names, pcollection and ptype. The arc
> > > > label lists the Graph's edge path lists.
> > > >
> > > > 4.
> > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
> > > > - Graph created in MSCRPlanner#plan() after it splits up the dependent
> > > > GBK nodes and breaks the graph up into connected components, each
> > > > bounded by a red dashed line.
> > > >
> > > > 5.
> > > > https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
> > > > - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer
> > > > as well as the Inputs and Outputs.
> > > > - RTNodes are deserialized from the Job's
> > > > CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped
> > > > to the containing Map or Reduce tasks and parent Crunch Job. The
> > > > relationship between RTNodes (e.g. parent/children) is depicted with
> > > > arrows.
> > > > - Named Outputs are deserialized from the CRUNCH_OUTPUTS into
> > > > Map<String, OutputConfig> and depicted in the magenta subgraph.
> > > > - Inputs are deserialized from the CRUNCH_INPUTS into
> > > > Map<FormatBundle, Map<Integer, List<Path>>> and depicted in the green
> > > > subgraph.
> > > > - The inputs are mapped to the corresponding RTNode using the nodeIndex
> > > > reference.
> > > > - Outputs are mapped to the corresponding RTNode by the Output Name
> > > > references.
> > > > - There is no good way to print the anonymous DoFn instances.
> > > > - Note: the dependency between the crunch jobs is not drawn as it may
> > > > require access to the completion hook attributes.
> > > > - Note: in order to draw the RTNodes I had to expose their attributes
> > > > via public getters.
> > > >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
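
For anyone following the thread who has not written DOT by hand: the diagrams discussed above group nodes into bordered subgraphs and color the edges per data flow. Below is a minimal sketch of that idea in plain Java. It does not use Crunch's DotfileWriter or any Crunch class; PlanDotSketch, its methods, and the node and component names are all hypothetical, chosen only for illustration. The dashed red cluster border mirrors the description of diagram 4, on the assumption that "red dashed line" is what was meant.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Crunch. */
final class PlanDotSketch {
    // Component label -> node names, kept in insertion order for stable output.
    private final Map<String, List<String>> components = new LinkedHashMap<>();
    // Each edge is {from, to, color}.
    private final List<String[]> edges = new ArrayList<>();

    PlanDotSketch addNode(String component, String name) {
        components.computeIfAbsent(component, k -> new ArrayList<>()).add(name);
        return this;
    }

    PlanDotSketch addEdge(String from, String to, String color) {
        edges.add(new String[] {from, to, color});
        return this;
    }

    // DOT accepts double-quoted IDs; escape backslashes and embedded quotes.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    String render(String graphName) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(quote(graphName)).append(" {\n");
        int index = 0;
        for (Map.Entry<String, List<String>> entry : components.entrySet()) {
            // Graphviz draws a border only around subgraphs named "cluster...".
            sb.append("  subgraph cluster_").append(index++).append(" {\n");
            sb.append("    label=").append(quote(entry.getKey())).append(";\n");
            sb.append("    style=dashed; color=red;\n");
            for (String node : entry.getValue()) {
                sb.append("    ").append(quote(node)).append(";\n");
            }
            sb.append("  }\n");
        }
        for (String[] edge : edges) {
            sb.append("  ").append(quote(edge[0])).append(" -> ").append(quote(edge[1]))
              .append(" [color=").append(quote(edge[2])).append("];\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String dot = new PlanDotSketch()
            .addNode("Component 0", "S0")
            .addNode("Component 0", "GBK")
            .addNode("Component 1", "S1")
            .addEdge("S0", "GBK", "blue")
            .addEdge("GBK", "S1", "green")
            .render("plan");
        System.out.print(dot);
    }
}
```

Saving the printed text as plan.dot and running `dot -Tpng plan.dot -o plan.png` with Graphviz installed should produce an image with one dashed box per component. A real implementation would walk the planner's own graph structures instead of taking names as strings.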