Hey Christian,

This looks awesome! There have been a bunch of times when I've been digging around in the planner and wanted something like this, so yes, I definitely think this is useful to have.
- Gabriel

On Tue, Jul 1, 2014 at 2:16 PM, Christian Tzolov <christian.tzo...@gmail.com> wrote:

> Hi,
>
> While exploring the Crunch MR execution flow I decided to augment the excellent pipeline DOT diagram with a few additional visualizations of some interesting (for me) internal/intermediate pipeline preparation states, such as the output-pcollection-targets structure (used for the pipeline planning), the Graphs before and after the split-up of dependent GBK nodes, and the RTNode hierarchy as persisted in the Configuration before the execution of the pipeline.
> For each diagram I've plotted some relevant internals like the PType structures. The implementation hack includes 3 additional DotfileWriters hooked inside MSCRPlanner#plan() to intercept the flow.
>
> An example of the diagrams generated from the org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline is linked below.
>
> Do we need this kind of internals visualization? Something like a visualization of the logical, mapping and physical (e.g. RTNodes) plans of the pipeline preparation? What do you think?
>
> Cheers,
> Christian
>
>
> Diagrams generated from the org.apache.crunch.Breakpoint2IT#testBreakpoint() pipeline.
>
> - Dotfile containing all graphs: https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint.dot
>
>
> 1.
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_main.png
> - is the existing diagram. It provides a very well-balanced view of the pipeline, showing how the functional blocks are mapped into execution Map/Reduce components and the dependencies between them.
>
> 2.
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_pcollection_outputTargets.png
> - Visualizes the outputs (Map<PCollectionImpl<?>, Set<Target>> outputs) in the MSCRPlanner while the plan() operation is executing:
>   - Each data flow is depicted with a different color to indicate the overlapping execution paths.
>   - The PCollection name, class and PTypes are shown.
>
> 3.
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_BaseGraph.png
> - Visualizes the 'Base Graph' created in the MSCRPlanner#plan() method. It draws the Vertices with their names, pcollection and ptype. Each arc label lists the paths of the corresponding Graph edge.
>
> 4.
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_FinalGraphWithComponents.png
> - Graph created in MSCRPlanner#plan() after it splits up dependent GBK nodes and breaks the graph up into connected components, each bounded by a red dashed line.
>
> 5.
> https://dl.dropboxusercontent.com/u/79241625/share_crunch/Breakpoint2IT_testBreakpoint_RTNodesAndFormatBundles.png
> - Visualizes the RTNodes used inside the CrunchMapper and CrunchReducer as well as the Inputs and Outputs.
>   - RTNodes are deserialized from the Job's CRUNCH_WORKING_DIRECTORY/(MAP|REDUCE|COMBINE). Every RTNode is mapped to the containing Map or Reduce tasks and parent Crunch Job. The relationship between RTNodes (e.g. parent/children) is depicted with arrows.
>   - Named Outputs are deserialized from the CRUNCH_OUTPUTS into Map<String, OutputConfig> and depicted in the magenta subgraph.
>   - Inputs are deserialized from the CRUNCH_INPUTS into Map<FormatBundle, Map<Integer, List<Path>>> and depicted in the green subgraph.
>   - The inputs are mapped to the corresponding RTNode using the nodeIndex reference.
>   - Outputs are mapped to the corresponding RTNode by the Output Name references.
>   - There is no good way to print the anonymous DoFn instances.
>   - Note: the dependency between the Crunch jobs is not drawn, as it may require access to the completion hook attributes.
>   - Note: in order to draw the RTNodes I had to expose their attributes via public getters.
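
For anyone who wants a feel for the output-targets diagram (item 2 above) without opening the linked files, here is a rough, self-contained sketch of the idea: take a map of collection names to target names and emit a DOT digraph in which each collection's edges get their own color. It is not the code from this thread and it does not use Crunch's DotfileWriter or any Crunch types. The class name OutputTargetsDotSketch, the methods quote and toDot, the color palette, and the sample names in main are all made up for illustration, and plain strings stand in for PCollectionImpl and Target.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; not Crunch's DotfileWriter and not the code from this thread. */
final class OutputTargetsDotSketch {

  // Assumed fixed palette; colors repeat when there are more collections than entries.
  private static final String[] COLORS = {"red", "blue", "darkgreen", "orange", "purple"};

  /** Wraps a name in double quotes for use as a DOT ID, escaping embedded double quotes. */
  static String quote(String name) {
    return "\"" + name.replace("\"", "\\\"") + "\"";
  }

  /** Renders a map of collection name to target names as a DOT digraph, one color per collection. */
  static String toDot(Map<String, Set<String>> outputs) {
    StringBuilder sb = new StringBuilder("digraph outputs {\n");
    int i = 0;
    for (Map.Entry<String, Set<String>> entry : outputs.entrySet()) {
      String color = COLORS[i++ % COLORS.length];
      String source = quote(entry.getKey());
      // One box per collection, then one colored edge per target of that collection.
      sb.append("  ").append(source).append(" [shape=box, color=").append(color).append("];\n");
      for (String target : entry.getValue()) {
        sb.append("  ").append(source).append(" -> ").append(quote(target))
            .append(" [color=").append(color).append("];\n");
      }
    }
    return sb.append("}\n").toString();
  }

  public static void main(String[] args) {
    // Made-up collection and target names, for illustration only.
    Map<String, Set<String>> outputs = new LinkedHashMap<>();
    outputs.put("collection-1", new LinkedHashSet<>(Arrays.asList("target-a", "target-b")));
    outputs.put("collection-2", new LinkedHashSet<>(Arrays.asList("target-b")));
    System.out.print(toDot(outputs));
  }
}
```

The printed text can be saved to a file and rendered with Graphviz, for example `dot -Tpng outputs.dot -o outputs.png`. Insertion-ordered maps and sets are used so the color assignment and edge order stay stable between runs.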