Josh Wills created CRUNCH-34:
--------------------------------

             Summary: Refactor the MSCRPlanner logic
                 Key: CRUNCH-34
                 URL: https://issues.apache.org/jira/browse/CRUNCH-34
             Project: Crunch
          Issue Type: Improvement
          Components: Core
    Affects Versions: 0.3.0
            Reporter: Josh Wills
            Assignee: Josh Wills
         Attachments: PLANNER-REFACTORING.patch

I had a conversation with Robert a while back about one of the shoddier areas of 
the Crunch codebase: the planning logic. It relies on a whole bunch of mutable 
state, which makes the logic of the overall planning process incomprehensible 
to anyone except me (back when I wrote it) and Gabriel (who grokked it well 
enough to fix some bugs in it).

It turns out that understanding the planning process is actually pretty easy if 
you map the logical plan to a graph that has three kinds of vertices: Source, 
Target, and GroupByKey (GBK). All of the other nodes in the logical plan 
(primarily DoCollection/DoTable instances) make up the edges of the graph.
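To make the graph view concrete, here is a minimal sketch in Python (not the Java code in the patch). The names Kind, Vertex, Edge, and Graph, and the toy word-count-style pipeline, are hypothetical and chosen only for illustration. The sketch assumes each edge simply records the names of the DoCollection/DoTable nodes that sit between its two vertices.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    SOURCE = "source"
    TARGET = "target"
    GBK = "gbk"


@dataclass(frozen=True)
class Vertex:
    name: str
    kind: Kind


@dataclass(frozen=True)
class Edge:
    src: Vertex
    dst: Vertex
    # Names of the DoCollection/DoTable nodes on the path from src to dst.
    path: tuple = ()


@dataclass
class Graph:
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, src, dst, path=()):
        """Register both endpoints and record the edge between them."""
        self.vertices.update((src, dst))
        edge = Edge(src, dst, tuple(path))
        self.edges.append(edge)
        return edge


# Hypothetical pipeline: read -> tokenize -> groupByKey -> sum -> write
g = Graph()
source = Vertex("input.txt", Kind.SOURCE)
gbk = Vertex("gbk-1", Kind.GBK)
target = Vertex("counts", Kind.TARGET)
g.add_edge(source, gbk, ["tokenize"])
g.add_edge(gbk, target, ["sum"])
```

In this toy plan the two DoFn-style steps ("tokenize" and "sum") show up only as edge labels, while the Source, GBK, and Target are the only vertices.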

Once you take this graph perspective, you can think of the MapReduce job 
creation process entirely in terms of graph operations:

1) Walk the logical plan and construct the initial Graph object, which allows 
Edges to exist between GBK vertices.
2) Build a new graph that is identical to the first one, except it eliminates 
Edges between GBK vertices by constructing additional Source and Target 
vertices.
3) Identify all of the (weakly) connected components of the new graph.
4) Construct MapReduce jobs out of the connected components: a map-only job 
when there is no GBK node in the component, a MapReduce job when there is 
exactly one, or a fusion job when there is more than one.
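Steps 2 through 4 can be sketched roughly as follows. This is an illustrative Python sketch under simplifying assumptions, not the planner code from the patch: vertices are plain (kind, name) tuples, edges are (from, to) pairs, the "tmp-N" names for intermediate outputs are made up, and the dependency between the job that writes an intermediate output and the job that reads it is not tracked.

```python
from collections import defaultdict

SOURCE, TARGET, GBK = "source", "target", "gbk"


def split_gbk_edges(edges):
    """Step 2: replace each GBK -> GBK edge with GBK -> Target and Source -> GBK."""
    out = []
    for i, (u, v) in enumerate(edges):
        if u[0] == GBK and v[0] == GBK:
            tmp = f"tmp-{i}"  # hypothetical name for the intermediate output
            out.append((u, (TARGET, tmp)))
            out.append(((SOURCE, tmp), v))
        else:
            out.append((u, v))
    return out


def weak_components(edges):
    """Step 3: weakly connected components, i.e. ignoring edge direction."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            comp.add(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(comp)
    return components


def job_type(component):
    """Step 4: classify a component by how many GBK vertices it contains."""
    n = sum(1 for kind, _ in component if kind == GBK)
    if n == 0:
        return "map-only"
    return "mapreduce" if n == 1 else "fused"


# Hypothetical plan: in -> g1 -> g2 -> out, plus an independent logs -> filtered.
edges = [
    ((SOURCE, "in"), (GBK, "g1")),
    ((GBK, "g1"), (GBK, "g2")),
    ((GBK, "g2"), (TARGET, "out")),
    ((SOURCE, "logs"), (TARGET, "filtered")),
]
jobs = sorted(job_type(c) for c in weak_components(split_gbk_edges(edges)))
# jobs == ["map-only", "mapreduce", "mapreduce"]
```

Splitting the g1 -> g2 edge is what separates the chained GBKs into two components, since the new Target and Source vertices are distinct even though they refer to the same intermediate output. A component ends up with more than one GBK only when the GBKs are connected through something other than a GBK-to-GBK edge, for example two GBKs reading from the same Source.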

I've been working on this off and on for a couple of weeks, and I have a 
version of the planning code that implements the description above and passes 
all of our tests. There are still places where we have mutable state that will 
need to be cleaned up, but I think this is a step in the right direction. I'm 
not sure it's ready for prime-time yet, but I wanted to get the conversation 
started.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
