Re: Generating DOT files for Crunch job plans

Gabriel Reid Sat, 27 Oct 2012 12:41:27 -0700

On 27 Oct 2012, at 15:39, Matthias Friedrich wrote:

> Hi,
> 
> On Saturday, 2012-10-27, Gabriel Reid wrote:
>> In the few times that I've debugged issues in the planner in Crunch,
>> it always takes me a bit of time to figure out (again) how things
>> work there. I've been thinking/planning of writing some more inline
>> docs and doing a bit of refactoring in the code to help myself (and
>> others) with doing this in the future, but something else that I was
>> thinking of was the generation of DOT[1] files for pipelines so that
>> it's easier to visualize what's going on.
> 
> That's a great idea, it will help to win over prospective users who
> wonder whether Crunch performs as well as a sequence of hand-written
> MR jobs.
> 
> There are other ways in Java to generate graphs, BTW, but from my
> experience none of them produces output that matches dot/graphviz. In
> my opinion we shouldn't run dot ourselves though, because most users
> don't have dot installed. Just generate the output and let users call
> dot themselves.


Yes, graphviz/dot also have the advantage of being pretty ubiquitous.

I definitely agree on not running dot ourselves -- the main point to me
for now is just making the information available to anyone who's 
interested in it.
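
As a rough illustration of "just generate the output", here is a minimal sketch that renders a small graph as DOT text without invoking dot. The class name, the method names, and the node names are hypothetical, and a plain adjacency map stands in for whatever graph structure the planner actually uses.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Crunch: turns an adjacency map into DOT text.
class PlanDotSketch {

    /** Renders the edges as a DOT digraph, one "a" -> "b"; line per edge. */
    static String toDot(String graphName, Map<String, List<String>> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(quote(graphName)).append(" {\n");
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            for (String target : entry.getValue()) {
                sb.append("  ").append(quote(entry.getKey()))
                  .append(" -> ").append(quote(target)).append(";\n");
            }
        }
        sb.append("}\n");
        return sb.toString();
    }

    /** Wraps an ID in double quotes, escaping any embedded double quotes. */
    static String quote(String id) {
        return "\"" + id.replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Made-up node names, chosen only to resemble a simple pipeline.
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("input", Arrays.asList("parallelDo"));
        edges.put("parallelDo", Arrays.asList("groupByKey"));
        edges.put("groupByKey", Arrays.asList("output"));
        System.out.print(toDot("plan", edges));
    }
}
```

Anyone who has graphviz installed could then save the text to a file and render it themselves, for example with `dot -Tpng plan.dot -o plan.png`.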

> 
>> I'm sure that functionality like this can be useful (at least to me,
>> as I was just using it in a somewhat ad-hoc way to debug
>> CRUNCH-102), but I'm not sure if this is something we want to expose
>> easily, or keep pretty hidden to just use for debugging. I believe
>> Pig provides this same functionality with the "explain" command.
> 
>> Any thoughts on adding this, particularly around how we could/should
>> expose it in the API?
> 
> I think we should make it available for users and make it really easy
> to access it. I'm not sure about the API, though. Since it's really
> cheap to create we could always generate dot output, store it inside
> the Configuration instance and provide a static utility class to
> access it? A while ago we discussed moving debugging/log4j manipulation
> logic out of the MRPipeline; perhaps we can use a single CrunchDebug
> utility for both.

I really like the idea of sticking the dot information in the Configuration. In
fact, one of the (several) issues I had before you mentioned that was that
there are actually a few different graphs built up during the planning phase,
and it would be interesting to have access to all of them. Putting them into
a Configuration will resolve that.

I guess we don't need to worry about the API too much for now; if we just
populate the information in the Configuration, we can see how (or if) we
need to make a specific API around it when we get to that point.
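
To make the idea concrete, here is a minimal sketch of keeping several DOT graphs side by side under separate keys. Everything in it is an assumption for illustration: the key prefix, the graph names, and the use of a plain Map in place of Hadoop's Configuration (where the equivalent calls would be conf.set(key, value) and conf.get(key)).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a Map<String, String> stands in for a Hadoop Configuration.
class DotConfigSketch {

    // Hypothetical key prefix; not an actual Crunch configuration key.
    static final String KEY_PREFIX = "example.plan.dot.";

    /** Stores one DOT graph under a key derived from its name. */
    static void storeDot(Map<String, String> conf, String graphName, String dot) {
        conf.put(KEY_PREFIX + graphName, dot);
    }

    /** Returns all stored DOT graphs, keyed by graph name, ignoring other settings. */
    static Map<String, String> storedDots(Map<String, String> conf) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : conf.entrySet()) {
            if (entry.getKey().startsWith(KEY_PREFIX)) {
                result.put(entry.getKey().substring(KEY_PREFIX.length()), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("some.other.setting", "42");
        // "initial" and "final" are made-up names for two planning-phase graphs.
        storeDot(conf, "initial", "digraph \"initial\" {\n}\n");
        storeDot(conf, "final", "digraph \"final\" {\n}\n");
        for (Map.Entry<String, String> entry : storedDots(conf).entrySet()) {
            System.out.println(entry.getKey() + ":\n" + entry.getValue());
        }
    }
}
```

One key per graph keeps each planning-phase graph retrievable on its own, which is the property described above; whether a dedicated API is worth adding on top can be decided later.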

- Gabriel
