On 27 Oct 2012, at 15:39, Matthias Friedrich <[email protected]> wrote:

> Hi,
>
> On Saturday, 2012-10-27, Gabriel Reid wrote:
>> In the few times that I've debugged issues in the planner in Crunch,
>> it always takes me a bit of time to figure out (again) how things
>> work there. I've been thinking/planning of writing some more inline
>> docs and doing a bit of refactoring in the code to help myself (and
>> others) with doing this in the future, but something else that I was
>> thinking of was the generation of DOT[1] files for pipelines so that
>> it's easier to visualize what's going on.
>
> That's a great idea, it will help to win prospective users over who
> wonder whether Crunch performs as well as a sequence of hand-written
> MR jobs.
>
> There are other ways in Java to generate graphs, BTW, but from my
> experience none of them produces output that matches dot/graphviz. In
> my opinion we shouldn't run dot ourselves though, because most users
> don't have dot installed. Just generate the output and let users call
> dot themselves.
Yes, graphviz/dot also has the advantage of being pretty ubiquitous. I
definitely agree on not running dot ourselves -- the main point to me
for now is just making the information available to anyone who's
interested in it.

>
>> I'm sure that functionality like this can be useful (at least to me,
>> as I was just using it in a somewhat ad-hoc way to debug
>> CRUNCH-102), but I'm not sure if this is something we want to expose
>> easily, or keep pretty hidden to just use for debugging. I believe
>> Pig provides this same functionality with the "explain" command.
>
>> Any thoughts on adding this, particularly around how we could/should
>> expose it in the API?
>
> I think we should make it available for users and make it really easy
> to access it. I'm not sure about the API, though. Since it's really
> cheap to create we could always generate dot output, store it inside
> the Configuration instance and provide a static utility class to
> access it? A while ago we discussed moving debugging/log4j manipulation
> logic out of the MRPipeline, perhaps we can use a single CrunchDebug
> utility for both.

I really like the idea of sticking the dot information in the
Configuration. In fact, one of the (several) issues I had, before you
mentioned that, was that there are actually a few different graphs
built up during the planning phase, and it would be interesting to
have access to all of them. Putting each of them into the
Configuration under its own key will resolve that.

I guess we don't need to worry about the API too much for now; if we
just populate the information in the Configuration, we can see how (or
if) we need to make a specific API around it when we get to that point.

- Gabriel
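
To make the Configuration idea above a bit more concrete, here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: the class name, the Edge type, the key prefix and the method names are hypothetical and are not existing Crunch API. A plain Map<String, String> stands in for the Hadoop Configuration (which offers comparable set/get methods for string values) so that the sketch runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these names exist in Crunch.
class CrunchDebugSketch {

    // Assumed key prefix; one key per graph built during planning.
    static final String DOT_KEY_PREFIX = "crunch.debug.dot.";

    // A directed edge between two named nodes of a plan graph.
    static final class Edge {
        final String from;
        final String to;

        Edge(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    // Wraps an identifier in double quotes, escaping embedded double quotes.
    static String quote(String id) {
        return "\"" + id.replace("\"", "\\\"") + "\"";
    }

    // Renders the edges as a DOT digraph; dot itself is never invoked.
    static String toDot(String graphName, List<Edge> edges) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph ").append(quote(graphName)).append(" {\n");
        for (Edge e : edges) {
            sb.append("  ").append(quote(e.from))
              .append(" -> ").append(quote(e.to)).append(";\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    // Stores the DOT text under a per-graph key in the configuration stand-in.
    static void storeDot(Map<String, String> conf, String graphName, String dot) {
        conf.put(DOT_KEY_PREFIX + graphName, dot);
    }

    // Returns the stored DOT text, or null if no graph with that name was stored.
    static String getDot(Map<String, String> conf, String graphName) {
        return conf.get(DOT_KEY_PREFIX + graphName);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        List<Edge> edges = Arrays.asList(
            new Edge("input", "parallelDo"),
            new Edge("parallelDo", "groupByKey"));
        storeDot(conf, "logical-plan", toDot("logical-plan", edges));
        // Users can redirect this to a file and run dot on it themselves.
        System.out.print(getDot(conf, "logical-plan"));
    }
}
```

Using one key per graph name is one way to keep the several planner graphs separately accessible; whether a dedicated API is needed on top of that can still be decided later.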
