Hi,

We're grad students at UC Berkeley working on a project to instrument Hadoop using an open-source path-based tracing framework called X-Trace (www.x-trace.net/wiki). X-Trace captures causal dependencies between events in addition to timings, letting developers analyze not just performance but also the context and dependencies of various events. We have created a web-based trace analysis UI that shows the performance of different IPC calls, DFS operations, and phases of a MapReduce job. The goal is to let users easily spot the origin of unusual behavior in a running system from a central location. We believe this kind of tracing can be useful for performance tuning and debugging in both development and production environments.

We'd like to get feedback on our work and suggestions on what trace analyses would be useful to Hadoop developers and users. Some of the reports we currently generate include machine utilization over time, relative performance of different tasks, and performance of DFS operations. You can see an example set of reports at http://www.cs.berkeley.edu/~matei/xtrace_sample_task.html (this is a trace of a Nutch indexing job). You can also read our project journal at http://radlab.cs.berkeley.edu/wiki/Projects/Monitoring_Hadoop_through_Tracing. We've already spotted some interesting issues, such as map tasks and DFS reads/writes that are an order of magnitude slower than average, and we are investigating possible causes. Most importantly, the UI lets a user easily see where the system is spending time and reason about how to tune it, and it provides much more information than the progress data in the JobTracker UI. As a Hadoop developer, what kinds of questions about running jobs would you like answered that are hard to obtain from process logs alone?

Once we've had a discussion about features for a trace analysis UI, we would like to contribute our work to the Hadoop codebase. We will create a JIRA issue and submit a patch adding this functionality. We're also interested in seeing whether we can integrate X-Trace logging more tightly with the existing Apache logging in Hadoop.

Finally, we are currently experimenting on relatively small (<50-node) clusters here at Berkeley, but we would really like to try tracing some large (>1000-node) clusters. If anyone is interested in evaluating performance on such a cluster, we would be very happy to discuss how to set up X-Trace and to provide you with a patch.

Thanks,

Andy Konwinski and Matei Zaharia
