[ https://issues.apache.org/jira/browse/CRUNCH-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784369#comment-13784369 ]
Mike Zimmerman commented on CRUNCH-272:
---------------------------------------
Josh, I think that is a great first step. Micah and I had a conversation
offline about this JIRA after I logged it, and we walked through my use cases in
more detail. My target users are system administrators and developers who
are trying to monitor and tune Oozie workflows running on the Hadoop cluster.
The first part of the problem is figuring out a way to mark which jobs are
involved in a higher-level operation, such as a Crunch job launched through an
Oozie workflow. (Your suggestion may help do this.) The second and more
difficult part of the problem is locating these marked jobs after the parent
process has completed. My first thought is that it would be awesome if I could
query the JobTracker by giving it a correlation ID and have it return a list
of matching jobs. I don't believe this is possible today, and the idea is
also somewhat flawed because all of that data would be lost if the JobTracker
instance were restarted. My second thought is to harvest the information
through log data, but that seems like a lot of overhead and load on the cluster
to do something that should be relatively simple. My final thought is to write
custom code that logs this information to a queryable store while the Crunch
job is executing. Any recommendations you have are very much
appreciated. I believe the solution to this problem probably lies outside of
the Crunch project, so if you need to close this issue please feel free to do
so.
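
To make that last idea concrete, here is a minimal sketch in Java of what such
a store could look like. Everything in it is an assumption for illustration:
`CorrelationStore`, its `record` and `jobsFor` methods, and the sample IDs are
hypothetical names, not part of Crunch, Oozie, or Hadoop. The in-memory map
stands in for an external, durable store (a database table, for example), which
is what would actually be needed for the data to outlive the parent process.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: maps a correlation ID to the IDs of the jobs launched
 * on its behalf. The in-memory map is a stand-in for a durable external store.
 */
class CorrelationStore {
    private final Map<String, List<String>> jobsByCorrelationId = new LinkedHashMap<>();

    /** Record that jobId was launched as part of the operation correlationId. */
    synchronized void record(String correlationId, String jobId) {
        jobsByCorrelationId
            .computeIfAbsent(correlationId, k -> new ArrayList<>())
            .add(jobId);
    }

    /** Return a read-only copy of the job IDs recorded for correlationId, in recording order. */
    synchronized List<String> jobsFor(String correlationId) {
        List<String> jobs = jobsByCorrelationId.get(correlationId);
        if (jobs == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(new ArrayList<>(jobs));
    }

    public static void main(String[] args) {
        CorrelationStore store = new CorrelationStore();
        // Sample IDs are made up; a real caller would record each job ID
        // at the point where the pipeline submits the job.
        store.record("workflow-A", "job_0001");
        store.record("workflow-B", "job_0002");
        store.record("workflow-A", "job_0003");
        // Later, possibly after the parent process has completed:
        System.out.println(store.jobsFor("workflow-A"));
    }
}
```

The lookup is keyed on the correlation ID rather than the job ID because the
question an operator asks is "which jobs belonged to this workflow run?", and
keeping the mapping outside the JobTracker means a JobTracker restart does not
lose it.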
> Unable to correlate crunch jobs within Oozie
> --------------------------------------------
>
> Key: CRUNCH-272
> URL: https://issues.apache.org/jira/browse/CRUNCH-272
> Project: Crunch
> Issue Type: Improvement
> Reporter: Mike Zimmerman
>
> I'm not really sure if this should be logged to Oozie or to Crunch, so please
> feel free to move as needed.
> I would like to request a way to decorate map/reduce jobs that are spawned by
> a Crunch pipeline so that I can programmatically determine their origin. The
> primary use case for this is integration with Oozie. Oozie launches a single
> map job to run a Java action (in our case, this Java action runs a Crunch
> job). Tracing from this original "launcher" job to the jobs created by
> the Crunch job is impossible without trawling through logs. This leaves a big
> blind spot for the system operator trying to assess the performance/impact of
> these jobs.
> My initial thought was to provide a simple way to set a correlationId or
> similar on a map/reduce job and then make it queryable from within Oozie.
> Obviously, that request would have to come after the correlation feature
> was available within map/reduce.
--
This message was sent by Atlassian JIRA
(v6.1#6144)