[ 
https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852754#action_12852754
 ] 

Hong Tang commented on MAPREDUCE-778:
-------------------------------------

Guanying, thanks for putting in the effort.

Although it seems versatile to have the tool parse all types of formats, I 
am concerned that the effort of maintaining such versatility may outweigh its 
potential usefulness. I think it is preferable to implement the tool on 
top of the Rumen API (and probably as part of Rumen). There are a number of 
reasons why this makes sense:

- As we discovered in Rumen development, parsing job history is not trivial and 
the format may continue to evolve in the near future (the data model is not 
cleanly defined as-is IMO, see MAPREDUCE-1175). So it is advantageous to let 
Rumen be the only module that interfaces with the different variations of the 
job history format and presents a common abstraction of job history.
- Job history contains more than the basic information about job execution; it 
also contains things like status strings and counters, and we have less 
control over which fields may be added to job history logs over time. So it 
would be challenging to keep the anonymizer up to date with high 
confidence that it does not leak any private info. On the other hand, since 
Rumen only extracts a subset of the info from the job history logs, we can 
easily enumerate every field of the Rumen JSON objects to be sure any 
sensitive fields are anonymized.
- Currently some of the info we want about job execution is only available in 
the jobconf XML file (such as the queue name). Rumen already does the job of 
combining the two sources, so building the anonymizer on top of Rumen saves 
the effort of writing another configuration parser.
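
To illustrate the "enumerate every field" point, here is a minimal sketch of 
an anonymizer that refuses to emit any field it has not explicitly classified. 
The field names are hypothetical and are not taken from Rumen's actual JSON 
schema, and the hashing scheme is only a placeholder for whatever the real 
tool would use.

```python
import hashlib

# Hypothetical field classification for a Rumen-style job record.
# These names are illustrative assumptions, not Rumen's real schema.
SENSITIVE = {"user", "jobName", "queue"}
SAFE = {"jobID", "submitTime", "finishTime", "totalMaps", "totalReduces"}


def pseudonym(kind, value):
    """Map a sensitive value to a stable, opaque token.

    Placeholder scheme: an unsalted hash is guessable by dictionary attack,
    so a real tool would likely use a keyed hash or a private lookup table.
    """
    digest = hashlib.sha256(f"{kind}:{value}".encode("utf-8")).hexdigest()
    return f"{kind}-{digest[:8]}"


def anonymize_job(job):
    """Return a copy of `job` in which every field has been classified.

    Unknown fields raise an error instead of passing through, so a field
    added to the format later cannot leak silently.
    """
    out = {}
    for key, value in job.items():
        if key in SENSITIVE:
            out[key] = pseudonym(key, value)
        elif key in SAFE:
            out[key] = value
        else:
            raise ValueError(f"unclassified field: {key!r}")
    return out
```

Failing on unclassified fields is the design choice that matters here: it 
turns "a new field appeared" into a visible error rather than a possible leak.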

Other comments:
- I like the idea of storing a private "lookup table" and keeping the ability 
to "de-anonymize" the trace if we choose to.
- The coverage of anonymized fields and the way they are anonymized 
look good to me. (We need to add the "queue" entity, and I do not think we 
need the "/path" type.)

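As a rough sketch of the private "lookup table" idea (this is not the code in 
the attached anonymizer.py; the class and method names are made up for 
illustration), each entity type gets sequential pseudonyms and the reverse 
mapping is kept so the trace can be de-anonymized later:

```python
import json


class LookupTable:
    """Assigns sequential pseudonyms per entity type and remembers the mapping.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self.forward = {}  # (kind, original) -> pseudonym
        self.reverse = {}  # pseudonym -> original

    def anonymize(self, kind, original):
        key = (kind, original)
        if key not in self.forward:
            # Number tokens per kind: user-0, user-1, queue-0, ...
            count = sum(1 for k in self.forward if k[0] == kind)
            token = f"{kind}-{count}"
            self.forward[key] = token
            self.reverse[token] = original
        return self.forward[key]

    def deanonymize(self, token):
        return self.reverse[token]

    def dumps(self):
        """Serialize the private table; keep it separate from the trace."""
        return json.dumps(self.reverse, sort_keys=True)
```

For example, anonymize("user", "alice") always returns the same token within 
one table, and deanonymize() on that token gives back "alice". Only the 
anonymized trace would be shared; the serialized table stays private.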

> Need a standalone JobHistory log anonymizer
> -------------------------------------------
>
>                 Key: MAPREDUCE-778
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>         Attachments: anonymizer.py, same.py
>
>
> Job history logs contain a rich set of information that can help understand 
> and characterize cluster workload and individual job execution. Examples of 
> work that parses or utilizes job history include HADOOP-3585, MAPREDUCE-534, 
> HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some of the parsing tools 
> developed in previous work already contain a component to anonymize the 
> logs. It would be nice to combine these efforts and have a common standalone 
> tool that anonymizes job history logs and preserves much of the structure 
> of the files, so that existing tools built on top of job history logs 
> continue to work with no modification.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
