[ https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guanying Wang updated MAPREDUCE-778:
------------------------------------
Attachment: same.py
anonymizer.py
An anonymizer implemented in Python is attached. It works with v20, v22, or
rumen log files. During anonymization it creates a private file of mapping
tables, which can be used to de-anonymize the anonymized trace. When working
with multiple traces, the tables file can be used in two ways: grown
incrementally across traces, or kept stand-alone for each trace.
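To make the idea of the tables file concrete, here is a minimal sketch of a
two-way mapping that can be saved and reloaded so that it grows across traces.
It is not the code in anonymizer.py; the class name AnonymizationTable, the
token format, and the JSON storage format are all assumptions made for
illustration.

```python
import json


class AnonymizationTable:
    """Hypothetical mapping from real values to opaque tokens, per category."""

    def __init__(self, mapping=None):
        # Layout (an assumption): {category: {real_value: token}}
        self.forward = {c: dict(m) for c, m in (mapping or {}).items()}

    def anonymize(self, category, value):
        """Return the token for value, allocating a new one if needed."""
        table = self.forward.setdefault(category, {})
        if value not in table:
            table[value] = "%s%d" % (category, len(table))
        return table[value]

    def deanonymize(self, category, token):
        """Reverse lookup; raises KeyError for an unknown token."""
        for value, tok in self.forward.get(category, {}).items():
            if tok == token:
                return value
        raise KeyError(token)

    def dumps(self):
        """Serialize the tables; this text is the private file's content."""
        return json.dumps(self.forward, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Reload saved tables so a later trace reuses the same tokens."""
        return cls(json.loads(text))
```

Starting each trace from an empty AnonymizationTable corresponds to the
stand-alone use; calling loads() on the tables saved from an earlier trace
corresponds to the incremental use, where the same value maps to the same token
in every trace.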
The other attachment, same.py, is a simple Python script that compares two
JSON-based trace files. It works like diff. Because JSON objects can be
semantically equivalent even when their dictionary keys appear in different
orders, running diff directly on the two files may not work as desired. The
script outputs nothing if the two files represent the same trace; otherwise it
prints the objects (which can be large) that differ between the two files. v22
and rumen log files can be compared using this script. Keys in v20 log files
have a fixed order, so v20 log files can be compared using diff directly.
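The comparison described above can be sketched as follows. This is not the code
in same.py; the function name differing_objects is hypothetical, and the sketch
assumes both traces have already been parsed into lists of Python objects. It
relies on the fact that Python dict equality ignores key order, while list
order still matters.

```python
import json
from itertools import zip_longest

# Sentinel so that a missing object is not confused with a JSON null.
_MISSING = object()


def differing_objects(objs_a, objs_b):
    """Yield (index, a, b) for each position where the parsed objects differ."""
    pairs = zip_longest(objs_a, objs_b, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a != b:
            yield index, a, b


# Same content, different key order: reported as identical.
left = [json.loads('{"jobID": "job_1", "tasks": [1, 2]}')]
right = [json.loads('{"tasks": [1, 2], "jobID": "job_1"}')]
for index, a, b in differing_objects(left, right):
    print(index, a, b)
```

The loop at the end prints nothing, which matches the "no output means same
trace" behaviour described above.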
Known issues:
1. In v22 and rumen-trace log files, multiple JSON objects are stored in one
file, separated by whitespace. Unlike the Java Jackson package, the load
functions of the Python json module expect a string or file to hold exactly
one JSON object. Currently, the scripts rely on detecting "}\n" as a whole line
to determine the end of a JSON object. That may fail if that pattern occurs
inside a string value. A better implementation would behave like Jackson: read
one object from the file and leave the rest of the file available for further
reads.
2. The sample rumen-trace and rumen-topology files were taken from
hadoop-mapreduce/src/test/tools/data/rumen/. These sample files seem to have
been generated from v20 log files, since "." is escaped as "\." in many fields.
I'm not sure whether rumen works with v22 log files, or whether rumen files
generated from v22 log files differ from those generated from v20 log files.
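For known issue 1, one possible direction is the standard library's
json.JSONDecoder.raw_decode, which parses a single JSON value starting at a
given offset and returns the offset where it stopped. The sketch below is not
what the attached scripts do; the function name iter_json_objects is
hypothetical, and it assumes the whole file fits in memory as one string, so it
is not a true streaming reader like Jackson.

```python
import json


def iter_json_objects(text):
    """Yield each whitespace-separated JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Because the decoder tracks string boundaries itself, a "}" followed by a
newline escape inside a string value does not end the object early.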
> Need a standalone JobHistory log anonymizer
> -------------------------------------------
>
> Key: MAPREDUCE-778
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Reporter: Hong Tang
> Attachments: anonymizer.py, same.py
>
>
> Job history logs contain a rich set of information that can help understand
> and characterize cluster workload and individual job execution. Examples of
> work that parses or utilizes job history include HADOOP-3585, MAPREDUCE-534,
> HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some of the parsing tools
> developed in previous work already contain a component to anonymize the
> logs. It would be nice to combine these efforts and have a common standalone
> tool that can anonymize job history logs and preserve much of the structure
> of the files so that existing tools on top of job history logs continue to
> work with no modification.