[jira] Updated: (MAPREDUCE-778) Need a standalone JobHistory log anonymizer

Guanying Wang (JIRA) Wed, 31 Mar 2010 20:22:54 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Guanying Wang updated MAPREDUCE-778:
------------------------------------

    Attachment: same.py
                anonymizer.py

Attached is an anonymizer implemented in Python. It works with v20, v22, and 
rumen log files. While anonymizing, it creates a private file of mapping tables 
that can later be used to de-anonymize the anonymized trace. When working with 
multiple traces, the tables file can be used in two ways: either grown 
incrementally or created stand-alone.
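
To illustrate the idea of a tables file, here is a minimal sketch. It is not 
the code in anonymizer.py; the class name, the token format, and the json 
layout of the tables are all assumptions made for this example.

```python
import json


class AnonymizationTable:
    """Hypothetical helper: maps original values to opaque tokens and back."""

    def __init__(self, tables=None):
        # Layout (an assumption): {kind: {original_value: token}}
        self.tables = {k: dict(v) for k, v in (tables or {}).items()}

    def anonymize(self, kind, value):
        # Reuse the existing token if the value was seen before,
        # otherwise allocate the next token for this kind.
        table = self.tables.setdefault(kind, {})
        if value not in table:
            table[value] = "%s_%d" % (kind, len(table))
        return table[value]

    def deanonymize(self, kind, token):
        # Reverse lookup; a linear scan is enough for a sketch.
        for value, tok in self.tables.get(kind, {}).items():
            if tok == token:
                return value
        raise KeyError(token)

    def dumps(self):
        # Serialized form of the private tables file.
        return json.dumps(self.tables, sort_keys=True)

    @classmethod
    def loads(cls, text):
        return cls(json.loads(text))
```

Loading a saved table before processing the next trace corresponds to the 
incremental use (the same value gets the same token across traces); starting 
from an empty AnonymizationTable() for each trace corresponds to the 
stand-alone use.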

The other attachment, same.py, is a simple Python script that compares two 
json-based trace files. It works much like diff. Because json objects can be 
semantically equivalent even when the keys in their dictionaries appear in 
different orders, running diff directly on two files may not work as desired. 
The script outputs nothing if the two files represent the same trace; otherwise 
it prints the objects that differ between the two files (these can be large). 
v22 and rumen log files can be compared using this script. Keys in v20 log 
files have a fixed order, so v20 log files can be compared using diff directly.
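
The order-insensitive comparison can be sketched as follows. This is only an 
illustration of the technique, not the contents of same.py; the function name 
differing_objects is made up for this example, and it assumes each trace has 
already been parsed into a list of Python objects.

```python
import json
from itertools import zip_longest

_MISSING = object()  # marks a position present in only one trace


def differing_objects(objs_a, objs_b):
    """Return (index, a, b) for every position where the traces differ.

    Parsed json objects are Python dicts, and dict equality ignores key
    order, so semantically equal objects compare equal here.
    """
    diffs = []
    pairs = zip_longest(objs_a, objs_b, fillvalue=_MISSING)
    for index, (a, b) in enumerate(pairs):
        if a is _MISSING or b is _MISSING or a != b:
            diffs.append((index,
                          None if a is _MISSING else a,
                          None if b is _MISSING else b))
    return diffs


left = [json.loads('{"jobID": "job_1", "tasks": [1, 2]}')]
right = [json.loads('{"tasks": [1, 2], "jobID": "job_1"}')]
for index, a, b in differing_objects(left, right):
    print(index, a, b)  # nothing is printed: the objects are equivalent
```

Note that list order still matters in this sketch: [1, 2] and [2, 1] are 
reported as different.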

Known issues:

1. In v22 and rumen-trace log files, multiple json objects are stored in one 
file, separated by white space. Unlike the Java Jackson package, the Python 
json module can only load a single json object from a string or a file. 
Currently, the scripts detect the end of a json object by looking for a line 
consisting solely of "}". That may fail if this pattern occurs inside a string 
value. A better implementation would work the way Jackson does: read one object 
from the file and leave the rest of the file available for further reads.

2. The sample rumen-trace and rumen-topology files were taken from 
hadoop-mapreduce/src/test/tools/data/rumen/. These sample files seem to have 
been generated from v20 log files, since "." is escaped as "\." in many fields. 
I'm not sure whether rumen works with v22 log files, or whether rumen files 
generated from v22 log files differ from those generated from v20 log files.
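
Regarding known issue 1, one possible way to read several concatenated json 
objects without relying on line patterns is json.JSONDecoder.raw_decode from 
the standard library, which parses one object and returns the position where 
it ended. The sketch below is a suggestion, not what the attached scripts do; 
the function name iter_json_objects is made up, and it assumes the whole file 
fits in memory as one string.

```python
import json


def iter_json_objects(text):
    """Yield each top-level json value found in text, one after another."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not skip leading white space, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        # Parse one object starting at pos; pos moves to just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# The inner "}" sits on a line of its own, which a line-based end-of-object
# check would mistake for the end of the first object.
sample = '{\n"a": {\n"b": 1\n}\n}\n{"c": 2}\n'
objects = list(iter_json_objects(sample))
```

Because the decoder tracks nesting and string boundaries itself, a "}" inside 
a nested object or a string value does not end the object early.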


> Need a standalone JobHistory log anonymizer
> -------------------------------------------
>
>                 Key: MAPREDUCE-778
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-778
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Hong Tang
>         Attachments: anonymizer.py, same.py
>
>
> Job history logs contain a rich set of information that can help understand 
> and characterize cluster workload and individual job execution. Examples of 
> work that parses or utilizes job history include HADOOP-3585, MAPREDUCE-534, 
> HDFS-459, MAPREDUCE-728, and MAPREDUCE-776. Some of the parsing tools 
> developed in previous work already contain a component to anonymize the 
> logs. It would be nice to combine these efforts and have a common standalone 
> tool that can anonymize job history logs and preserve much of the structure 
> of the files so that existing tools on top of job history logs continue to 
> work with no modification.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
