Re: [PATCH v3] teach fast-export an --anonymize option
On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <p...@peff.net> wrote:
> You can get an overview of what will be shared by running a command
> like:
>
>   git fast-export --anonymize --all | perl -pe 's/\d+/X/g' | sort -u | less
>
> which will show every unique line we generate, modulo any numbers
> (each anonymized token is assigned a number, like User 0, and we
> replace it consistently in the output).

I feel like this should be part of git-fast-export.txt, just to
increase the user's confidence in the tool (and I don't expect most
users to read this commit message).
-- 
Duy
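To make the effect of that pipeline concrete without touching a real repository, here is a minimal sketch run against a made-up fragment of an anonymized stream. The sample lines and the `collapse_numbers` helper name are assumptions for illustration only, and `sed -E` stands in for the `perl -pe` call quoted above.

```shell
#!/bin/sh
# Hypothetical helper: replace every run of digits with X, then keep
# only the unique lines (sed -E is swapped in for the thread's perl).
collapse_numbers() {
    sed -E 's/[0-9]+/X/g' | sort -u
}

# Made-up sample resembling anonymized fast-export output; this is not
# captured from a real run of git.
sample_stream() {
    cat <<'EOF'
author User 0 <user0@example.com> 1409245304 +0000
author User 1 <user1@example.com> 1409245999 +0000
M 100644 :3 path0/path1
M 100644 :7 path0/path2
EOF
}

# The four sample lines collapse to two distinct shapes.
sample_stream | collapse_numbers
```

Because every anonymized token differs only in its number, the condensed listing stays short even for a large history, which is what makes a by-eye privacy check practical.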
Re: [PATCH v3] teach fast-export an --anonymize option
On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:

> On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <p...@peff.net> wrote:
> > You can get an overview of what will be shared by running a command
> > like:
> >
> >   git fast-export --anonymize --all | perl -pe 's/\d+/X/g' | sort -u | less
> >
> > which will show every unique line we generate, modulo any numbers
> > (each anonymized token is assigned a number, like User 0, and we
> > replace it consistently in the output).
>
> I feel like this should be part of git-fast-export.txt, just to
> increase the user's confidence in the tool (and I don't expect most
> users to read this commit message).

Hmph. Whenever I say I think this patch is done, suddenly the comments
start pouring in. :)

I think you are right, though, and we could stand to explain the feature
a little more in the documentation in general. How about this patch on
top (or squashed in):

-- >8 --
Subject: docs/fast-export: explain --anonymize more completely

The original commit made mention of this option, but not why one might
want it or how they might use it. Let's try to be a little more
thorough, and also explain how to confirm that the output really is
anonymous.

Signed-off-by: Jeff King <p...@peff.net>
---
 Documentation/git-fast-export.txt | 63 ---
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 52831fa..dbe9a46 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,9 @@ marks the same across runs.
 	different from the commit's first parent).
 
 --anonymize::
-	Replace all refnames, paths, blob contents, commit and tag
-	messages, names, and email addresses in the output with
-	anonymized data, while still retaining the shape of history and
-	of the stored tree.
+	Anonymize the contents of the repository while still retaining
+	the shape of the history and stored tree. See the section on
+	`ANONYMIZING` below.
 
 --refspec::
 	Apply the specified refspec to each ref exported. Multiple of them can
@@ -147,6 +146,62 @@ referenced by that revision range contains the string
 'refs/heads/master'.
 
 
+ANONYMIZING
+-----------
+
+If the `--anonymize` option is given, git will attempt to remove all
+identifying information from the repository while still retaining enough
+of the original tree and history patterns to reproduce some bugs. The
+goal is that a git bug which is found on a private repository will
+persist in the anonymized repository, and the latter can be shared with
+git developers to help solve the bug.
+
+With this option, git will replace all refnames, paths, blob contents,
+commit and tag messages, names, and email addresses in the output with
+anonymized data. Two instances of the same string will be replaced
+equivalently (e.g., two commits with the same author will have the same
+anonymized author in the output, but bear no resemblance to the original
+author string). The relationship between commits, branches, and tags is
+retained, as well as the commit timestamps (but the commit messages and
+refnames bear no resemblance to the originals). The relative makeup of
+the tree is retained (e.g., if you have a root tree with 10 files and 3
+trees, so will the output), but their names and the contents of the
+files will be replaced.
+
+If you think you have found a git bug, you can start by exporting an
+anonymized stream of the whole repository:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all >anon-stream
+---------------------------------------------------
+
+Then confirm that the bug persists in a repository created from that
+stream (many bugs will not, as they really do depend on the exact
+repository contents):
+
+---------------------------------------------------
+$ git init anon-repo
+$ cd anon-repo
+$ git fast-import <../anon-stream
+$ ... test your bug ...
+---------------------------------------------------
+
+If the anonymized repository shows the bug, it may be worth sharing
+`anon-stream` along with a regular bug report. Note that the anonymized
+stream compresses very well, so gzipping it is encouraged. If you want
+to examine the stream to see that it does not contain any private data,
+you can peruse it directly before sending. You may also want to try:
+
+---------------------------------------------------
+$ perl -pe 's/\d+/X/g' anon-stream | sort -u | less
+---------------------------------------------------
+
+which shows all of the unique lines (with numbers converted to X, to
+collapse User 0, User 1, etc into User X). This produces a much
+smaller output, and it is usually easy to quickly confirm that there is
+no private data in the stream.
+
+
 Limitations
 -----------
 
-- 
2.1.0.346.ga0367b9
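The export and import steps described in the patch can be strung together as a small script. This is only a sketch under stated assumptions: `roundtrip_anonymized` is a hypothetical helper name, the second argument is a scratch directory of your choosing, and comparing commit counts is just one cheap sanity check that the history shape survived, not a substitute for re-testing the actual bug.

```shell
#!/bin/sh
# Sketch: round-trip a repository through an anonymized stream.
# roundtrip_anonymized is a hypothetical helper; $1 is the source
# repository, $2 is a scratch directory that will hold the results.
roundtrip_anonymized() {
    src=$1
    work=$2
    mkdir -p "$work" || return 1
    # Export every ref, anonymized, into a single stream file.
    git -C "$src" fast-export --anonymize --all >"$work/anon-stream" || return 1
    # Rebuild a fresh repository from that stream.
    git init -q "$work/anon-repo" || return 1
    git -C "$work/anon-repo" fast-import --quiet <"$work/anon-stream" || return 1
    # Report commit counts on both sides; they should match if the
    # shape of the history was retained.
    echo "original:   $(git -C "$src" rev-list --count --all)"
    echo "anonymized: $(git -C "$work/anon-repo" rev-list --count --all)"
}
```

A typical call would look like `roundtrip_anonymized /path/to/private-repo /tmp/anon-work`, after which the bug can be re-tested inside `/tmp/anon-work/anon-repo` before deciding whether `anon-stream` is worth sending.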
Re: [PATCH v3] teach fast-export an --anonymize option
On 28/08/14 13:32, Jeff King wrote:
> On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:
>
> > On Thu, Aug 28, 2014 at 12:01 AM, Jeff King <p...@peff.net> wrote:
> > > You can get an overview of what will be shared by running a
> > > command like:
> > >
> > >   git fast-export --anonymize --all | perl -pe 's/\d+/X/g' | sort -u | less
> > >
> > > which will show every unique line we generate, modulo any numbers
> > > (each anonymized token is assigned a number, like User 0, and we
> > > replace it consistently in the output).
> >
> > I feel like this should be part of git-fast-export.txt, just to
> > increase the user's confidence in the tool (and I don't expect most
> > users to read this commit message).
>
> Hmph. Whenever I say I think this patch is done, suddenly the comments
> start pouring in. :)

:-D

> I think you are right, though, and we could stand to explain the feature
> a little more in the documentation in general. How about this patch on
> top (or squashed in):
>
> -- >8 --
> Subject: docs/fast-export: explain --anonymize more completely
>
> The original commit made mention of this option, but not why one might
> want it or how they might use it. Let's try to be a little more
> thorough, and also explain how to confirm that the output really is
> anonymous.
>
> Signed-off-by: Jeff King <p...@peff.net>
> ---
>  Documentation/git-fast-export.txt | 63 ---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 52831fa..dbe9a46 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -106,10 +106,9 @@ marks the same across runs.
>  	different from the commit's first parent).
>  
>  --anonymize::
> -	Replace all refnames, paths, blob contents, commit and tag
> -	messages, names, and email addresses in the output with
> -	anonymized data, while still retaining the shape of history and
> -	of the stored tree.
> +	Anonymize the contents of the repository while still retaining
> +	the shape of the history and stored tree. See the section on
> +	`ANONYMIZING` below.
>  
>  --refspec::
>  	Apply the specified refspec to each ref exported. Multiple of them can
> @@ -147,6 +146,62 @@ referenced by that revision range contains the string
>  'refs/heads/master'.
>  
>  
> +ANONYMIZING
> +-----------
> +
> +If the `--anonymize` option is given, git will attempt to remove all
> +identifying information from the repository while still retaining enough
> +of the original tree and history patterns to reproduce some bugs. The
> +goal is that a git bug which is found on a private repository will

s/goal/hope/ ;-)

> +persist in the anonymized repository, and the latter can be shared with
> +git developers to help solve the bug.
> +
> +With this option, git will replace all refnames, paths, blob contents,
> +commit and tag messages, names, and email addresses in the output with
> +anonymized data. Two instances of the same string will be replaced
> +equivalently (e.g., two commits with the same author will have the same
> +anonymized author in the output, but bear no resemblance to the original
> +author string). The relationship between commits, branches, and tags is
> +retained, as well as the commit timestamps (but the commit messages and
> +refnames bear no resemblance to the originals). The relative makeup of
> +the tree is retained (e.g., if you have a root tree with 10 files and 3
> +trees, so will the output), but their names and the contents of the
> +files will be replaced.
> +
> +If you think you have found a git bug, you can start by exporting an
> +anonymized stream of the whole repository:
> +
> +---------------------------------------------------
> +$ git fast-export --anonymize --all >anon-stream
> +---------------------------------------------------
> +
> +Then confirm that the bug persists in a repository created from that
> +stream (many bugs will not, as they really do depend on the exact
> +repository contents):

Dumb question (I have not even read the patch, so please just ignore me
if this is indeed dumb!): Is the map of original-name, anonymized-name
available to the user while he attempts to confirm that the bug is still
present?

For example, if I anonymized git.git, and did 'git branch -v' (say), how
easy would it be for me to recognise which branch was 'next'?

ATB,
Ramsay Jones
Re: [PATCH v3] teach fast-export an --anonymize option
Jeff King <p...@peff.net> writes:

> Subject: docs/fast-export: explain --anonymize more completely
>
> The original commit made mention of this option, but not why one might
> want it or how they might use it. Let's try to be a little more
> thorough, and also explain how to confirm that the output really is
> anonymous.
>
> Signed-off-by: Jeff King <p...@peff.net>
> ---
>  Documentation/git-fast-export.txt | 63 ---
>  1 file changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
> index 52831fa..dbe9a46 100644
> --- a/Documentation/git-fast-export.txt
> +++ b/Documentation/git-fast-export.txt
> @@ -106,10 +106,9 @@ marks the same across runs.
>  	different from the commit's first parent).
>  
>  --anonymize::
> -	Replace all refnames, paths, blob contents, commit and tag
> -	messages, names, and email addresses in the output with
> -	anonymized data, while still retaining the shape of history and
> -	of the stored tree.
> +	Anonymize the contents of the repository while still retaining
> +	the shape of the history and stored tree. See the section on
> +	`ANONYMIZING` below.

Technically s/tree/trees/, I would think. For a repository with
multiple branches, perhaps s/history/histories/, too, but I would not
insist on that ;-).

> +ANONYMIZING
> +-----------
> +
> +If the `--anonymize` option is given, git will attempt to remove all
> +identifying information from the repository while still retaining enough
> +of the original tree and history patterns to reproduce some bugs. The
> +goal is that a git bug which is found on a private repository will
> +persist in the anonymized repository, and the latter can be shared with
> +git developers to help solve the bug.
> +
> +With this option, git will replace all refnames, paths, blob contents,
> +commit and tag messages, names, and email addresses in the output with
> +anonymized data. Two instances of the same string will be replaced
> +equivalently (e.g., two commits with the same author will have the same
> +anonymized author in the output, but bear no resemblance to the original
> +author string). The relationship between commits, branches, and tags is
> +retained, as well as the commit timestamps (but the commit messages and
> +refnames bear no resemblance to the originals). The relative makeup of
> +the tree is retained (e.g., if you have a root tree with 10 files and 3
> +trees, so will the output), but their names and the contents of the
> +files will be replaced.

While I do not think I or anybody who would ask other people to use
this option would be confused, the phrase "the same string" may risk
unnecessary worries from those who are asked to trust this option. I
am not yet convinced that it is unlikely for the reader to read the
above and imagine that the anonymiser may go word by word, replacing
the same string with the same anonymised gibberish (which would be
susceptible to old-school cryptanalysis techniques).

Among the ones listed, refnames, blob contents, commit messages and
tag messages are each converted as a single string, and I wish I could
think of phrasing to stress that point somehow.

Each path component in paths is converted as a single string, so we
can tell from two anonymised paths whether they refer to blobs in the
same directory in the original. This is a good thing, of course, but it
shows that among those listed in "refnames, paths, blob contents, ..."
in a flat sentence, some are treated as a single token for replacement
but not others, and it is hard for a reader to tell which one is which,
unless the reader knows the internals of Git, i.e. what kind of things
we as the debuggers-of-Git would want to preserve.

Isn't the unit for human identity anonymisation even more coarse? If
it is not, should it be?

In other words, do Junio C Hamano <ju...@pobox.com> and Junio C Hamano
<gits...@pobox.com> map to one gibberish human readable name with two
gibberish e-mail addresses, or 2 User$n <user$n>? Is the fact that this
organization seems to allocate two e-mails to each developer something
this organization may want to hide from the public (and something we as
the Git debuggers would not benefit from knowing)?
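The per-component treatment of paths described above can be illustrated with a toy mapping. This is a sketch for intuition only, not git's code: `anonymize_paths` is a hypothetical helper, the `pathN` naming is an assumption, and the input paths are made up.

```shell
#!/bin/sh
# Toy illustration: each distinct path component gets the next number,
# and a repeated component reuses the number it was given earlier.
anonymize_paths() {
    awk -F/ '{
        out = ""
        for (i = 1; i <= NF; i++) {
            if (!($i in seen)) seen[$i] = count++
            out = out (i > 1 ? "/" : "") "path" seen[$i]
        }
        print out
    }'
}

printf '%s\n' \
    'src/secret/main.c' \
    'src/secret/util.c' \
    'docs/main.c' |
anonymize_paths
# prints:
#   path0/path1/path2
#   path0/path1/path3
#   path4/path2
```

The first two outputs share the prefix `path0/path1`, so a reader can still tell that the two blobs lived in the same directory, while nothing about the original names is recoverable from the output alone.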
Re: [PATCH v3] teach fast-export an --anonymize option
Ramsay Jones <ram...@ramsay1.demon.co.uk> writes:

> Dumb question (I have not even read the patch, so please just ignore me
> if this is indeed dumb!): Is the map of original-name, anonymized-name
> available to the user while he attempts to confirm that the bug is still
> present?
>
> For example, if I anonymized git.git, and did 'git branch -v' (say), how
> easy would it be for me to recognise which branch was 'next'?

It is not dumb but actually is a very good point.

There needs to be an easy way for the reporting user to turn an
observation such as "When I do 'git log master..next' I see this one
extraneous commit shown" into a corresponding statement to accompany
the anonymised output. The user needs it to make sure that the symptom
reproduces in the anonymised repository in order to decide if it is
even worthwhile to send the output for analysis in the first place.
Re: [PATCH v3] teach fast-export an --anonymize option
On Thu, Aug 28, 2014 at 05:46:15PM +0100, Ramsay Jones wrote:

> Dumb question (I have not even read the patch, so please just ignore me
> if this is indeed dumb!): Is the map of original-name, anonymized-name
> available to the user while he attempts to confirm that the bug is still
> present?

No, it's not.

> For example, if I anonymized git.git, and did 'git branch -v' (say), how
> easy would it be for me to recognise which branch was 'next'?

You can't, really. The simplest thing would be to pare down your
repository to the minimum number of branches before anonymizing.

It might make sense to have an option to dump the maps we've stored to a
separate file (in theory, you could even load them back in and do an
incremental anonymized export[1]). I think I'd rather wait on
implementing that until we see more real-world use cases (but as always,
I'm happy to review if somebody wants to pick it up).

-Peff

[1] Incremental anonymization is not something I think is worth
    supporting by itself. However, there may be some value in being able
    to anonymize two similar repositories using the same mappings. For
    instance, a repository and its clone.
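One way to approximate "paring down" without deleting branches from the private repository is to export only the refs involved in the bug, since fast-export accepts revision arguments in place of `--all`. The sketch below assumes that approach is acceptable for the bug at hand; `export_selected_refs` is a hypothetical helper name, and this is an alternative to the pruning suggested above rather than what the patch itself documents.

```shell
#!/bin/sh
# Sketch: anonymize only the refs needed to show the bug.
# export_selected_refs is a hypothetical helper; every argument after
# the repository path is handed to fast-export as a revision to export.
export_selected_refs() {
    repo=$1
    shift
    git -C "$repo" fast-export --anonymize "$@"
}

# Example (branch names are placeholders):
#   export_selected_refs /path/to/private-repo master next >anon-stream
```

With only two branches in the resulting repository, working out which anonymized ref corresponds to which original becomes a matter of comparing commit counts or tip timestamps rather than guessing among dozens of refs.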
Re: [PATCH v3] teach fast-export an --anonymize option
On Thu, Aug 28, 2014 at 11:11:47AM -0700, Junio C Hamano wrote:

> > +	Anonymize the contents of the repository while still retaining
> > +	the shape of the history and stored tree. See the section on
> > +	`ANONYMIZING` below.
>
> Technically s/tree/trees/, I would think. For a repository with
> multiple branches, perhaps s/history/histories/, too, but I would not
> insist on that ;-).

Sure, I think both of those are fine (I meant "tree" here to refer to
the general notion of a set of paths over time, not a particular tree
object).

> > +With this option, git will replace all refnames, paths, blob contents,
> > +commit and tag messages, names, and email addresses in the output with
> > +anonymized data. Two instances of the same string will be replaced
> > +equivalently (e.g., two commits with the same author will have the same
> > +anonymized author in the output, but bear no resemblance to the original
> > +author string). The relationship between commits, branches, and tags is
> > +retained, as well as the commit timestamps (but the commit messages and
> > +refnames bear no resemblance to the originals). The relative makeup of
> > +the tree is retained (e.g., if you have a root tree with 10 files and 3
> > +trees, so will the output), but their names and the contents of the
> > +files will be replaced.
>
> While I do not think I or anybody who would ask other people to use
> this option would be confused, the phrase "the same string" may risk
> unnecessary worries from those who are asked to trust this option. I
> am not yet convinced that it is unlikely for the reader to read the
> above and imagine that the anonymiser may go word by word, replacing
> the same string with the same anonymised gibberish (which would be
> susceptible to old-school cryptanalysis techniques).

I tried to use phrases like "bears no resemblance" to indicate that the
mapping was not leaking information. Does it bear a separate paragraph
explaining the transformation (I was trying to avoid that because it is
necessarily intimately linked with the particular implementation
chosen)?

> Among the ones listed, refnames, blob contents, commit messages and
> tag messages are each converted as a single string, and I wish I could
> think of phrasing to stress that point somehow.

Maybe a separate paragraph like:

  Note that the replacement strings are chosen with no input from the
  original strings. There is no cryptography or other tricks involved,
  but rather we make up a new string like "message 123", replace a
  particular commit message with it, and then use the mapping between
  the two for the rest of the output. Thus, no information about the
  original commit message is leaked, and only the internal mapping
  (which is not part of the output stream) could reverse the
  transformation.

> Each path component in paths is converted as a single string, so we
> can tell from two anonymised paths whether they refer to blobs in the
> same directory in the original. This is a good thing, of course, but it
> shows that among those listed in "refnames, paths, blob contents, ..."
> in a flat sentence, some are treated as a single token for replacement
> but not others, and it is hard for a reader to tell which one is which,
> unless the reader knows the internals of Git, i.e. what kind of things
> we as the debuggers-of-Git would want to preserve.

Yes, I was really trying not to get into those details, because I do not
think they matter to most callers and are subject to change as we come
up with better heuristics. I do not even want to promise an
implementation like "no tricky cryptography" above, because we may think
of a more interesting way to transform components.

> Isn't the unit for human identity anonymisation even more coarse? If
> it is not, should it be?
>
> In other words, do Junio C Hamano <ju...@pobox.com> and Junio C Hamano
> <gits...@pobox.com> map to one gibberish human readable name with two
> gibberish e-mail addresses, or 2 User$n <user$n>? Is the fact that this
> organization seems to allocate two e-mails to each developer something
> this organization may want to hide from the public (and something we as
> the Git debuggers would not benefit from knowing)?

The ident mapping takes a single "Name <email>" string and converts it
into a "User X <us...@example.com>" string. So no, we are not leaking
the fact that one name has multiple emails.

I actually started down that path, but gave it up, as it could produce
entries like "User 3 <ema...@example.com>" which were downright
confusing. Plus I did not think that would be a useful thing for
debuggers to know, and replacing the whole string is simpler (I also
entertained the idea of just blanking _all_ idents; what I expect to be
of primary use here is the history shape, and I doubt that a bug would
be triggered by the pattern of usernames but not their actual content).

-Peff
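The whole-ident replacement described here can be mimicked with a counter and a lookup table. The following is a toy sketch, not the code from the patch: `anonymize_idents` is a hypothetical helper, the exact `userN@example.com` address format is an assumption, and the input idents are invented.

```shell
#!/bin/sh
# Toy illustration: every distinct "Name <email>" line is assigned the
# next counter value; a repeated line reuses its earlier value. The
# replacement is derived from the counter only, never from the input.
anonymize_idents() {
    awk '{
        if (!($0 in seen)) seen[$0] = count++
        printf "User %d <user%d@example.com>\n", seen[$0], seen[$0]
    }'
}

printf '%s\n' \
    'Alice Example <alice@corp.test>' \
    'Bob Example <bob@corp.test>' \
    'Alice Example <alice@corp.test>' \
    'Alice Example <alice@home.test>' |
anonymize_idents
# prints:
#   User 0 <user0@example.com>
#   User 1 <user1@example.com>
#   User 0 <user0@example.com>
#   User 2 <user2@example.com>
```

Note the last line: the same name with a different address becomes an unrelated `User 2`, so the output does not reveal that one person used two addresses, which matches the behaviour described in the reply above.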