Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Duy Nguyen
On Thu, Aug 28, 2014 at 12:01 AM, Jeff King p...@peff.net wrote:
 You can get an overview of what will be shared
 by running a command like:

   git fast-export --anonymize --all |
   perl -pe 's/\d+/X/g' |
   sort -u |
   less

 which will show every unique line we generate, modulo any
 numbers (each anonymized token is assigned a number, like
 User 0, and we replace it consistently in the output).

I feel like this should be part of git-fast-export.txt, just to
increase the user's confidence in the tool (and I don't expect most
users to read this commit message).
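
As a complement to the overview pipeline quoted above, one can also search
the stream for strings already known to be sensitive. The following is a
minimal sketch, not part of git: `check_anon_stream` is a hypothetical
helper, and the search strings in the usage note are placeholders.

```shell
# Hypothetical helper (not a git command): report any line of an exported
# stream that contains one of the given strings, case-insensitively.
# Returns 0 if nothing matched, 1 if something did, 2 on bad usage.
check_anon_stream() {
    stream=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: check_anon_stream <stream> <string>..." >&2
        return 2
    fi
    status=0
    for needle in "$@"; do
        # -F treats the string literally; -n prints the line number.
        if grep -n -i -F -e "$needle" -- "$stream"; then
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_anon_stream anon-stream 'Acme Corp' 'alice@'` prints
nothing and returns 0 when neither string occurs in the stream. A clean
result only covers the strings you thought to list, so it does not replace
reading the collapsed output.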
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Jeff King
On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:

 On Thu, Aug 28, 2014 at 12:01 AM, Jeff King p...@peff.net wrote:
  You can get an overview of what will be shared
  by running a command like:
 
git fast-export --anonymize --all |
perl -pe 's/\d+/X/g' |
sort -u |
less
 
  which will show every unique line we generate, modulo any
  numbers (each anonymized token is assigned a number, like
  User 0, and we replace it consistently in the output).
 
 I feel like this should be part of git-fast-export.txt, just to
 increase the user's confidence in the tool (and I don't expect most
 users to read this commit message).

Hmph. Whenever I say I think this patch is done, suddenly the comments
start pouring in. :)

I think you are right, though, and we could stand to explain
the feature a little more in the documentation in general.
How about this patch on top (or squashed in):

-- >8 --
Subject: docs/fast-export: explain --anonymize more completely

The original commit made mention of this option, but not why
one might want it or how they might use it. Let's try to be
a little more thorough, and also explain how to confirm that
the output really is anonymous.

Signed-off-by: Jeff King p...@peff.net
---
 Documentation/git-fast-export.txt | 63 ---
 1 file changed, 59 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
index 52831fa..dbe9a46 100644
--- a/Documentation/git-fast-export.txt
+++ b/Documentation/git-fast-export.txt
@@ -106,10 +106,9 @@ marks the same across runs.
    different from the commit's first parent).
 
 --anonymize::
-   Replace all refnames, paths, blob contents, commit and tag
-   messages, names, and email addresses in the output with
-   anonymized data, while still retaining the shape of history and
-   of the stored tree.
+   Anonymize the contents of the repository while still retaining
+   the shape of the history and stored tree.  See the section on
+   `ANONYMIZING` below.
 
 --refspec::
    Apply the specified refspec to each ref exported. Multiple of them can
@@ -147,6 +146,62 @@ referenced by that revision range contains the string
 'refs/heads/master'.
 
 
+ANONYMIZING
+-----------
+
+If the `--anonymize` option is given, git will attempt to remove all
+identifying information from the repository while still retaining enough
+of the original tree and history patterns to reproduce some bugs. The
+goal is that a git bug which is found on a private repository will
+persist in the anonymized repository, and the latter can be shared with
+git developers to help solve the bug.
+
+With this option, git will replace all refnames, paths, blob contents,
+commit and tag messages, names, and email addresses in the output with
+anonymized data.  Two instances of the same string will be replaced
+equivalently (e.g., two commits with the same author will have the same
+anonymized author in the output, but bear no resemblance to the original
+author string). The relationship between commits, branches, and tags is
+retained, as well as the commit timestamps (but the commit messages and
+refnames bear no resemblance to the originals). The relative makeup of
+the tree is retained (e.g., if you have a root tree with 10 files and 3
+trees, so will the output), but their names and the contents of the
+files will be replaced.
+
+If you think you have found a git bug, you can start by exporting an
+anonymized stream of the whole repository:
+
+---------------------------------------------------
+$ git fast-export --anonymize --all >anon-stream
+---------------------------------------------------
+
+Then confirm that the bug persists in a repository created from that
+stream (many bugs will not, as they really do depend on the exact
+repository contents):
+
+---------------------------------------------------
+$ git init anon-repo
+$ cd anon-repo
+$ git fast-import <../anon-stream
+$ ... test your bug ...
+---------------------------------------------------
+
+If the anonymized repository shows the bug, it may be worth sharing
+`anon-stream` along with a regular bug report. Note that the anonymized
+stream compresses very well, so gzipping it is encouraged. If you want
+to examine the stream to see that it does not contain any private data,
+you can peruse it directly before sending. You may also want to try:
+
+---------------------------------------------------
+$ perl -pe 's/\d+/X/g' anon-stream | sort -u | less
+---------------------------------------------------
+
+which shows all of the unique lines (with numbers converted to "X", to
+collapse "User 0", "User 1", etc into "User X"). This produces a much
+smaller output, and it is usually easy to quickly confirm that there is
+no private data in the stream.
+
+
 Limitations
 -----------
 
-- 
2.1.0.346.ga0367b9


Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Ramsay Jones
On 28/08/14 13:32, Jeff King wrote:
 On Thu, Aug 28, 2014 at 05:30:44PM +0700, Duy Nguyen wrote:
 
 On Thu, Aug 28, 2014 at 12:01 AM, Jeff King p...@peff.net wrote:
 You can get an overview of what will be shared
 by running a command like:

   git fast-export --anonymize --all |
   perl -pe 's/\d+/X/g' |
   sort -u |
   less

 which will show every unique line we generate, modulo any
 numbers (each anonymized token is assigned a number, like
 User 0, and we replace it consistently in the output).

 I feel like this should be part of git-fast-export.txt, just to
 increase the user's confidence in the tool (and I don't expect most
 users to read this commit message).
 
 Hmph. Whenever I say I think this patch is done, suddenly the comments
 start pouring in. :)

:-D

 I think you are right, though, and we could stand to explain
 the feature a little more in the documentation in general.
 How about this patch on top (or squashed in):
 
 -- 8 --
 Subject: docs/fast-export: explain --anonymize more completely
 
 The original commit made mention of this option, but not why
 one might want it or how they might use it. Let's try to be
 a little more thorough, and also explain how to confirm that
 the output really is anonymous.
 
 Signed-off-by: Jeff King p...@peff.net
 ---
  Documentation/git-fast-export.txt | 63 ---
  1 file changed, 59 insertions(+), 4 deletions(-)
 
 diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
 index 52831fa..dbe9a46 100644
 --- a/Documentation/git-fast-export.txt
 +++ b/Documentation/git-fast-export.txt
 @@ -106,10 +106,9 @@ marks the same across runs.
   different from the commit's first parent).
  
  --anonymize::
 - Replace all refnames, paths, blob contents, commit and tag
 - messages, names, and email addresses in the output with
 - anonymized data, while still retaining the shape of history and
 - of the stored tree.
 + Anonymize the contents of the repository while still retaining
 + the shape of the history and stored tree.  See the section on
 + `ANONYMIZING` below.
  
  --refspec::
   Apply the specified refspec to each ref exported. Multiple of them can
 @@ -147,6 +146,62 @@ referenced by that revision range contains the string
  'refs/heads/master'.
  
  
 +ANONYMIZING
 +---
 +
 +If the `--anonymize` option is given, git will attempt to remove all
 +identifying information from the repository while still retaining enough
 +of the original tree and history patterns to reproduce some bugs. The
 +goal is that a git bug which is found on a private repository will

s/goal/hope/ ;-)

 +persist in the anonymized repository, and the latter can be shared with
 +git developers to help solve the bug.
 +
 +With this option, git will replace all refnames, paths, blob contents,
 +commit and tag messages, names, and email addresses in the output with
 +anonymized data.  Two instances of the same string will be replaced
 +equivalently (e.g., two commits with the same author will have the same
 +anonymized author in the output, but bear no resemblance to the original
 +author string). The relationship between commits, branches, and tags is
 +retained, as well as the commit timestamps (but the commit messages and
 +refnames bear no resemblance to the originals). The relative makeup of
 +the tree is retained (e.g., if you have a root tree with 10 files and 3
 +trees, so will the output), but their names and the contents of the
 +files will be replaced.
 +
 +If you think you have found a git bug, you can start by exporting an
 +anonymized stream of the whole repository:
 +
 +---------------------------------------------------
 +$ git fast-export --anonymize --all >anon-stream
 +---------------------------------------------------
 +
 +Then confirm that the bug persists in a repository created from that
 +stream (many bugs will not, as they really do depend on the exact
 +repository contents):

Dumb question (I have not even read the patch, so please just ignore me
if this is indeed dumb!): Is the map of original-name, anonymized-name
available to the user while he attempts to confirm that the bug is still
present?

For example, if I anonymized git.git, and did 'git branch -v' (say), how
easy would it be for me to recognise which branch was 'next'?

ATB,
Ramsay Jones





Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 Subject: docs/fast-export: explain --anonymize more completely

 The original commit made mention of this option, but not why
 one might want it or how they might use it. Let's try to be
 a little more thorough, and also explain how to confirm that
 the output really is anonymous.

 Signed-off-by: Jeff King p...@peff.net
 ---
  Documentation/git-fast-export.txt | 63 ---
  1 file changed, 59 insertions(+), 4 deletions(-)

 diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
 index 52831fa..dbe9a46 100644
 --- a/Documentation/git-fast-export.txt
 +++ b/Documentation/git-fast-export.txt
 @@ -106,10 +106,9 @@ marks the same across runs.
   different from the commit's first parent).
  
  --anonymize::
 - Replace all refnames, paths, blob contents, commit and tag
 - messages, names, and email addresses in the output with
 - anonymized data, while still retaining the shape of history and
 - of the stored tree.
 + Anonymize the contents of the repository while still retaining
 + the shape of the history and stored tree.  See the section on
 + `ANONYMIZING` below.

Technically s/tree/trees/, I would think.  For a repository with
multiple branches, perhaps s/history/histories/, too, but I would
not insist on that ;-).

 +ANONYMIZING
 +---
 +
 +If the `--anonymize` option is given, git will attempt to remove all
 +identifying information from the repository while still retaining enough
 +of the original tree and history patterns to reproduce some bugs. The
 +goal is that a git bug which is found on a private repository will
 +persist in the anonymized repository, and the latter can be shared with
 +git developers to help solve the bug.
 +
 +With this option, git will replace all refnames, paths, blob contents,
 +commit and tag messages, names, and email addresses in the output with
 +anonymized data.  Two instances of the same string will be replaced
 +equivalently (e.g., two commits with the same author will have the same
 +anonymized author in the output, but bear no resemblance to the original
 +author string). The relationship between commits, branches, and tags is
 +retained, as well as the commit timestamps (but the commit messages and
 +refnames bear no resemblance to the originals). The relative makeup of
 +the tree is retained (e.g., if you have a root tree with 10 files and 3
 +trees, so will the output), but their names and the contents of the
 +files will be replaced.

While I do not think I or anybody who would ask other people to use
this option would be confused, the phrase "the same string" may risk
unnecessary worries from those who are asked to trust this option.

I am not yet convinced that it is unlikely for the reader to read
the above and imagine that the anonymiser may go word by word,
replacing the same string with the same anonymised gibberish
(which would be susceptible to old-school cryptanalysis
techniques).

Among the ones listed, refnames, blob contents, commit messages
and tag messages are each converted as a single string, and I wish I
could think of phrasing to stress that point somehow.

Each path component in paths is converted as a single string, so
we can tell from two anonymised paths whether they refer to blobs in
the same directory in the original.  This is a good thing, of course,
but it shows that among those listed in "refnames, paths, blob
contents, ..." in a flat sentence, some are treated as a single
token for replacement but not others, and it is hard for a reader
to tell which one is which, unless the reader knows the internals of
Git, i.e. what kind of things we as the debuggers-of-Git would want
to preserve.

Isn't the unit for human identity anonymisation even more coarse?
If it is not, should it be?

In other words, do "Junio C Hamano ju...@pobox.com" and "Junio C
Hamano gits...@pobox.com" map to one gibberish human readable name
with two gibberish e-mail addresses, or 2 "User$n user$n"?  Is the
fact that this organization seems to allocate two e-mails to each
developer something this organization may want to hide from the
public (and something we as the Git debuggers would not benefit from
knowing)?




Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Junio C Hamano
Ramsay Jones ram...@ramsay1.demon.co.uk writes:

 Dumb question (I have not even read the patch, so please just ignore me
 if this is indeed dumb!): Is the map of original-name, anonymized-name
 available to the user while he attempts to confirm that the bug is still
 present?

 For example, if I anonymized git.git, and did 'git branch -v' (say), how
 easy would it be for me to recognise which branch was 'next'?

It is not dumb but actually is a very good point.

There needs to be an easy way for the reporting user to turn an
observation such as "When I do 'git log master..next' I see this one
extraneous commit shown" into a corresponding statement to accompany
the anonymised output.  The user needs it to make sure that the
symptom reproduces in the anonymised repository in order to decide
if it is even worthwhile to send the output for analysis in the
first place.




Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Jeff King
On Thu, Aug 28, 2014 at 05:46:15PM +0100, Ramsay Jones wrote:

 Dumb question (I have not even read the patch, so please just ignore me
 if this is indeed dumb!): Is the map of original-name, anonymized-name
 available to the user while he attempts to confirm that the bug is still
 present?

No, it's not.

 For example, if I anonymized git.git, and did 'git branch -v' (say), how
 easy would it be for me to recognise which branch was 'next'?

You can't, really. The simplest thing would be to pare down your
repository to the minimum number of branches before anonymizing.

It might make sense to have an option to dump the maps we've stored to a
separate file (in theory, you could even load them back in and do an
incremental anonymized export[1]). I think I'd rather wait on
implementing that until we see more real-world use cases (but as always,
I'm happy to review if somebody wants to pick it up).

-Peff

[1] Incremental anonymization is not something I think is worth
supporting by itself. However, there may be some value in being able
to anonymize two similar repositories using the same mappings. For
instance, a repository and its clone.


Re: [PATCH v3] teach fast-export an --anonymize option

2014-08-28 Thread Jeff King
On Thu, Aug 28, 2014 at 11:11:47AM -0700, Junio C Hamano wrote:

  +   Anonymize the contents of the repository while still retaining
  +   the shape of the history and stored tree.  See the section on
  +   `ANONYMIZING` below.
 
 Technically s/tree/trees/, I would think.  For a repository with
 multiple branches, perhaps s/history/histories/, too, but I would
 not insist on that ;-).

Sure, I think both of those are fine (I meant "tree" here to refer to
the general notion of a set of paths over time, not a particular tree
object).

  +With this option, git will replace all refnames, paths, blob contents,
  +commit and tag messages, names, and email addresses in the output with
  +anonymized data.  Two instances of the same string will be replaced
  +equivalently (e.g., two commits with the same author will have the same
  +anonymized author in the output, but bear no resemblance to the original
  +author string). The relationship between commits, branches, and tags is
  +retained, as well as the commit timestamps (but the commit messages and
  +refnames bear no resemblance to the originals). The relative makeup of
  +the tree is retained (e.g., if you have a root tree with 10 files and 3
  +trees, so will the output), but their names and the contents of the
  +files will be replaced.
 
 While I do not think I or anybody who would ask other people to use
 this option would be confused, the phrase "the same string" may risk
 unnecessary worries from those who are asked to trust this option.
 
 I am not yet convinced that it is unlikely for the reader to read
 the above and imagine that the anonymiser may go word by word,
 replacing the same string with the same anonymised gibberish
 (which would be susceptible to old-school cryptanalysis
 techniques).

I tried to use phrases like "bears no resemblance" to indicate that the
mapping was not leaking information. Does it bear a separate paragraph
explaining the transformation? (I was trying to avoid that because it is
necessarily intimately linked with the particular implementation
chosen.)

 Among the ones listed, refnames, blob contents, commit messages
 and tag messages are each converted as a single string, and I wish I
 could think of phrasing to stress that point somehow.

Maybe a separate paragraph like:

  Note that the replacement strings are chosen with no input from the
  original strings. There is no cryptography or other tricks involved,
  but rather we make up a new string like "message 123", replace a
  particular commit message with it, and then use the mapping between
  the two for the rest of the output. Thus, no information about the
  original commit message is leaked, and only the internal mapping
  (which is not part of the output stream) could reverse the
  transformation.
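
To make the idea concrete, here is a rough sketch of such a mapping in bash
(4.0 or later, for associative arrays). It is not git's implementation,
which is written in C; the function name, the use of `REPLY` for the
result, and the exact "message N" format are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: every name here is hypothetical, not taken from git.
declare -A seen_messages   # original string -> made-up replacement
message_counter=0

# Sets REPLY to the replacement for the whole string "$1". The replacement
# is derived from a counter only, never from the content of the original.
anonymize_message() {
    local key="k:$1"       # prefix so that an empty message is a valid key
    if [[ -z ${seen_messages[$key]+set} ]]; then
        seen_messages[$key]="message $message_counter"
        message_counter=$((message_counter + 1))
    fi
    REPLY=${seen_messages[$key]}
}

anonymize_message "Fix the secret frobnicator"; echo "$REPLY"
anonymize_message "Add tests";                  echo "$REPLY"
anonymize_message "Fix the secret frobnicator"; echo "$REPLY"
```

The result is returned in `REPLY` rather than through `$(...)` because a
command substitution runs in a subshell, which would throw away the
updated counter and map. Only the `seen_messages` table, which never
leaves the process, connects a replacement to its original.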

 Each path component in paths is converted as a single string, so
 we can read from two anonymised paths if they refer to blobs in the
 same directory in the original.  This is a good thing, of course,
 but it shows that among those listed in refnames, paths, blob
 contents, ... in a flat sentence, some are treated as a single
 token for replacement but not others, and it is hard to tell for a
 reader which one is which, unless the reader knows the internals of
 Git, i.e. what kind of things we as the debuggers-of-Git would want
 to preserve.

Yes, I was really trying not to get into those details, because I do not
think they matter to most callers and are subject to change as we come
up with better heuristics. I do not even want to promise an
implementation like "no tricky cryptography" above, because we may think
of a more interesting way to transform components.
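
For illustration only, the per-component behaviour Junio describes could
look roughly like the bash sketch below (bash 4.0 or later). The function
name and the "pathN" naming are assumptions, and, as said above, the real
heuristics are subject to change.

```shell
#!/usr/bin/env bash
# Sketch only: names are hypothetical; paths containing newlines are not
# handled.
declare -A path_map        # original component -> made-up replacement
path_counter=0

# Sets REPLY to the anonymized form of the slash-separated path "$1",
# replacing each component independently of the others.
anonymize_path() {
    local component key out=
    local -a components
    IFS=/ read -r -a components <<<"$1"
    for component in "${components[@]}"; do
        key="c:$component"
        if [[ -z ${path_map[$key]+set} ]]; then
            path_map[$key]="path$path_counter"
            path_counter=$((path_counter + 1))
        fi
        out+="${out:+/}${path_map[$key]}"
    done
    REPLY=$out
}

anonymize_path "src/lib/a.c"; echo "$REPLY"
anonymize_path "src/lib/b.c"; echo "$REPLY"
```

Both outputs share their first two components, so a reader of the
anonymized stream can still see that the two files lived in the same
directory, without learning what the directory was called.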

 Isn't the unit for human identity anonymisation even more coarse?
 If it is not, should it be?
 
 In other words, do "Junio C Hamano ju...@pobox.com" and "Junio C
 Hamano gits...@pobox.com" map to one gibberish human readable name
 with two gibberish e-mail addresses, or 2 "User$n user$n"?  Is the
 fact that this organization seems to allocate two e-mails to each
 developer something this organization may want to hide from the
 public (and something we as the Git debuggers would not benefit from
 knowing)?

The ident mapping takes a single "Name <email>" string and converts it
into a "User X <us...@example.com>" string. So no, we are not leaking
the fact that one name has multiple emails. I actually started down that
path, but gave it up, as it could produce entries like "User 3
<ema...@example.com>" which were downright confusing. Plus I did not
think that would be a useful thing for debuggers to know, and replacing
the whole string is simpler (I also entertained the idea of just
blanking _all_ idents; what I expect to be of primary use here is the
history shape, and I doubt that a bug would be triggered by the pattern
of usernames but not their actual content).

-Peff