[PATCH v3 3/6] merge-recursive: align labels with their respective code blocks

2018-06-09 Thread Elijah Newren
Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 09c249b93e..5538038178 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -998,10 +998,10 @@ static int update_file_flags(struct merge_options *o,
ret = err(o,
  _("do not know what to do with %06o %s '%s'"),
  mode, oid_to_hex(oid), path);
- free_buf:
+   free_buf:
free(buf);
}
- update_index:
+update_index:
if (!ret && update_cache)
if (add_cacheinfo(o, mode, oid, path, 0, update_wd,
  ADD_CACHE_OK_TO_ADD))
@@ -3326,7 +3326,7 @@ int merge_trees(struct merge_options *o,
entries->items[i].string);
}
 
-cleanup:
+   cleanup:
final_cleanup_renames(&re_info);
 
string_list_clear(entries, 1);
-- 
2.18.0.rc1.6.gffeb3ef585



[PATCH v3 6/6] merge-recursive: add pointer about unduly complex looking code

2018-06-09 Thread Elijah Newren
handle_change_delete() has a block of code displaying one of four nearly
identical messages.  Each contains about half a dozen variable
interpolations, which use nearly identical variables as well.  Someone
reading this code may be slowed down trying to work out the differences
and why they are there; help them out by adding a comment explaining the
differences.

Further, point out why this code structure hasn't been collapsed into
something more concise and readable for the programmer: we want to keep the
full messages intact in order to make translators' jobs much easier.

Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/merge-recursive.c b/merge-recursive.c
index 7cf11dc04c..284b27d895 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1476,6 +1476,21 @@ static int handle_change_delete(struct merge_options *o,
if (!ret)
ret = update_file(o, 0, o_oid, o_mode, update_path);
} else {
+   /*
+* Despite the four nearly duplicate messages and argument
+* lists below and the ugliness of the nested if-statements,
+* having complete messages makes the job easier for
+* translators.
+*
+* The slight variance among the cases is due to the fact
+* that:
+*   1) directory/file conflicts (in effect if
+*  !alt_path) could cause us to need to write the
+*  file to a different path.
+*   2) renames (in effect if !old_path) could mean that
+*  there are two names for the path that the user
+*  may know the file by.
+*/
if (!alt_path) {
if (!old_path) {
output(o, 1, _("CONFLICT (%s/delete): %s deleted in %s "
-- 
2.18.0.rc1.6.gffeb3ef585



[PATCH v3 0/6] merge-recursive code cleanups

2018-06-09 Thread Elijah Newren
This patch series contains several small code cleanups for
merge-recursive.

Changes since v2:
  * Now built on master (the topics this depended on have graduated);
merges cleanly with next and pu.
  * Patch #2 has a few additional argument alignment fixes; some in
code added by topics that have since graduated, and a couple that
were overlooked previously.
  * Patch #3 is entirely new.
  * Re-wrapped the commit message of the final patch.
  (range-diff below)

Elijah Newren (6):
  merge-recursive: fix miscellaneous grammar error in comment
  merge-recursive: fix numerous argument alignment issues
  merge-recursive: align labels with their respective code blocks
  merge-recursive: clarify the rename_dir/RENAME_DIR meaning
  merge-recursive: rename conflict_rename_*() family of functions
  merge-recursive: add pointer about unduly complex looking code

 merge-recursive.c | 186 ++
 1 file changed, 104 insertions(+), 82 deletions(-)

range-diff against v2:

 1: 4222f174de =  1: 4222f174de merge-recursive: fix miscellaneous grammar error in comment
 2: 4ebf93822d !  2: 284a5fee3d merge-recursive: fix numerous argument alignment issues
@@ -33,6 +33,76 @@
  {
struct path_hashmap_entry *entry;
int baselen = base->len;
+@@
+*/
+   if (would_lose_untracked(path))
+   return err(o, _("refusing to lose untracked file at '%s'"),
+-   path);
++ path);
+ 
+   /* Successful unlink is good.. */
+   if (!unlink(path))
+@@
+   unlink(path);
+   if (symlink(lnk, path))
+   ret = err(o, _("failed to symlink '%s': %s"),
+-  path, strerror(errno));
++path, strerror(errno));
+   free(lnk);
+   } else
+   ret = err(o,
+@@
+ }
+ 
+ static int find_first_merges(struct object_array *result, const char *path,
+-  struct commit *a, struct commit *b)
++   struct commit *a, struct commit *b)
+ {
+   int i, j;
+   struct object_array merges = OBJECT_ARRAY_INIT;
+@@
+ 
+   /* get all revisions that merge commit a */
+   xsnprintf(merged_revision, sizeof(merged_revision), "^%s",
+-  oid_to_hex(&a->object.oid));
++oid_to_hex(&a->object.oid));
+   init_revisions(&rev_opts, NULL);
+   rev_opts.submodule = path;
+   /* FIXME: can't handle linked worktrees in submodules yet */
+@@
+   output(o, 2, _("Found a possible merge resolution for the submodule:\n"));
+   print_commit((struct commit *) merges.objects[0].item);
+   output(o, 2, _(
+-  "If this is correct simply add it to the index "
+-  "for example\n"
+-  "by using:\n\n"
+-  "  git update-index --cacheinfo 160000 %s \"%s\"\n\n"
+-  "which will accept this suggestion.\n"),
+-  oid_to_hex(&merges.objects[0].item->oid), path);
++ "If this is correct simply add it to the index "
++ "for example\n"
++ "by using:\n\n"
++ "  git update-index --cacheinfo 160000 %s \"%s\"\n\n"
++ "which will accept this suggestion.\n"),
++ oid_to_hex(&merges.objects[0].item->oid), path);
+   break;
+ 
+   default:
+@@
+   result->clean = (merge_status == 0);
+   } else if (S_ISGITLINK(a->mode)) {
+   result->clean = merge_submodule(o, &result->oid,
+- one->path,
+- &one->oid,
+- &a->oid,
+- &b->oid);
++  one->path,
++  &one->oid,
++  &a->oid,
++  &b->oid);
+   } else if (S_ISLNK(a->mode)) {
+   switch (o->recursive_variant) {
+   case MERGE_RECURSIVE_NORMAL:
 @@
  }
  
 --: -- >  3: 6bae2a267f merge-recursive: align labels with their respective code blocks
 3: 585759f07a =  4: aecf1267d8 merge-recursive: clarify the rename_dir/RENAME_DIR meaning
 4: 3cfb8b01b8 =  5: f7637bef12 merge-recursive: rename conflict_rename_*() family of functions
 5: d2a24f5b38 !  6: ffeb3ef585 merge-recursive: add pointer about unduly complex looking code
@@ -3,10 +3,11 @@
 merge-recursive: add pointer about unduly complex looking code
 
 handle_change_delete() 

[PATCH v3 4/6] merge-recursive: clarify the rename_dir/RENAME_DIR meaning

2018-06-09 Thread Elijah Newren
We had an enum of rename types which included RENAME_DIR; this name felt
misleading since it was not about an entire directory but was a status for
each individual file add that occurred within a renamed directory.

Since this type is for signifying that the files in question were being
renamed due to directory rename detection, rename this enum value to
RENAME_VIA_DIR.

Make a similar change to the conflict_rename_dir() function, and add a
comment to the top of that function explaining its purpose (it may not be
quite as obvious as for the other conflict_rename_*() functions).

Acked-by: Johannes Schindelin 
Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 5538038178..910f0b70f0 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -181,7 +181,7 @@ static int oid_eq(const struct object_id *a, const struct object_id *b)
 
 enum rename_type {
RENAME_NORMAL = 0,
-   RENAME_DIR,
+   RENAME_VIA_DIR,
RENAME_DELETE,
RENAME_ONE_FILE_TO_ONE,
RENAME_ONE_FILE_TO_TWO,
@@ -1410,11 +1410,17 @@ static int merge_file_one(struct merge_options *o,
return merge_file_1(o, &one, &a, &b, path, branch1, branch2, mfi);
 }
 
-static int conflict_rename_dir(struct merge_options *o,
-  struct diff_filepair *pair,
-  const char *rename_branch,
-  const char *other_branch)
+static int conflict_rename_via_dir(struct merge_options *o,
+  struct diff_filepair *pair,
+  const char *rename_branch,
+  const char *other_branch)
 {
+   /*
+* Handle file adds that need to be renamed due to directory rename
+* detection.  This differs from handle_rename_normal, because
+* there is no content merge to do; just move the file into the
+* desired final location.
+*/
const struct diff_filespec *dest = pair->two;
 
if (!o->call_depth && would_lose_untracked(dest->path)) {
@@ -2692,7 +2698,7 @@ static int process_renames(struct merge_options *o,
 
if (oid_eq(&src_other.oid, &null_oid) &&
ren1->add_turned_into_rename) {
-   setup_rename_conflict_info(RENAME_DIR,
+   setup_rename_conflict_info(RENAME_VIA_DIR,
   ren1->pair,
   NULL,
   branch1,
@@ -3138,12 +3144,12 @@ static int process_entry(struct merge_options *o,
 b_oid, b_mode,
 conflict_info);
break;
-   case RENAME_DIR:
+   case RENAME_VIA_DIR:
clean_merge = 1;
-   if (conflict_rename_dir(o,
-   conflict_info->pair1,
-   conflict_info->branch1,
-   conflict_info->branch2))
+   if (conflict_rename_via_dir(o,
+   conflict_info->pair1,
+   conflict_info->branch1,
+   conflict_info->branch2))
clean_merge = -1;
break;
case RENAME_DELETE:
-- 
2.18.0.rc1.6.gffeb3ef585



[PATCH v3 1/6] merge-recursive: fix miscellaneous grammar error in comment

2018-06-09 Thread Elijah Newren
Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index ac27abbd4c..921f8e2d2d 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -539,7 +539,7 @@ static void record_df_conflict_files(struct merge_options *o,
 struct string_list *entries)
 {
/* If there is a D/F conflict and the file for such a conflict
-* currently exist in the working tree, we want to allow it to be
+* currently exists in the working tree, we want to allow it to be
 * removed to make room for the corresponding directory if needed.
 * The files underneath the directories of such D/F conflicts will
 * be processed before the corresponding file involved in the D/F
-- 
2.18.0.rc1.6.gffeb3ef585



[PATCH v3 5/6] merge-recursive: rename conflict_rename_*() family of functions

2018-06-09 Thread Elijah Newren
These functions were added because processing of these conflicts needed
to be deferred until process_entry() in order to get D/F conflicts and
such right.  The number of these has grown over time, and now include
some whose name is misleading:
  * conflict_rename_normal() is for handling normal file renames; a
typical rename may need content merging, but we expect conflicts
from that to be more the exception than the rule.
  * conflict_rename_via_dir() will not be a conflict; it was just an
add that turned into a move due to directory rename detection.
(If there was a file in the way of the move, that would have been
detected and reported earlier.)
  * conflict_rename_rename_2to1 and conflict_rename_add (the latter
of which doesn't exist yet but has been submitted before and I
intend to resend) technically might not be conflicts if the
colliding paths happen to match exactly.
Rename this family of functions to handle_rename_*().

Also rename handle_renames() to detect_and_process_renames() both to make
it clearer what it does, and to differentiate it as a pre-processing step
from all the handle_rename_*() functions which are called from
process_entry().

Acked-by: Johannes Schindelin 
Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 86 +++
 1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 910f0b70f0..7cf11dc04c 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -1410,10 +1410,10 @@ static int merge_file_one(struct merge_options *o,
return merge_file_1(o, &one, &a, &b, path, branch1, branch2, mfi);
 }
 
-static int conflict_rename_via_dir(struct merge_options *o,
-  struct diff_filepair *pair,
-  const char *rename_branch,
-  const char *other_branch)
+static int handle_rename_via_dir(struct merge_options *o,
+struct diff_filepair *pair,
+const char *rename_branch,
+const char *other_branch)
 {
/*
 * Handle file adds that need to be renamed due to directory rename
@@ -1515,10 +1515,10 @@ static int handle_change_delete(struct merge_options *o,
return ret;
 }
 
-static int conflict_rename_delete(struct merge_options *o,
- struct diff_filepair *pair,
- const char *rename_branch,
- const char *delete_branch)
+static int handle_rename_delete(struct merge_options *o,
+   struct diff_filepair *pair,
+   const char *rename_branch,
+   const char *delete_branch)
 {
const struct diff_filespec *orig = pair->one;
const struct diff_filespec *dest = pair->two;
@@ -1620,8 +1620,8 @@ static int handle_file(struct merge_options *o,
return ret;
 }
 
-static int conflict_rename_rename_1to2(struct merge_options *o,
-  struct rename_conflict_info *ci)
+static int handle_rename_rename_1to2(struct merge_options *o,
+struct rename_conflict_info *ci)
 {
/* One file was renamed in both branches, but to different names. */
struct diff_filespec *one = ci->pair1->one;
@@ -1682,8 +1682,8 @@ static int conflict_rename_rename_1to2(struct merge_options *o,
return 0;
 }
 
-static int conflict_rename_rename_2to1(struct merge_options *o,
-  struct rename_conflict_info *ci)
+static int handle_rename_rename_2to1(struct merge_options *o,
+struct rename_conflict_info *ci)
 {
/* Two files, a & b, were renamed to the same thing, c. */
struct diff_filespec *a = ci->pair1->one;
@@ -2425,7 +2425,7 @@ static void apply_directory_rename_modifications(struct merge_options *o,
 * "NOTE" in update_stages(), doing so will modify the current
 * in-memory index which will break calls to would_lose_untracked()
 * that we need to make.  Instead, we need to just make sure that
-* the various conflict_rename_*() functions update the index
+* the various handle_rename_*() functions update the index
 * explicitly rather than relying on unpack_trees() to have done it.
 */
get_tree_entry(&tree->object.oid,
@@ -2829,12 +2829,12 @@ static void initial_cleanup_rename(struct diff_queue_struct *pairs,
free(pairs);
 }
 
-static int handle_renames(struct merge_options *o,
- struct tree *common,
- struct tree *head,
- struct tree *merge,
- struct string_list *entries,
- struct rename_info *ri)
+static int detect_and_process_renames(struct 

[PATCH v3 2/6] merge-recursive: fix numerous argument alignment issues

2018-06-09 Thread Elijah Newren
Various refactorings throughout the code have left lots of alignment
issues that were driving me crazy; fix them.

Acked-by: Johannes Schindelin 
Signed-off-by: Elijah Newren 
---
 merge-recursive.c | 75 ---
 1 file changed, 38 insertions(+), 37 deletions(-)

diff --git a/merge-recursive.c b/merge-recursive.c
index 921f8e2d2d..09c249b93e 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -309,8 +309,8 @@ static void output_commit_title(struct merge_options *o, struct commit *commit)
 }
 
 static int add_cacheinfo(struct merge_options *o,
-   unsigned int mode, const struct object_id *oid,
-   const char *path, int stage, int refresh, int options)
+unsigned int mode, const struct object_id *oid,
+const char *path, int stage, int refresh, int options)
 {
struct cache_entry *ce;
int ret;
@@ -417,8 +417,8 @@ struct tree *write_tree_from_memory(struct merge_options *o)
 }
 
 static int save_files_dirs(const struct object_id *oid,
-   struct strbuf *base, const char *path,
-   unsigned int mode, int stage, void *context)
+  struct strbuf *base, const char *path,
+  unsigned int mode, int stage, void *context)
 {
struct path_hashmap_entry *entry;
int baselen = base->len;
@@ -913,7 +913,7 @@ static int make_room_for_path(struct merge_options *o, const char *path)
 */
if (would_lose_untracked(path))
return err(o, _("refusing to lose untracked file at '%s'"),
-path);
+  path);
 
/* Successful unlink is good.. */
if (!unlink(path))
@@ -992,7 +992,7 @@ static int update_file_flags(struct merge_options *o,
unlink(path);
if (symlink(lnk, path))
ret = err(o, _("failed to symlink '%s': %s"),
-   path, strerror(errno));
+ path, strerror(errno));
free(lnk);
} else
ret = err(o,
@@ -1090,7 +1090,7 @@ static int merge_3way(struct merge_options *o,
 }
 
 static int find_first_merges(struct object_array *result, const char *path,
-   struct commit *a, struct commit *b)
+struct commit *a, struct commit *b)
 {
int i, j;
struct object_array merges = OBJECT_ARRAY_INIT;
@@ -1108,7 +1108,7 @@ static int find_first_merges(struct object_array *result, const char *path,
 
/* get all revisions that merge commit a */
xsnprintf(merged_revision, sizeof(merged_revision), "^%s",
-   oid_to_hex(&a->object.oid));
+ oid_to_hex(&a->object.oid));
init_revisions(&rev_opts, NULL);
rev_opts.submodule = path;
/* FIXME: can't handle linked worktrees in submodules yet */
@@ -1250,12 +1250,12 @@ static int merge_submodule(struct merge_options *o,
output(o, 2, _("Found a possible merge resolution for the submodule:\n"));
print_commit((struct commit *) merges.objects[0].item);
output(o, 2, _(
-   "If this is correct simply add it to the index "
-   "for example\n"
-   "by using:\n\n"
-   "  git update-index --cacheinfo 160000 %s \"%s\"\n\n"
-   "which will accept this suggestion.\n"),
-   oid_to_hex(&merges.objects[0].item->oid), path);
+  "If this is correct simply add it to the index "
+  "for example\n"
+  "by using:\n\n"
+  "  git update-index --cacheinfo 160000 %s \"%s\"\n\n"
+  "which will accept this suggestion.\n"),
+  oid_to_hex(&merges.objects[0].item->oid), path);
break;
 
default:
@@ -1332,10 +1332,10 @@ static int merge_file_1(struct merge_options *o,
result->clean = (merge_status == 0);
} else if (S_ISGITLINK(a->mode)) {
result->clean = merge_submodule(o, &result->oid,
-  one->path,
-  &one->oid,
-  &a->oid,
-  &b->oid);
+   one->path,
+   &one->oid,
+   &a->oid,
+   &b->oid);
} else if (S_ISLNK(a->mode)) {
switch (o->recursive_variant) {
case MERGE_RECURSIVE_NORMAL:
@@ -1443,13 +1443,13 @@ static int 

Re: [RFC PATCH 0/2] contrib/credential/netrc Makefile & test improvements

2018-06-09 Thread Todd Zullinger
Hi Luis,

Luis Marsano wrote:
> Thanks for looking into this and addressing these issues.

And thank you for digging further. :)

> On Thu, Jun 7, 2018 at 1:20 AM Todd Zullinger  wrote:
>> I noticed failures from the contrib/credential/netrc tests
>> while building 2.18.0 release candidates.  I was surprised
>> to see the tests being run when called with a simple 'make'
>> command.
>>
>> The first patch in the series adds an empty 'all::' make
>> target to match most of our other Makefiles and avoid the
>> surprise of running tests by default.  (When the netrc
>> helper was added to the fedora builds, it copied the same
>> 'make -C contrib/credential/...' pattern from other
>> credential helpers -- despite the lack of anything to
>> build.)
> 
> I think this is a good idea.
> 
>> The actual test failures were initially due to my build
>> environment lacking the perl autodie module, which was added
>> in 786ef50a23 ("git-credential-netrc: accept gpg option",
>> 2018-05-12).
> 
> I added 'use autodie;' without realizing it had external dependencies.
> According to the documentation
> http://perldoc.perl.org/autodie.html
> it's a pragma since perl 5.10.1
> Removing 'use autodie;' should be fine: it's not critical.

I should clarify that part of why autodie isn't in my build
environment is that the Fedora and RHEL7+ perl packages
split out many modules which are shipped as part of the core
perl tarball.  So while all the platforms I care about have
perl >= 5.10.1, the Fedora and newer RHEL systems have the
autodie module in a separate package.

That said, the INSTALL docs still suggest that we only
require perl >= 5.8, so if autodie is easily removed, that
would probably be a good plan.

Ævar brought up bumping the minimum supported perl to 5.10.0
last December, in <20171223174400.26668-1-ava...@gmail.com>
(https://public-inbox.org/git/20171223174400.26668-1-ava...@gmail.com/).
The general consensus seemed positive, but I don't think
it's happened.  Even so, that was 5.10.0, not the 5.10.1
which added autodie.

>> After installing the autodie module, the failures were due
>> to the build environment lacking a git install (specifically
>> the perl Git module).  The tests needing a pre-installed
>> perl Git seemed odd and worth fixing.
> 
> I mistakenly thought 'use lib (split(/:/, $ENV{GITPERLLIB}));' in
> test.pl handled this.
> Since it doesn't, and I was only following an example from
> t/t9700/test.pl that doesn't fit, this line should be removed and it
> might make more sense to set the environment from
> t-git-credential-netrc.sh near '. ./test-lib.sh', which also sets the
> environment.
> Something like
> 
> diff --git a/contrib/credential/netrc/t-git-credential-netrc.sh b/contrib/credential/netrc/t-git-credential-netrc.sh
> index 58191a6..9e18611 100755
> --- a/contrib/credential/netrc/t-git-credential-netrc.sh
> +++ b/contrib/credential/netrc/t-git-credential-netrc.sh
> @@ -23,5 +23,6 @@
> # The external test will outputs its own plan
> test_external_has_tap=1
> 
> +   export PERL5LIB="$GITPERLLIB"
> test_external \
>  'git-credential-netrc' \
> 
> Your solution, however, is reasonable, and I don't know which is preferred.

I think your placement is better.  As you say, it could also
be placed closer to '. ./test-lib.sh'.

It doesn't come up very often, but I wonder if there's any
downside to having test-lib.sh export PERL5LIB?

> It looks like you found an issue with t/t9700/test.pl, too.
> When altered to fail, it first reports ok (then reports failed and
> returns non-0).
> 
> not ok 46 - unquote simple quoted path
> not ok 47 - unquote escape sequences
> 1..47
> # test_external test Perl API was ok
> # test_external_without_stderr test no stderr: Perl API failed: perl /home/luism/project/git/t/t9700/test.pl:
> $ echo $?
> 1

Oops.  Nice catch.  At least that does exit non-zero I
guess.

> To make git-credential-netrc tests behave correctly, I ended up making
> the changes below.
> They might be okay, unless someone knows better.
> 
> diff --git a/contrib/credential/netrc/t-git-credential-netrc.sh b/contrib/credential/netrc/t-git-credential-netrc.sh
> index 58191a6..9e18611 100755
> --- a/contrib/credential/netrc/t-git-credential-netrc.sh
> +++ b/contrib/credential/netrc/t-git-credential-netrc.sh
> @@ -23,9 +23,10 @@
> # The external test will outputs its own plan
> test_external_has_tap=1
> 
> +   export PERL5LIB="$GITPERLLIB"
> test_external \
>  'git-credential-netrc' \
> -perl "$TEST_DIRECTORY"/../contrib/credential/netrc/test.pl
> +perl "$GIT_BUILD_DIR"/contrib/credential/netrc/test.pl
> 
> test_done
>  )

This reminds me, while we're here it might be worth adding a
whitespace cleanup commit to indent the lines following
test_expect_success and test_external with 2 tabs.

> diff --git a/contrib/credential/netrc/test.pl b/contrib/credential/netrc/test.pl
> index 1e10010..abc9081 100755
> --- 

Re: GDPR compliance best practices?

2018-06-09 Thread Theodore Y. Ts'o
On Sat, Jun 09, 2018 at 11:50:32PM +0100, Philip Oakley wrote:
> I just want to remind folks that Gmane disappeared as a regular list because
> of a legal challenge, the SCO v IBM Unix court case keeps rumbling on, so
> clarifying the legal case for:
> a) holding the 'personal git meta data', and
> b) disclosing (publishing) 'personal git meta data'
> under various copyright and other legal issue scenarios relative to GDPR
> is worthwhile.

And I suspect the best way of clarifying things is for laywers at the
major corporations (e.g., Red Hat, Microsoft now that it owns github,
Google since it publishes Android sources at sources.android.com,
Canonical, etc.) to figure it out.

Those situations may very well differ depend on whether they have a
CLA or Copyright Assignment Agreement which they require of
contributors.  But fortunately, those organizations are also best set
up to send patches.   :-)

If those organizations are not choosing to send patches, I suspect
that might be a strong hint as to what those lawyers have concluded.

- Ted


Re: [RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default

2018-06-09 Thread Elijah Newren
Hi Dscho,

On Sat, Jun 9, 2018 at 3:11 PM, Johannes Schindelin
 wrote:
> On Thu, 7 Jun 2018, Elijah Newren wrote:
>
>> am-based rebases suffer from an reduced ability to detect directory
>> renames upstream, which is fundamental to the fact that it throws away
>> information about the history: in particular, it dispenses with the
>> original commits involved by turning them into patches, and without the
>> knowledge of the original commits we cannot determine a proper merge base.
>>
>> The am-based rebase will proceed by first trying git-apply, and, only if
>> it fails, will try to reconstruct a provisional merge base using
>> build_fake_ancestor().  This fake ancestor will ONLY contain the files
>> modified in the patch; without the full list of files in the real merge
>> base, renames of any missing files cannot be detected.  Directory rename
>> detection works by looking at individual file renames and deducing when a
>> full directory has been renamed.
>>
>> Trying to modify build_fake_ancestor() to instead create a merge base that
>> includes common file information by looking for a commit that contained
>> all the same blobs as the pre-image of the patch would require a very
>> expensive search through history, may find the wrong commit, and outside
>> of rebase may not find any commit that matches.
>>
>> But we had all the relevant information to start.  So, instead of
>> attempting to fix am which just doesn't have the relevant information,
>> instead note its strength and weaknesses, and change the default rebase
>> machinery to interactive since it does not suffer from the same problems.
>
> I'll let Eric comment on the grammar, and I'll comment on the idea behind
> this commit instead.

Going to dump the hard job on Eric, eh?  ;-)

> IMHO `git rebase -i` is still too slow to be a true replacement for `git
> rebase --am` for the cases where it serves the user well. Maybe we should
> work on making `rebase -i` faster, first?

That sounds fair.

> I imagine, for example, that it might make *tons* of sense to avoid
> writing out the index and worktree files all the time. That was necessary
in the shell script version because of the ridiculous limitations we
> subjected ourselves to, such as: no object-oriented state worth
> mentioning, only string-based processing, etc. But we could now start to
> do everything in memory (*maybe* write out the new blob/tree/commit
> objects immediately, but maybe not) until the time when we either have
> succeeded in the rebase, or when there was a problem and we have to exit
> with an error. And only then write the files and the index.

Hmm...are you still planning on using cherry-pick (internally rather
than forking, of course)?  Because cherry-pick uses the
merge-recursive machinery, and the merge-recursive machinery doesn't
have a nice way of avoiding writing to the working tree or index.
Fixing that is on my radar; see the first block of
https://public-inbox.org/git/cabpp-bg2fzhm3s-yrzxygj3eh+o7_lhlz5pgsthhg2drigs...@mail.gmail.com/
(reading up until "At this point, I'd rather just fix the design flaw
rather than complicate the code further.")

However, also covered in my plans is a few things to speed up the
merge-recursive machinery, which should provide some other performance
benefits for interactive rebases.

> In any case, I think that the rather noticeable change of the default
> would probably necessitate a deprecation phase.

Why is it a "rather noticeable change of the default"?  If we popped up
the editor and asked the user to edit the list of options, I'd agree,
or if folks thought that it was significantly slower by a big enough
margin (though you already suggested waiting and making sure we don't
do that).  What else remains that qualifies?

(Okay, the default behavior to just skip empty patches rather than
halt the rebase and ask the user to advise is different, but we could
fix that up too.  Is there anything else?)


Re: [RFC PATCH 2/3] rebase: Implement --merge via git-rebase--interactive

2018-06-09 Thread Elijah Newren
Hi Dscho,

On Sat, Jun 9, 2018 at 3:04 PM, Johannes Schindelin
 wrote:
> On Thu, 7 Jun 2018, Elijah Newren wrote:
>
>> diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
>> index 451252c173..28d1658d7a 100644
>> --- a/Documentation/git-rebase.txt
>> +++ b/Documentation/git-rebase.txt
>> @@ -275,6 +275,10 @@ branch on top of the  branch.  Because of 
>> this, when a merge
>>  conflict happens, the side reported as 'ours' is the so-far rebased
>>  series, starting with , and 'theirs' is the working branch.  In
>>  other words, the sides are swapped.
>> ++
>> +This uses the `--interactive` machinery internally, and as such,
>> +anything that is incompatible with --interactive is incompatible
>> +with this option.
>
> I am not sure I like this change, as it describes an implementation detail
> that users should neither know, nor care about, nor rely on.

Let me back up for just a second to see if I can point out the real
problem I'm trying to address here, which you may have a better
solution for.  What should happen if a user runs
   git rebase --merge --whitespace=fix
?

git currently silently ignores the --whitespace=fix argument, leaving
the whitespace damage present at the end of the rebase.  Same goes for
--interactive combined with any am-specific options  (Fix proposed at
https://public-inbox.org/git/20180607050654.19663-2-new...@gmail.com/).
This silent ignoring of some options depending on which other options
were specified has caused me problems in the past.

So, while I totally agree with you that users shouldn't need to know
implementation details, they do need to know how to use commands and
which options go well together and which are mutually incompatible.
Do you have any suggestions on alternate wording that would convey the
sets of mutually incompatible options without talking about
implementation details?  Would you rather only address that in the
code and not mention it in the documentation?

See also https://public-inbox.org/git/20180607050654.19663-1-new...@gmail.com/,
which proposes testcases for these incompatibility sets.


>> diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
>> index 1f2401f702..dcc4a26a78 100644
>> --- a/git-rebase--interactive.sh
>> +++ b/git-rebase--interactive.sh
>> @@ -885,7 +885,10 @@ init_basic_state () {
>>   mkdir -p "$state_dir" || die "$(eval_gettext "Could not create 
>> temporary \$state_dir")"
>>   rm -f "$(git rev-parse --git-path REBASE_HEAD)"
>>
>> - : > "$state_dir"/interactive || die "$(gettext "Could not mark as 
>> interactive")"
>> + if test -n "$actually_interactive"
>> + then
>> + : > "$state_dir"/interactive || die "$(gettext "Could not mark 
>> as interactive")"
>> + fi
>
> Do we really care at this stage? IOW what breaks if we still write that
> file, even in --merge mode?

Two things will change if we do that:
  * bash prompt will be different for those using git-prompt.sh
(REBASE-m vs. REBASE-i).
  * git-status output is no longer the same ('rebase in progress' vs.
'interactive rebase in progress...last command done:...pick 0dea123 my
wonderful commit')

I don't think the prompt is a big deal.  The status output might not
be either, but showing the "last command done" may be weird to someone
that never saw or edited a list of commands.  (Then again, that same
argument could be made for users of --exec, --rebase-merges, or
--keep-empty without an explicit --interactive)

Opinions on whether these two difference matter?  If others don't
think these differences are significant, I'm happy to update any
necessary testcases and just unconditionally create the
$state_dir/interactive file.


>> @@ -501,17 +502,11 @@ fi
>>
>>  if test -n "$git_am_opt"; then
>>   incompatible_opts=`echo "$git_am_opt" | sed -e 's/ -q//'`
>> - if test -n "$interactive_rebase"
>> + if test -n "$incompatible_opts"
>
> Did you not mean to turn this into a test for actually_interactve ||
> do_merge?
>
>>   then
>> - if test -n "$incompatible_opts"
>> + if test -n "$actually_interactive" || test "$do_merge"
>
> This could now be combined with the previous if (and the `-n` could be
> added to the latter test):
>
> if test -n "$actually_interactive" -o -n "$do_merge" &&
> test -n "$incompatible_opts"
>
> The indentation would change, but this hunk is already confusing to read,
> anyway, so...

I'm happy to switch the order of the nesting as you suggest and agree
that it would make it easier to read.  However, I hesitate to combine
the two if-lines.  When I read your combined suggested line, I parsed
it as follows (using invalid pseudo-syntax just to convey grouping):

  ( -n "$actually_interactive ) || ( -n "$do_merge" && -n "$incompatible_opts" )

Granted, I parsed it wrong, putting the parentheses in the wrong
places, and bash parses it as you intended.  While you may have
precedence and left- vs. right- associativity rules down pat, 
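For the record, the grouping trap here is that shell `&&` and `||` between
commands have equal precedence and associate left to right, unlike C. A
self-contained demonstration:

```shell
# 'cmd1 || cmd2 && cmd3' groups as '(cmd1 || cmd2) && cmd3',
# not 'cmd1 || (cmd2 && cmd3)' as C intuition might suggest.
a=t b=t c=

if test -n "$a" || test -n "$b" && test -n "$c"
then
	echo "taken"
else
	echo "not taken"
fi
# Prints "not taken": (a || b) succeeds, but '&& c' then fails.
# Under C-style grouping, a || (b && c), 'a' alone would have made it "taken".
```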

test

2018-06-09 Thread Qingyun
Sorry to intrude, but I can't send a patch to the maillist using qq.com SMTP 
server.

can not send mail

2018-06-09 Thread 陶青云
Sorry to intrude, but I can't send a patch to the maillist using
qq.com and 163.com SMTP server.


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread brian m. carlson
On Sat, Jun 09, 2018 at 08:56:00AM +0200, Johannes Sixt wrote:
> Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> > 
> > On Fri, Jun 08 2018, Johannes Sixt wrote:
> > Can you elaborate on how someone who can maintain inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Of course they can. But would we, the Git community do that?
> 
> From the design document:
> 
> > The goal of the telemetry feature is to be able to gather usage data
> > across a group of production users to identify real-world performance
> > problems in production.  Additionally, it might help identify common
> > user errors and guide future user training.
> 
> The goal to gather usage data may be valid for a small subset of Git
> installations. But it is wrong to put this into the software itself, in
particular when the implementation includes scary things like loading
> unspecified dynamic libraries:
> 
> > If the config setting "telemetry.plugin" contains the pathname to a
> > shared library, the library will be dynamically loaded during start up
> > and events will be sent to it using the plugin API.
> 
> When you want usage data, ask your users for feedback. Look over their
> shoulders. But do not ask the software itself to gather usage data. It will
> be abused.
> 
> Do not offer open source software that has a "call-home" method built-in.
> 
> If you want to peek into the workplaces of YOUR users, then monkey-patch
> surveillance into YOUR version of Git. But please do not burden the rest of
> us.

I understand there's an interest in supporting the most people with the
smallest number of staff.  I'm certainly in the situation where I, with
only minimal assistance, support every Git user in my division of the
company, regardless of technical ability, and I know how overwhelming
that can be.  (Burnout, I can tell you, is a thing.)

I also have to look at this issue from the interests of what is best for
the FLOSS community and for users as a whole.  Adding in functionality
that sends off usage data from a command-line tool, especially one that
is as widely used as Git is, is not in the interests of users as a
whole, nor is it common practice in FLOSS tools.

As a highly capable and technical user, I would find it very undesirable
to have my development tools reporting data like this, even if it is to
make my experience better.

The ability to load arbitrary libraries makes me concerned about people
using this to spirit away personal or company data or to subtly steal
data in a rootkit-like situation.  These are real threats in the kinds
of environments I distribute to in my work role.

I agree with Duy's point of view that GIT_TRACE-level output to a file
descriptor or file is fine, but a persistently enabled feature is not.

I expect this feature, if implemented, would be patched out of Debian's
Git, and it would be patched out of any Git I would distribute in my
work role for legal and ethical reasons.

As developers, we have a duty to be mindful of how our software can be
misused and abused and try to avoid that when possible.  I don't think
this feature is on the right side of that balance.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


signature.asc
Description: PGP signature


Re: [RFC PATCH 1/3] git-rebase, sequencer: add a --quiet option for the interactive machinery

2018-06-09 Thread Elijah Newren
Hi Dscho,

On Sat, Jun 9, 2018 at 2:45 PM, Johannes Schindelin
 wrote:
> On Thu, 7 Jun 2018, Elijah Newren wrote:
>
>> While 'quiet' and 'interactive' may sound like antonyms, the interactive
>> machinery actually has logic that implements several
>> interactive_rebase=implied cases (--exec, --keep-empty, --rebase-merges)
>> which won't pop up an editor.  Further, we want to make the interactive
>> machinery also take over for git-rebase--merge and become the default
>> merge strategy, so it makes sense for these other cases to have a quiet
>> option.
>>
>> git-rebase--interactive was already somewhat quieter than
>> git-rebase--merge and git-rebase--am, possibly because cherry-pick has
>> just traditionally been quieter.  As such, we only drop a few
>> informational messages -- "Rebasing (n/m)" and "Successfully rebased..."
>
> Makes sense. As long as it is coordinated with Alban and Pratik, as both
> of their GSoC projects are affected by this.

I had Alban cc'ed, and had looked briefly at the GSoC projects but
somehow missed that Pratik was also working in the area.  Adding him
to cc.

> In particular Pratik's project, I think, would actually *benefit* from
> your work, as it might even make it possible to turn all modes but
> --preserve-merges into pure builtin code, which would be awesome.

:-)

>> @@ -713,6 +717,7 @@ Commit or stash your changes, and then run
>>   "$hook" rebase < "$rewritten_list"
>>   true # we don't care if this hook failed
>>   fi &&
>> + test -z "$GIT_QUIET" &&
>> warn "$(eval_gettext "Successfully rebased and updated \$head_name.")"
>
> In general, I tried to have the statements return success at all times. That
> means that
>
> test -n "$GIT_QUIET" ||
>
> would be better in this case.

Good point, I'll switch it over.

>> diff --git a/git-rebase.sh b/git-rebase.sh
>> index 7d1612b31b..b639c0d4fe 100755
>> --- a/git-rebase.sh
>> +++ b/git-rebase.sh
>> @@ -136,7 +136,7 @@ write_basic_state () {
>>   echo "$head_name" > "$state_dir"/head-name &&
>>   echo "$onto" > "$state_dir"/onto &&
>>   echo "$orig_head" > "$state_dir"/orig-head &&
>> - echo "$GIT_QUIET" > "$state_dir"/quiet &&
>> + test t = "$GIT_QUIET" && : > "$state_dir"/quiet
>
> Maybe it would be better to `echo t` into that file? That way, scripts
> that used the value in that file would continue to work. (But maybe there
> were no scripts that could use it, as only the interactive rebase allows
> scripting, and it did not handle that flag before?)

Right, I don't think we had users before, and I'd rather make the code
a little more self-consistent.  In particular, since
$state_dir/verbose work based off the presence of the file rather than
the contents, I'd rather we just did the same for $state_dir/quiet.

However, there is one bug here; in order to make it like verbose, I
need to also make the following change:

diff --git a/git-rebase.sh b/git-rebase.sh
index c8c3d0d05a..8f0c7a4738 100755
--- a/git-rebase.sh
+++ b/git-rebase.sh
@@ -119,7 +119,7 @@ read_basic_state () {
else
orig_head=$(cat "$state_dir"/head)
fi &&
-   GIT_QUIET=$(cat "$state_dir"/quiet) &&
+   test -f "$state_dir"/quiet && GIT_QUIET=t
test -f "$state_dir"/verbose && verbose=t
test -f "$state_dir"/strategy && strategy="$(cat "$state_dir"/strategy)"
test -f "$state_dir"/strategy_opts &&
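With that fix, both the write side and the read side key the flag purely
off the file's presence; a minimal standalone sketch of the pattern:

```shell
# Presence-based flag, as used for $state_dir/verbose and (after the fix
# above) $state_dir/quiet: the flag is on iff the file exists.
state_dir=$(mktemp -d)

# Write side: create the file only when the flag is set.
GIT_QUIET=t
test t = "$GIT_QUIET" && : > "$state_dir"/quiet

# Read side: existence is all that matters; contents are ignored.
GIT_QUIET=
test -f "$state_dir"/quiet && GIT_QUIET=t
echo "GIT_QUIET=$GIT_QUIET"    # GIT_QUIET=t
```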


> The rest looks obviously good.


Thanks!
Elijah


Re: GDPR compliance best practices?

2018-06-09 Thread Philip Oakley

From: "Theodore Y. Ts'o" 
Sent: Friday, June 08, 2018 3:53 AM

On Fri, Jun 08, 2018 at 01:21:29AM +0200, Peter Backes wrote:

On Thu, Jun 07, 2018 at 03:38:49PM -0700, David Lang wrote:
> > Again: The GDPR certainly allows you to keep a proof of copyright
> > privately if you have it. However, it does not allow you to keep
> > publishing it if someone exercises his right to be forgotten.
> someone is granting the world the right to use the code and you are
> claiming
> that the evidence that they have granted this right is illegal to have?

Hell no! Please read what I wrote:

- "allows you to keep a proof ... privately"
- "However, it does not allow you to keep publishing it"


The problem is you've left undefined who is "you"?  With an open
source project, anyone who has contributed to open source project has
a copyright interest.  That hobbyist in German who submitted a patch?
They have a copyright interest.  That US Company based in Redmond,
Washington?  They own a copyright interest.  Huawei in China?  They
have a copyright interest.

So there is no "privately".  And "you" numbers in the thousands and
thousands of copyright holders of portions of the open source code.

And of course, that's the other thing you seem to fundamentally not
understand about how git works.  Every developer in the world working
on that open source project has their own copy.  There is
fundamentally no way that you can expunge that information from every
single git repository in the world.  You can remove a git note from a
single repository.  But that doesn't affect my copy of the repository
on my laptop.  And if I push that repository to my server, the git note
will be out there for the whole world to see.

So someone could *try* sending a public request to the entire world,
saying, "I am a European and I demand that you disassociate commit
DEADBEF12345 from my name".  They could try serving legal papers on
everyone.  But at this point, it's going to trigger something called
the "Streisand Effect".  If you haven't heard of it, I suggest you
look it up:

http://mentalfloss.com/article/67299/how-barbra-streisand-inspired-streisand-effect

Regards,

- Ted


Hi Ted,

I just want to remind folks that Gmane disappeared as a regular list because
of a legal challenge, the SCO v IBM Unix court case keeps rumbling on, so
clarifying the legal case for:
a) holding the 'personal git meta data', and
b) disclosing (publishing) 'personal git meta data'
under various copyright and other legal issue scenarios relative to GDPR is
worth clarifying.

I'm of the opinion that the GPL should be able to allow both holding and
disclosing that data, though it may need a few more clarifications as to
verifying that the author is 'correct' (e.g. not a child) and if a DCO is
needed, etc.

We are already looking at a change to the hash, so the technical challenge
could be addressed, but may create too many logical conflicts if 'right to
be forgotten' is allowed (one hash change is enough;-)

Philip



Hash algorithm analysis

2018-06-09 Thread brian m. carlson
== Discussion of Candidates

I've implemented and tested the following algorithms, all of which are
256-bit (in alphabetical order):

* BLAKE2b (libb2)
* BLAKE2bp (libb2)
* KangarooTwelve (imported from the Keccak Code Package)
* SHA-256 (OpenSSL)
* SHA-512/256 (OpenSSL)
* SHA3-256 (OpenSSL)
* SHAKE128 (OpenSSL)

I also rejected some other candidates.  I couldn't find any reference or
implementation of SHA256×16, so I didn't implement it.  I didn't
consider SHAKE256 because it is nearly identical to SHA3-256 in almost
all characteristics (including performance).

I imported the optimized 64-bit implementation of KangarooTwelve.  The
AVX2 implementation was not considered for licensing reasons (it's
partially generated from external code, which falls foul of the GPL's
"preferred form for modifications" rule).
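For anyone who wants to reproduce digests for some of these candidates from
the command line, a quick sketch (the openssl invocations assume OpenSSL
1.1.1 or newer; KangarooTwelve has no common CLI at all):

```shell
# 256-bit digests of the same input via common CLI tools.
# sha256sum is in coreutils; the rest need a recent OpenSSL.
printf '%s' abc | sha256sum                  # 64 hex chars = 256 bits
printf '%s' abc | openssl dgst -sha512-256   # needs OpenSSL >= 1.1.1
printf '%s' abc | openssl dgst -sha3-256     # needs OpenSSL >= 1.1.1
printf '%s' abc | openssl dgst -blake2b512   # 512-bit only; libb2 for 256-bit
```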

=== BLAKE2b and BLAKE2bp

These are the non-parallelized and parallelized 64-bit variants of
BLAKE2.

Benefits:
* Both algorithms provide 256-bit preimage resistance.

Downsides:
* Some people are uncomfortable that the security margin has been
  decreased from the original SHA-3 submission, although it is still
  considered secure.
* BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
  multithreading) by default.  It was no longer possible to run the
  testsuite with -j3 on my laptop in this configuration.

=== Keccak-based Algorithms

SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136
bytes at a time.  SHAKE128 is an extendable output function with 24
rounds, processing 168 bytes at a time.  KangarooTwelve is an extendable
output function with 12 rounds, processing 136 bytes at a time.

Benefits:
* SHA3-256 provides 256-bit preimage resistance.
* SHA3-256 has been heavily studied and is believed to have a large
  security margin.

I noted the following downsides:
* There's a lack of availability of KangarooTwelve in other
  implementations.  It may be the least available option in terms of
  implementations.
* Some people are uncomfortable that the security margin of
  KangarooTwelve has been decreased, although it is still considered
  secure.
* SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance.

=== SHA-256 and SHA-512/256

These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in
size.

I noted the following benefits:
* Both algorithms are well known and heavily analyzed.
* Both algorithms provide 256-bit preimage resistance.

== Implementation Support

|===
| Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle | CL  |
| SHA-1          | ✓       |       | ✓   | ✓   | ✓      | ✓      | {1} |
| BLAKE2b        | f       | ✓     |     |     | ✓      |        | {2} |
| BLAKE2bp       |         | ✓     |     |     |        |        |     |
| KangarooTwelve |         |       |     |     |        |        |     |
| SHA-256        | ✓       |       | ✓   | ✓   | ✓      | ✓      | {1} |
| SHA-512/256    | ✓       |       |     |     |        | ✓      | {3} |
| SHA3-256       | ✓       |       |     |     | ✓      | ✓      | {4} |
| SHAKE128       | ✓       |       |     |     | ✓      |        | {5} |
|===

f: future version (expected 1.2.0)
ACC: Apple Common Crypto
CL: Command-line

:1: OpenSSL, coreutils, Perl Digest::SHA.
:2: OpenSSL, coreutils.
:3: OpenSSL
:4: OpenSSL, Perl Digest::SHA3.
:5: Perl Digest::SHA3.

=== Performance Analysis

The test system used below is my personal laptop, a 2016 Lenovo ThinkPad
X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian unstable.

I implemented a test tool helper to compute speed much like OpenSSL
does.  Below is a comparison of speeds.  The columns indicate the speed
in KiB/s for chunks of the given size.  The runs are representative of
multiple similar runs.

256 and 1024 bytes were chosen to represent common tree and commit
object sizes and the 8 KiB is an approximate average blob size.

Algorithms are sorted by performance on the 1 KiB column.
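The helper itself lives in the test suite as C code; purely to illustrate
the KiB/s bookkeeping (not the real measurement, since process startup
dominates when hashing through a CLI), a rough shell sketch:

```shell
# Hash n chunks of chunk_size bytes through sha256sum and report KiB/s.
# Illustrative only: fork/exec overhead swamps the hash time here.
chunk_size=1024
n=200
start=$(date +%s)
i=0
while [ $i -lt $n ]
do
	head -c "$chunk_size" /dev/zero | sha256sum > /dev/null
	i=$((i + 1))
done
end=$(date +%s)
elapsed=$((end - start))
[ "$elapsed" -eq 0 ] && elapsed=1   # avoid divide-by-zero on fast runs
echo "$((n * chunk_size / 1024 / elapsed)) KiB/s"
```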

|===
| Implementation | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
| SHA-1 (OpenSSL)| 513963 | 685966 | 748993 | 754270 |
| BLAKE2b (libb2)| 488123 | 552839 | 576246 | 579292 |
| SHA-512/256 (OpenSSL)  | 181177 | 349002 | 499113 | 495169 |
| BLAKE2bp (libb2)   | 139891 | 344786 | 488390 | 522575 |
| SHA-256 (OpenSSL)  | 264276 | 333560 | 357830 | 355761 |
| KangarooTwelve | 239305 | 307300 | 355257 | 364261 |
| SHAKE128 (OpenSSL) | 154775 | 253344 | 337811 | 346732 |
| SHA3-256 (OpenSSL) | 128597 | 185381 | 198931 | 207365 |
| BLAKE2bp (libb2; threaded) |  12223 |  49306 | 132833 | 179616 |
|===

SUPERCOP (a crypto benchmarking tool;
https://bench.cr.yp.to/results-hash.html) has also benchmarked these
algorithms.  Note that BLAKE2bp is not listed, KangarooTwelve is k12,
SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and SHAKE128 is
keccakc256.

Information is for kizomba, a Kaby Lake system.  Counts are in cycles
per byte (smaller is better; sorted by 

Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Sixt

Am 09.06.2018 um 22:43 schrieb Johannes Schindelin:

On Sat, 9 Jun 2018, Johannes Sixt wrote:

When you want usage data, ask your users for feedback. Look over their
shoulders. But do not ask the software itself to gather usage data. It will be
abused.

Do not offer open source software that has a "call-home" method built-in.

If you want to peek into the workplaces of YOUR users, then monkey-patch
surveillance into YOUR version of Git. But please do not burden the rest of
us.


We already offer hooks. You can do anything with those hooks. Even, if you
do not pay close attention, to transfer all your bitcoin to a specific
account.

I agree with Peff: this is something you as a user need to be aware of,
and need to make sure you configure your Git just like you want. As long
as this is a purely opt-in feature, it is useful and helpful.


The problem with this feature is not so much that it enables someone to 
do bad things, but that it is specifically targeted at recording *how 
users use Git*.



We do need it in-house, for the many thousands of Git users we try to
support with a relatively small team of Git developers.


Then please build your private version and sell it to your thousands of 
Git users. "in-house" == Microsoft? Small team of Git developers? So, 
instead of shelling out the bucks for this extra burden, they put the 
burden on the Community?


-- Hannes


Re: [RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default

2018-06-09 Thread Johannes Schindelin
Hi Elijah,

On Thu, 7 Jun 2018, Elijah Newren wrote:

> am-based rebases suffer from a reduced ability to detect directory
> renames upstream, which is fundamental to the fact that it throws away
> information about the history: in particular, it dispenses with the
> original commits involved by turning them into patches, and without the
> knowledge of the original commits we cannot determine a proper merge base.
> 
> The am-based rebase will proceed by first trying git-apply, and, only if
> it fails, will try to reconstruct a provisional merge base using
> build_fake_ancestor().  This fake ancestor will ONLY contain the files
> modified in the patch; without the full list of files in the real merge
> base, renames of any missing files cannot be detected.  Directory rename
> detection works by looking at individual file renames and deducing when a
> full directory has been renamed.
> 
> Trying to modify build_fake_ancestor() to instead create a merge base that
> includes common file information by looking for a commit that contained
> all the same blobs as the pre-image of the patch would require a very
> expensive search through history, may find the wrong commit, and outside
> of rebase may not find any commit that matches.
> 
> But we had all the relevant information to start.  So, instead of
> attempting to fix am which just doesn't have the relevant information,
> instead note its strength and weaknesses, and change the default rebase
> machinery to interactive since it does not suffer from the same problems.

I'll let Eric comment on the grammar, and I'll comment on the idea behind
this commit instead.

IMHO `git rebase -i` is still too slow to be a true replacement for `git
rebase --am` for the cases where it serves the user well. Maybe we should
work on making `rebase -i` faster, first?

I imagine, for example, that it might make *tons* of sense to avoid
writing out the index and worktree files all the time. That was necessary
in the shell script version because of the ridiculous limitations we
subjected ourselves to, such as: no object-oriented state worth
mentioning, only string-based processing, etc. But we could now start to
do everything in memory (*maybe* write out the new blob/tree/commit
objects immediately, but maybe not) until the time when we either have
succeeded in the rebase, or when there was a problem and we have to exit
with an error. And only then write the files and the index.

In any case, I think that the rather noticeable change of the default
would probably necessitate a deprecation phase.

Ciao,
Dscho


Re: [RFC PATCH 2/3] rebase: Implement --merge via git-rebase--interactive

2018-06-09 Thread Johannes Schindelin
Hi Elijah,

On Thu, 7 Jun 2018, Elijah Newren wrote:

> diff --git a/Documentation/git-rebase.txt b/Documentation/git-rebase.txt
> index 451252c173..28d1658d7a 100644
> --- a/Documentation/git-rebase.txt
> +++ b/Documentation/git-rebase.txt
> @@ -275,6 +275,10 @@ branch on top of the  branch.  Because of this, when a merge
>  conflict happens, the side reported as 'ours' is the so-far rebased
>  series, starting with , and 'theirs' is the working branch.  In
>  other words, the sides are swapped.
> ++
> +This uses the `--interactive` machinery internally, and as such,
> +anything that is incompatible with --interactive is incompatible
> +with this option.

I am not sure I like this change, as it describes an implementation detail
that users should neither know, nor care about, nor rely on.

> @@ -328,8 +332,8 @@ which makes little sense.
>   and after each change.  When fewer lines of surrounding
>   context exist they all must match.  By default no context is
>   ever ignored.
> - Incompatible with the --merge and --interactive options, or
> - anything that implies those options or their machinery.
> + Incompatible with the --interactive option, or anything that
> + uses the `--interactive` machinery.
>  
>  -f::
>  --force-rebase::
> @@ -361,15 +365,15 @@ default is `--no-fork-point`, otherwise the default is `--fork-point`.
>  --whitespace=::
>   These flag are passed to the 'git apply' program
>   (see linkgit:git-apply[1]) that applies the patch.
> - Incompatible with the --merge and --interactive options, or
> - anything that implies those options or their machinery.
> + Incompatible with the --interactive option, or anything that
> + uses the `--interactive` machinery.
>  
>  --committer-date-is-author-date::
>  --ignore-date::
>   These flags are passed to 'git am' to easily change the dates
>   of the rebased commits (see linkgit:git-am[1]).
> - Incompatible with the --merge and --interactive options, or
> - anything that implies those options or their machinery.
> + Incompatible with the --interactive option, or anything that
> + uses the `--interactive` machinery.
>  
>  --signoff::
>   Add a Signed-off-by: trailer to all the rebased commits. Note

For above-mentioned reason, I'd suggest dropping these three hunks.

> diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
> index 1f2401f702..dcc4a26a78 100644
> --- a/git-rebase--interactive.sh
> +++ b/git-rebase--interactive.sh
> @@ -885,7 +885,10 @@ init_basic_state () {
>   mkdir -p "$state_dir" || die "$(eval_gettext "Could not create temporary \$state_dir")"
>   rm -f "$(git rev-parse --git-path REBASE_HEAD)"
>  
> - : > "$state_dir"/interactive || die "$(gettext "Could not mark as interactive")"
> + if test -n "$actually_interactive"
> + then
> + : > "$state_dir"/interactive || die "$(gettext "Could not mark as interactive")"
> + fi

Do we really care at this stage? IOW what breaks if we still write that
file, even in --merge mode?

> diff --git a/git-rebase.sh b/git-rebase.sh
> index b639c0d4fe..5ac1dee30b 100755
> --- a/git-rebase.sh
> +++ b/git-rebase.sh
> @@ -239,12 +239,10 @@ then
>   state_dir="$apply_dir"
>  elif test -d "$merge_dir"
>  then
> + type=interactive
>   if test -f "$merge_dir"/interactive
>   then
> - type=interactive
>   interactive_rebase=explicit
> - else
> - type=merge
>   fi
>   state_dir="$merge_dir"
>  fi

This makes me think even more that we don't care...

> @@ -481,13 +479,16 @@ then
>   test -z "$interactive_rebase" && interactive_rebase=implied
>  fi
>  
> +actually_interactive=
>  if test -n "$interactive_rebase"
>  then
>   type=interactive
> + actually_interactive=t
>   state_dir="$merge_dir"
>  elif test -n "$do_merge"
>  then
> - type=merge
> + interactive_rebase=implied
> + type=interactive
>   state_dir="$merge_dir"
>  else
>   type=am
> @@ -501,17 +502,11 @@ fi
>  
>  if test -n "$git_am_opt"; then
>   incompatible_opts=`echo "$git_am_opt" | sed -e 's/ -q//'`
> - if test -n "$interactive_rebase"
> + if test -n "$incompatible_opts"

Did you not mean to turn this into a test for actually_interactive ||
do_merge?

>   then
> - if test -n "$incompatible_opts"
> + if test -n "$actually_interactive" || test "$do_merge"

This could now be combined with the previous if (and the `-n` could be
added to the latter test):

if test -n "$actually_interactive" -o -n "$do_merge" &&
test -n "$incompatible_opts"

The indentation would change, but this hunk is already confusing to read,
anyway, so...

>   then
> - die "$(gettext "error: cannot combine interactive options (--interactive, --exec, --rebase-merges, --preserve-merges, --keep-empty, --root + --onto) with am options 

Re: [RFC PATCH 1/3] git-rebase, sequencer: add a --quiet option for the interactive machinery

2018-06-09 Thread Johannes Schindelin
Hi Elijah,

On Thu, 7 Jun 2018, Elijah Newren wrote:

> While 'quiet' and 'interactive' may sound like antonyms, the interactive
> machinery actually has logic that implements several
> interactive_rebase=implied cases (--exec, --keep-empty, --rebase-merges)
> which won't pop up an editor.  Further, we want to make the interactive
> machinery also take over for git-rebase--merge and become the default
> merge strategy, so it makes sense for these other cases to have a quiet
> option.
> 
> git-rebase--interactive was already somewhat quieter than
> git-rebase--merge and git-rebase--am, possibly because cherry-pick has
> just traditionally been quieter.  As such, we only drop a few
> informational messages -- "Rebasing (n/m)" and "Successfully rebased..."

Makes sense. As long as it is coordinated with Alban and Pratik, as both
of their GSoC projects are affected by this.

In particular Pratik's project, I think, would actually *benefit* from
your work, as it might even make it possible to turn all modes but
--preserve-merges into pure builtin code, which would be awesome.

> diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
> index 06a7b79307..1f2401f702 100644
> --- a/git-rebase--interactive.sh
> +++ b/git-rebase--interactive.sh
> @@ -139,8 +139,12 @@ mark_action_done () {
>   if test "$last_count" != "$new_count"
>   then
>   last_count=$new_count
> - eval_gettext "Rebasing (\$new_count/\$total)"; printf "\r"
> - test -z "$verbose" || echo
> + if test -z "$GIT_QUIET"
> + then
> + eval_gettext "Rebasing (\$new_count/\$total)";
> + printf "\r"
> + test -z "$verbose" || echo
> + fi
>   fi
>  }
>  
> @@ -713,6 +717,7 @@ Commit or stash your changes, and then run
>   "$hook" rebase < "$rewritten_list"
>   true # we don't care if this hook failed
>   fi &&
> + test -z "$GIT_QUIET" &&
> warn "$(eval_gettext "Successfully rebased and updated \$head_name.")"

In general, I tried to have the statements return success at all times. That
means that

test -n "$GIT_QUIET" ||

would be better in this case.

> diff --git a/git-rebase.sh b/git-rebase.sh
> index 7d1612b31b..b639c0d4fe 100755
> --- a/git-rebase.sh
> +++ b/git-rebase.sh
> @@ -136,7 +136,7 @@ write_basic_state () {
>   echo "$head_name" > "$state_dir"/head-name &&
>   echo "$onto" > "$state_dir"/onto &&
>   echo "$orig_head" > "$state_dir"/orig-head &&
> - echo "$GIT_QUIET" > "$state_dir"/quiet &&
> + test t = "$GIT_QUIET" && : > "$state_dir"/quiet

Maybe it would be better to `echo t` into that file? That way, scripts
that used the value in that file would continue to work. (But maybe there
were no scripts that could use it, as only the interactive rebase allows
scripting, and it did not handle that flag before?)

The rest looks obviously good.

Thank you!
Dscho


Re: State of NewHash work, future directions, and discussion

2018-06-09 Thread Ævar Arnfjörð Bjarmason


On Sat, Jun 09 2018, brian m. carlson wrote:

> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.

Thanks for all your work on this.

> I know that we have long tried to avoid discussing the specific
> algorithm to use, in part because the last discussion generated more
> heat than light, and settled on referring to it as NewHash for the time
> being.  However, I think it's time to pick this topic back up, since I
> can't really continue work in this direction without us picking a
> NewHash.
>
> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.

Let's see it!


Re: [RFC PATCH 0/2] contrib/credential/netrc Makefile & test improvements

2018-06-09 Thread Luis Marsano
Thanks for looking into this and addressing these issues.

On Thu, Jun 7, 2018 at 1:20 AM Todd Zullinger  wrote:
>
> Hi,
>
> I noticed failures from the contrib/credential/netrc tests
> while building 2.18.0 release candidates.  I was surprised
> to see the tests being run when called with a simple 'make'
> command.
>
> The first patch in the series adds an empty 'all::' make
> target to match most of our other Makefiles and avoid the
> surprise of running tests by default.  (When the netrc
> helper was added to the fedora builds, it copied the same
> 'make -C contrib/credential/...' pattern from other
> credential helpers -- despite the lack of anything to
> build.)

I think this is a good idea.

> The actual test failures were initially due to my build
> environment lacking the perl autodie module, which was added
> in 786ef50a23 ("git-credential-netrc: accept gpg option",
> 2018-05-12).

I added 'use autodie;' without realizing it had external dependencies.
According to the documentation
(http://perldoc.perl.org/autodie.html)
it has been a core pragma since perl 5.10.1.
Removing 'use autodie;' should be fine: it's not critical.

> After installing the autodie module, the failures were due
> to the build environment lacking a git install (specifically
> the perl Git module).  The tests needing a pre-installed
> perl Git seemed odd and worth fixing.

I mistakenly thought 'use lib (split(/:/, $ENV{GITPERLLIB}));' in
test.pl handled this.
Since it doesn't, and I was only following an example from
t/t9700/test.pl that doesn't fit, this line should be removed and it
might make more sense to set the environment from
t-git-credential-netrc.sh near '. ./test-lib.sh', which also sets the
environment.
Something like

diff --git a/contrib/credential/netrc/t-git-credential-netrc.sh b/contrib/credential/netrc/t-git-credential-netrc.sh
index 58191a6..9e18611 100755
--- a/contrib/credential/netrc/t-git-credential-netrc.sh
+++ b/contrib/credential/netrc/t-git-credential-netrc.sh
@@ -23,5 +23,6 @@
# The external test will outputs its own plan
test_external_has_tap=1

+   export PERL5LIB="$GITPERLLIB"
test_external \
 'git-credential-netrc' \

Your solution, however, is reasonable, and I don't know which is preferred.

> The second patch in the series aims to fix this.  I'm not
> sure if there's a better or more preferable way to fix this,
> which is one of the reasons for the RFC tag. (It's also why
> I added you to the Cc Ævar, as you're one of the
> knowledgeable perl folks here.)
>
> The other reason for the RFC tag is that I'm unsure of how
> to fix the last issue I found.  The tests exit cleanly even
> when there are failures, which seems undesirable.  I'm not
> familiar with the perl test_external framework to suggest a
> fix in patch form.  It might be a matter of adding something
> like this, from t/t9700/test.pl:
>
> my $is_passing = eval { Test::More->is_passing };
> exit($is_passing ? 0 : 1) unless $@ =~ /Can't locate object method/;
>
> ?  But that's a wild guess which I haven't tested.

Good idea.
It looks like you found an issue with t/t9700/test.pl, too.
When altered to fail, it first reports ok (then reports failed and
returns non-0).

not ok 46 - unquote simple quoted path
not ok 47 - unquote escape sequences
1..47
# test_external test Perl API was ok
# test_external_without_stderr test no stderr: Perl API failed: perl /home/luism/project/git/t/t9700/test.pl:
$ echo $?
1

To make git-credential-netrc tests behave correctly, I ended up making
the changes below.
They might be okay, unless someone knows better.

diff --git a/contrib/credential/netrc/t-git-credential-netrc.sh b/contrib/credential/netrc/t-git-credential-netrc.sh
index 58191a6..9e18611 100755
--- a/contrib/credential/netrc/t-git-credential-netrc.sh
+++ b/contrib/credential/netrc/t-git-credential-netrc.sh
@@ -23,9 +23,10 @@
# The external test will outputs its own plan
test_external_has_tap=1

+   export PERL5LIB="$GITPERLLIB"
test_external \
 'git-credential-netrc' \
-perl "$TEST_DIRECTORY"/../contrib/credential/netrc/test.pl
+perl "$GIT_BUILD_DIR"/contrib/credential/netrc/test.pl

test_done
 )
diff --git a/contrib/credential/netrc/test.pl b/contrib/credential/netrc/test.pl
index 1e10010..abc9081 100755
--- a/contrib/credential/netrc/test.pl
+++ b/contrib/credential/netrc/test.pl
@@ -1,6 +1,4 @@
 #!/usr/bin/perl
-use lib (split(/:/, $ENV{GITPERLLIB}));
-
 use warnings;
 use strict;
 use Test::More qw(no_plan);
@@ -12,7 +10,6 @@ BEGIN
# t-git-credential-netrc.sh kicks off our testing, so we have to go
# from there.
Test::More->builder->current_test(1);
-   Test::More->builder->no_ending(1);
 }

 my @global_credential_args = @ARGV;
@@ -104,6 +101,9 @@ BEGIN

 ok(scalar keys %$cred == 2, 'Got keys decrypted by command option');

+my $is_passing = eval { Test::More->is_passing };
+exit($is_passing ? 0 : 1) unless $@ =~ /Can't locate object method/;

State of NewHash work, future directions, and discussion

2018-06-09 Thread brian m. carlson
Since there's been a lot of questions recently about the state of the
NewHash work, I thought I'd send out a summary.

== Status

I have patches to make the entire codebase work, including passing all
tests, when Git is converted to use a 256-bit hash algorithm.
Obviously, such a Git is incompatible with the current version, but it
means that we've fixed essentially all of the hard-coded 20 and 40
constants (and therefore Git doesn't segfault).

I'm working on getting a 256-bit Git to work with SHA-1 being the
default.  Currently, this involves doing things like writing transport
code, since in order to clone a repository, you need to be able to set
up the hash algorithm correctly.  I know that this was a non-goal in the
transition plan, but since the testsuite doesn't pass without it, it's
become necessary.

Some of these patches will be making their way to the list soon.
They're hanging out in the normal places in the object-id-part14 branch
(which may be rebased).

== Future Design

The work I've done necessarily involves porting everything to use
the_hash_algo.  Essentially, when the piece I'm currently working on is
complete, we'll have a transition stage 4 implementation (all NewHash).
Stage 2 and 3 will be implemented next.

My vision of how data is stored is that the .git directory is, except
for pack indices and the loose object lookup table, entirely in one
format.  It will be all SHA-1 or all NewHash.  This algorithm will be
stored in the_hash_algo.

I plan on introducing an array of hash algorithms into struct repository
(and wrapper macros) which stores, in order, the output hash, and if
used, the additional input hash.

Functions like get_oid_hex and parse_oid_hex will acquire an internal
version, which knows about parsing things (like refs) in the internal
format, and one which knows about parsing in the UI formats.  Similarly,
oid_to_hex will have an internal version that handles data in the .git
directory, and an external version that produces data in the output
format.  Translation will take place at the outer edges of the program.

The transition plan anticipates a stage 1 where we accept only SHA-1 on
input and produce only SHA-1 on output, but store in NewHash.  As I've
worked with our tests, I've realized such an implementation is not
entirely possible.  We have various tools that expect to accept invalid
object IDs, and obviously there's no way to have those continue to work.
We'd have to either reject invalid data in such a case or combine stages
1 and 2.

== Compatibility with this Work

If you're working on new features and you'd like to implement the best
possible compatibility with this work, here are some recommendations:

* Assume everything in the .git directory but pack indices and the loose
  object index will be in the same algorithm and that that algorithm is
  the_hash_algo.
* For the moment, use the_hash_algo to look up the size of all
  hash-related constants.  Use GIT_MAX_* for allocations.
* If you are writing a new data format, add a version number.
* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo.  It's
  designed specifically for that purpose.
* You can safely assume that the_hash_algo will be suitably initialized
  to the correct algorithm for your repository.
* Keep using the object ID functions and struct object_id.
* Try not to use mmap'd structs for reading and writing formats on disk,
  since these are hard to make hash size agnostic.

== Discussion about an Actual NewHash

Since I'll be writing new code, I'll be writing tests for this code.
However, writing tests for creating and initializing repositories
requires that I be able to test that objects are being serialized
correctly, and therefore requires that I actually know what the hash
algorithm is going to be.  I also can't submit code for multi-hash packs
when we officially only support one hash algorithm.

I know that we have long tried to avoid discussing the specific
algorithm to use, in part because the last discussion generated more
heat than light, and settled on referring to it as NewHash for the time
being.  However, I think it's time to pick this topic back up, since I
can't really continue work in this direction without us picking a
NewHash.

If people are interested, I've done some analysis on availability of
implementations, performance, and other attributes described in the
transition plan and can send that to the list.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204




Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Schindelin
Hi Hannes,

On Sat, 9 Jun 2018, Johannes Sixt wrote:

> Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:
> > 
> > On Fri, Jun 08 2018, Johannes Sixt wrote:
> > 
> > > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> > > > I for my part would much rather prefer that to be a compile time
> > > > option so that I don't need to check on every git update on windows
> > > > if  this is now enabled or not.
> > >
> > > This is exactly my concern, too! A compile-time option may make it a good
> > > deal less worrisome.
> > 
> > Can you elaborate on how someone who can inject malicious code into
> > your git package + config would be thwarted by this being some
> > compile-time option? Wouldn't they just compile it in?
> 
> Of course they can. But would we, the Git community do that?
> 
> From the design document:
> 
> > The goal of the telemetry feature is to be able to gather usage data
> > across a group of production users to identify real-world performance
> > problems in production.  Additionally, it might help identify common
> > user errors and guide future user training.
> 
> The goal to gather usage data may be valid for a small subset of Git
> installations. But it is wrong to put this into the software itself, in
> particular when the implementation includes scary things like loading
> unspecified dynamic libraries:
> 
> > If the config setting "telemetry.plugin" contains the pathname to a
> > shared library, the library will be dynamically loaded during start up
> > and events will be sent to it using the plugin API.
> 
> When you want usage data, ask your users for feedback. Look over their
> shoulders. But do not ask the software itself to gather usage data. It will be
> abused.
> 
> Do not offer open source software that has a "call-home" method built-in.
> 
> If you want to peek into the workplaces of YOUR users, then monkey-patch
> surveillance into YOUR version of Git. But please do not burden the rest of
> us.

We already offer hooks. You can do anything with those hooks. Even, if you
do not pay close attention, to transfer all your bitcoin to a specific
account.

I agree with Peff: this is something you as a user need to be aware of,
and need to make sure you configure your Git just like you want. As long
as this is a purely opt-in feature, it is useful and helpful.

We do need it in-house, for the many thousands of Git users we try to
support with a relatively small team of Git developers.

Ciao,
Dscho

Re: Why is there no force pull?

2018-06-09 Thread Ævar Arnfjörð Bjarmason


On Sat, Jun 09 2018, Elijah Newren wrote:

> On Sat, Jun 9, 2018 at 12:01 PM, Christoph Böhmwalder wrote:
>> Hi,
>>
>> Since this is a use case that actually comes up quite often in
>> day-to-day use, especially among git beginners, I was wondering: is
>> there a specific reason why a command like "fetch changes from remote,
>> overwriting everything in my current working directory including all
>> commits I've made" doesn't exist? Now, I'm quite aware that something
>> like
>>
>> $ git fetch origin/branch
>> $ git reset --hard origin/branch
>>
>> will do the trick just fine, but (like I mentioned, especially for
>> beginners) this kind of seems like a crook. Why not have a single
>> command for accomplishing this? After all, we do have a `--force` flag on
>> `git push`, which practically does the same thing in reverse.
>>
>> Just reaching out to get some input on this, as it seems like a quite
>> curious inconsistency to me.
>
> Upon reading the subject and before reading the body, I assumed you
> were going to ask for a 'git pull --force' that would throw away
> *uncommitted* changes (i.e. do a 'git reset --hard HEAD' before the
> rest of the pull).  But then you asked for both uncommitted and
> committed changes to be thrown away.  That difference isn't something
> you have to consider with a push.
>
> That might be a reason such an option would be confusing, or it might
> just be a warning to document the option carefully.  Anyway, thought
> I'd mention it.

More generally, "git pull"'s params are passed to "git fetch", and then
we either "git merge" or "git rebase".

This proposed behavior doesn't fit into that at all.

But it would if we added a third mode, similar to how we added "rebase",
where we'd dispatch to "reset" instead, so:

git pull --reset --hard

Meaning (in the general case):

git fetch &&
git reset --hard @{u}


Re: Why is there no force pull?

2018-06-09 Thread Christoph Böhmwalder
On Sat, Jun 09, 2018 at 01:04:30PM -0700, Elijah Newren wrote:
> Upon reading the subject and before reading the body, I assumed you
> were going to ask for a 'git pull --force' that would throw away
> *uncommitted* changes (i.e. do a 'git reset --hard HEAD' before the
> rest of the pull).  But then you asked for both uncommitted and
> committed changes to be thrown away.  That difference isn't something
> you have to consider with a push.
> 
> That might be a reason such an option would be confusing, or it might
> just be a warning to document the option carefully.  Anyway, thought
> I'd mention it.

Interesting, I hadn't taken that first scenario into consideration at
all. So I guess two very aptly named flags would be necessary to
implement this kind of feature...

--
Regards,
Christoph


Re: [PATCH] gitworkflows: fix grammar in 'Merge upwards' rule

2018-06-09 Thread Elijah Newren
Hi Kyle,

On Sat, Jun 9, 2018 at 8:19 AM, Kyle Meyer wrote:
> Signed-off-by: Kyle Meyer 
> ---
>  Documentation/gitworkflows.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/gitworkflows.txt b/Documentation/gitworkflows.txt
> index 926e044d0..ca11c7bda 100644
> --- a/Documentation/gitworkflows.txt
> +++ b/Documentation/gitworkflows.txt
> @@ -107,7 +107,7 @@ the unstable branch into the stable one.  Hence the following:
>  .Merge upwards
>  [caption="Rule: "]
>  =
> -Always commit your fixes to the oldest supported branch that require
> +Always commit your fixes to the oldest supported branch that requires
>  them.  Then (periodically) merge the integration branches upwards into each
>  other.
>  =

Thanks for sending in the fix.

Reviewed-by: Elijah Newren 


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Schindelin
Hi Peff,


On Sat, 9 Jun 2018, Jeff King wrote:

> On Sat, Jun 09, 2018 at 08:31:58AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> > 1) I really don't see the basis for this argument that they'd need to
> >patch it out, they're not patching out e.g. GIT_TRACE now, which has
> >all the same sort of concerns, it's just a format that's more of a
> >hassle to parse than this proposed format.
> >
> > 2) I think you and Johannes are just seeing the "telemetry" part of
> >this, but if you look past that all this *really* is is "GIT_TRACE
> >facility that doesn't suck to parse".
> 
> So that actually takes me to another question, which is: instead of
> introducing a new "telemetry" feature, could we make GIT_TRACE suck less
> to parse?

That's a different question, and even a different concern.

Please understand that this feature is needed for our own in-house use, to
be able to help our users pro-actively. We really want to be able to
contact a user when they struggle with the performance of any given Git
command, before they have to think of reaching out for assistance.

We discussed this plan at the contributor summit at GitMerge, and IIRC at
least Autodesk and DropBox were interested (but of course, the changes of
both Lars' and Alex' roles might make them less interested now), hence
JeffHost wanted to contribute this, for the benefit of the larger Git
community.

> E.g., could we have a flag or environment variable to have the existing
> traces output JSON? I guess right now they're inherently free-form via
> trace_printf, so it would involve adding some structured interface
> calls. Which is more or less what I guess JeffH's proposed feature
> would look like.

I think that is a much larger project than what JeffHost proposed, and
would unfortunately put too much of a brake on his project.

Ciao,
Dscho

Re: Why is there no force pull?

2018-06-09 Thread Elijah Newren
On Sat, Jun 9, 2018 at 12:01 PM, Christoph Böhmwalder wrote:
> Hi,
>
> Since this is a use case that actually comes up quite often in
> day-to-day use, especially among git beginners, I was wondering: is
> there a specific reason why a command like "fetch changes from remote,
> overwriting everything in my current working directory including all
> commits I've made" doesn't exist? Now, I'm quite aware that something
> like
>
> $ git fetch origin/branch
> $ git reset --hard origin/branch
>
> will do the trick just fine, but (like I mentioned, especially for
> beginners) this kind of seems like a crook. Why not have a single
> command for accomplishing this? After all, we do have a `--force` flag on
> `git push`, which practically does the same thing in reverse.
>
> Just reaching out to get some input on this, as it seems like a quite
> curious inconsistency to me.

Upon reading the subject and before reading the body, I assumed you
were going to ask for a 'git pull --force' that would throw away
*uncommitted* changes (i.e. do a 'git reset --hard HEAD' before the
rest of the pull).  But then you asked for both uncommitted and
committed changes to be thrown away.  That difference isn't something
you have to consider with a push.

That might be a reason such an option would be confusing, or it might
just be a warning to document the option carefully.  Anyway, thought
I'd mention it.


Re: [PATCH v4 00/23] Fix incorrect use of the_index

2018-06-09 Thread Elijah Newren
On Thu, Jun 7, 2018 at 12:44 AM, Elijah Newren wrote:
> On Wed, Jun 6, 2018 at 9:49 AM, Nguyễn Thái Ngọc Duy wrote:
>> v4 fixes some commit messages and killed a couple more the_index
>> references after I tried to merge this with 'pu'
>
> Thanks for tackling this.  The first 8 patches look good to me other
> than the typo I pointed out on patch 4.  However, my eyes glazed over
> a bit on the attr.c stuff in patch 7 that you specifically mentioned
> in the commit message, so my "looks good" doesn't count for as much on
> that one.

I re-read the attr.c stuff in patch 7; looks good to me and the new
series makes the code much more readable in the end.

> I'm getting sleepy, but I'll try to circle back and look over the rest
> of the patches in the next few days.

I read over the rest.  Found a small grammatical error in a commit
message.  Found multiple places that still need conversion, from
pushing up the_index usages to callers of ll-merge.c and sha1-file.c
instead of having them in those files, to mixes of _cache_ and _index_
functions as in apply.c and merge-recursive.c.  However, Duy pointed
out there was more work to do, and this series is a step in the right
direction and a good start.


Re: [PATCH v4 17/23] read-cache.c: remove an implicit dependency on the_index

2018-06-09 Thread Elijah Newren
On Sat, Jun 9, 2018 at 11:45 AM, Duy Nguyen wrote:
> On Sat, Jun 9, 2018 at 8:10 PM Elijah Newren wrote:
>>
>> On Wed, Jun 6, 2018 at 10:02 AM, Nguyễn Thái Ngọc Duy wrote:
>> > diff --git a/merge-recursive.c b/merge-recursive.c
>> > index b404ebac7c..4f054d6dbb 100644
>> > --- a/merge-recursive.c
>> > +++ b/merge-recursive.c
>> > @@ -315,7 +315,7 @@ static int add_cacheinfo(struct merge_options *o,
>> > struct cache_entry *ce;
>> > int ret;
>> >
>> > -   ce = make_cache_entry(mode, oid ? oid->hash : null_sha1, path, stage, 0);
>> > +   ce = make_index_entry(&the_index, mode, oid ? oid->hash : null_sha1, path, stage, 0);
>> > if (!ce)
>> > return err(o, _("add_cacheinfo failed for path '%s'; merge aborting."), path);
>>
>> There's also a refresh_cache_entry() call about ten lines after this;
>> since you converted all other make_cache_entry() and
>> refresh_cache_entry() calls in this patch, I'm curious if that one was
>> left out for a reason or was just an oversight.
>
> Ah I didn't mean to convert or kill refresh_cache_entry(), not outside
> read-cache.c. I rely on NO_THE_INDEX_COMPATIBILITY_MACROS to catch
> *cache* functions and if we set it in this file, we're going to have a
> lot more work to do and plenty of the_index will show up.
>
>> There are also a lot of add_cache_entry() calls in this function.  I'm
>> guessing we should either convert all of those too, or just change
>> back this particular make_index_entry to make_cache_entry() as it was;
>> it seems weird to have a mix of explicit the_index and implicit
>> the_index usages in the same function.
>
> Yes some files still have the mix of the_index and *cache*(). This one
> and apply.c come to mind. There's more work to do to kill all
> the_index outside builtin/

Yeah, there are also low-level common files like sha1-file.c or
ll-merge.c which now have the_index in them (or more references to
it).  Clearly those need to be propagated up, but as you say, there's
more work to do.  Your patch series just moves things in the right
direction.

>> If we convert them all,
>> perhaps we should consider having merge_options store the index we're
>> working on?  If you want to punt this until later or leave it for me
>> while I make all my ongoing merge-recursive changes, that's fine.
>> Just thought I'd point it out.
>
> Right you're updating merge-recursive.c, I'd love it if you could
> define NO_THE_INDEX_COMPATIBILITY_MACROS here. Yes merge_options
> sounds like a good place to tell merge-recursive where to get a struct
> index_state.

Sure, I can tackle that.


Why is there no force pull?

2018-06-09 Thread Christoph Böhmwalder
Hi,

Since this is a use case that actually comes up quite often in
day-to-day use, especially among git beginners, I was wondering: is
there a specific reason why a command like "fetch changes from remote,
overwriting everything in my current working directory including all
commits I've made" doesn't exist? Now, I'm quite aware that something
like

$ git fetch origin/branch
$ git reset --hard origin/branch

will do the trick just fine, but (like I mentioned, especially for
beginners) this kind of seems like a crook. Why not have a single
command for accomplishing this? After all, we do have a `--force` flag on
`git push`, which practically does the same thing in reverse.

Just reaching out to get some input on this, as it seems like a quite
curious inconsistency to me.

--
Regards,
Christoph


Re: [PATCH v4 17/23] read-cache.c: remove an implicit dependency on the_index

2018-06-09 Thread Duy Nguyen
On Sat, Jun 9, 2018 at 8:10 PM Elijah Newren wrote:
>
> On Wed, Jun 6, 2018 at 10:02 AM, Nguyễn Thái Ngọc Duy wrote:
> > diff --git a/merge-recursive.c b/merge-recursive.c
> > index b404ebac7c..4f054d6dbb 100644
> > --- a/merge-recursive.c
> > +++ b/merge-recursive.c
> > @@ -315,7 +315,7 @@ static int add_cacheinfo(struct merge_options *o,
> > struct cache_entry *ce;
> > int ret;
> >
> > -   ce = make_cache_entry(mode, oid ? oid->hash : null_sha1, path, stage, 0);
> > +   ce = make_index_entry(&the_index, mode, oid ? oid->hash : null_sha1, path, stage, 0);
> > if (!ce)
> > return err(o, _("add_cacheinfo failed for path '%s'; merge aborting."), path);
>
> There's also a refresh_cache_entry() call about ten lines after this;
> since you converted all other make_cache_entry() and
> refresh_cache_entry() calls in this patch, I'm curious if that one was
> left out for a reason or was just an oversight.

Ah I didn't mean to convert or kill refresh_cache_entry(), not outside
read-cache.c. I rely on NO_THE_INDEX_COMPATIBILITY_MACROS to catch
*cache* functions and if we set it in this file, we're going to have a
lot more work to do and plenty of the_index will show up.

> There are also a lot of add_cache_entry() calls in this function.  I'm
> guessing we should either convert all of those too, or just change
> back this particular make_index_entry to make_cache_entry() as it was;
> it seems weird to have a mix of explicit the_index and implicit
> the_index usages in the same function.

Yes some files still have the mix of the_index and *cache*(). This one
and apply.c come to mind. There's more work to do to kill all
the_index outside builtin/

> If we convert them all,
> perhaps we should consider having merge_options store the index we're
> working on?  If you want to punt this until later or leave it for me
> while I make all my ongoing merge-recursive changes, that's fine.
> Just thought I'd point it out.

Right you're updating merge-recursive.c, I'd love it if you could
define NO_THE_INDEX_COMPATIBILITY_MACROS here. Yes merge_options
sounds like a good place to tell merge-recursive where to get a struct
index_state.
-- 
Duy


Re: [PATCH 23/23] midx: clear midx on repack

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee wrote:
>
> If a 'git repack' command replaces existing packfiles, then we must
> clear the existing multi-pack-index before moving the packfiles it
> references.

I think there are other places where we add or remove pack files and
need to reprepare_packed_git(). Any midx invalidation should be part
of that as well.

>
> Signed-off-by: Derrick Stolee 
> ---
>  builtin/repack.c | 8 
>  midx.c   | 8 
>  midx.h   | 1 +
>  3 files changed, 17 insertions(+)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 6c636e159e..66a7d8e8ea 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -8,6 +8,7 @@
>  #include "strbuf.h"
>  #include "string-list.h"
>  #include "argv-array.h"
> +#include "midx.h"
>
>  static int delta_base_offset = 1;
>  static int pack_kept_objects = -1;
> @@ -174,6 +175,7 @@ int cmd_repack(int argc, const char **argv, const char 
> *prefix)
> int no_update_server_info = 0;
> int quiet = 0;
> int local = 0;
> +   int midx_cleared = 0;
>
> struct option builtin_repack_options[] = {
> OPT_BIT('a', NULL, &pack_everything,
> @@ -340,6 +342,12 @@ int cmd_repack(int argc, const char **argv, const char 
> *prefix)
> continue;
> }
>
> +   if (!midx_cleared) {
> +   /* if we move a packfile, it will invalidate the midx */

What about removing packs, which also happens in repack? If the
removed pack is part of midx, then midx becomes invalid as well.

> +   clear_midx_file(get_object_directory());
> +   midx_cleared = 1;
> +   }
> +
> fname_old = mkpathdup("%s/old-%s%s", packdir,
> item->string, exts[ext].name);
> if (file_exists(fname_old))
> diff --git a/midx.c b/midx.c
> index e46f392fa4..1043c01fa7 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -913,3 +913,11 @@ int write_midx_file(const char *object_dir)
> FREE_AND_NULL(pack_names);
> return 0;
>  }
> +
> +void clear_midx_file(const char *object_dir)

delete_ may be more obvious than clear_

> +{
> +   char *midx = get_midx_filename(object_dir);
> +
> +   if (remove_path(midx))
> +   die(_("failed to clear multi-pack-index at %s"), midx);

die_errno()

> +}
> diff --git a/midx.h b/midx.h
> index 6996b5ff6b..46f9f44c94 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -18,5 +18,6 @@ int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
> +void clear_midx_file(const char *object_dir);
>
>  #endif
> --
> 2.18.0.rc1
>


-- 
Duy


Re: [PATCH v4 17/23] read-cache.c: remove an implicit dependency on the_index

2018-06-09 Thread Elijah Newren
On Wed, Jun 6, 2018 at 10:02 AM, Nguyễn Thái Ngọc Duy wrote:
> diff --git a/merge-recursive.c b/merge-recursive.c
> index b404ebac7c..4f054d6dbb 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -315,7 +315,7 @@ static int add_cacheinfo(struct merge_options *o,
> struct cache_entry *ce;
> int ret;
>
> -   ce = make_cache_entry(mode, oid ? oid->hash : null_sha1, path, stage, 0);
> +   ce = make_index_entry(&the_index, mode, oid ? oid->hash : null_sha1, path, stage, 0);
> if (!ce)
> return err(o, _("add_cacheinfo failed for path '%s'; merge aborting."), path);

There's also a refresh_cache_entry() call about ten lines after this;
since you converted all other make_cache_entry() and
refresh_cache_entry() calls in this patch, I'm curious if that one was
left out for a reason or was just an oversight.

There are also a lot of add_cache_entry() calls in this function.  I'm
guessing we should either convert all of those too, or just change
back this particular make_index_entry to make_cache_entry() as it was;
it seems weird to have a mix of explicit the_index and implicit
the_index usages in the same function.  If we convert them all,
perhaps we should consider having merge_options store the index we're
working on?  If you want to punt this until later or leave it for me
while I make all my ongoing merge-recursive changes, that's fine.
Just thought I'd point it out.


Re: [PATCH 21/23] midx: prevent duplicate packfile loads

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee wrote:
>
> If the multi-pack-index contains a packfile, then we do not need to add
> that packfile to the packed_git linked list or the MRU list.

Because...?

I think I see the reason, but I'd like it spelled out to avoid any
misunderstanding.

>
> Signed-off-by: Derrick Stolee 
> ---
>  midx.c | 23 +++
>  midx.h |  1 +
>  packfile.c |  7 +++
>  3 files changed, 31 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 388d79b7d9..3242646fe0 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -278,6 +278,29 @@ int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m)
> return nth_midxed_pack_entry(m, e, pos);
>  }
>
> +int midx_contains_pack(struct midxed_git *m, const char *idx_name)
> +{
> +   uint32_t first = 0, last = m->num_packs;
> +
> +   while (first < last) {
> +   uint32_t mid = first + (last - first) / 2;
> +   const char *current;
> +   int cmp;
> +
> +   current = m->pack_names[mid];
> +   cmp = strcmp(idx_name, current);
> +   if (!cmp)
> +   return 1;
> +   if (cmp > 0) {
> +   first = mid + 1;
> +   continue;
> +   }
> +   last = mid;
> +   }
> +
> +   return 0;
> +}
> +
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir)
>  {
> struct midxed_git *m = r->objects->midxed_git;
> diff --git a/midx.h b/midx.h
> index 497bdcc77c..c1db58d8c4 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -13,6 +13,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid,
> struct midxed_git *m,
> uint32_t n);
>  int fill_midx_entry(const struct object_id *oid, struct pack_entry *e, struct midxed_git *m);
> +int midx_contains_pack(struct midxed_git *m, const char *idx_name);
>  int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
> diff --git a/packfile.c b/packfile.c
> index 059b2aa097..479cb69b9f 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -746,6 +746,11 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
> DIR *dir;
> struct dirent *de;
> struct string_list garbage = STRING_LIST_INIT_DUP;
> +   struct midxed_git *m = r->objects->midxed_git;
> +
> +   /* look for the multi-pack-index for this object directory */
> +   while (m && strcmp(m->object_dir, objdir))
> +   m = m->next;
>
> strbuf_addstr(&path, objdir);
> strbuf_addstr(&path, "/pack");
> @@ -772,6 +777,8 @@ static void prepare_packed_git_one(struct repository *r, char *objdir, int local)
> base_len = path.len;
> if (strip_suffix_mem(path.buf, &base_len, ".idx")) {
> /* Don't reopen a pack we already have. */
> +   if (m && midx_contains_pack(m, de->d_name))
> +   continue;
> for (p = r->objects->packed_git; p;
>  p = p->next) {
> size_t len;
> --
> 2.18.0.rc1
>


-- 
Duy


Re: [PATCH 20/23] midx: use midx in approximate_object_count

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee wrote:
>
> Signed-off-by: Derrick Stolee 
> ---
>  packfile.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/packfile.c b/packfile.c
> index 638e113972..059b2aa097 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -819,11 +819,14 @@ unsigned long approximate_object_count(void)
>  {
> if (!the_repository->objects->approximate_object_count_valid) {
> unsigned long count;
> +   struct midxed_git *m;
> struct packed_git *p;
>
> prepare_packed_git(the_repository);
> count = 0;
> -   for (p = the_repository->objects->packed_git; p; p = p->next) {
> +   for (m = get_midxed_git(the_repository); m; m = m->next)
> +   count += m->num_objects;
> +   for (p = get_packed_git(the_repository); p; p = p->next) {

Please don't change this line; it's not related to this patch. The same
concern applies here: if we have already counted objects in the midx, we
should ignore packs that belong to it, or we will double-count.

> if (open_pack_index(p))
> continue;
> count += p->num_objects;
> --
> 2.18.0.rc1
>


-- 
Duy


Re: [PATCH 18/23] midx: use midx in abbreviation calculations

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee wrote:
> @@ -565,8 +632,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
>
>  static void find_abbrev_len_packed(struct min_abbrev_data *mad)
>  {
> +   struct midxed_git *m;
> struct packed_git *p;
>
> +   for (m = get_midxed_git(the_repository); m; m = m->next)
> +   find_abbrev_len_for_midx(m, mad);

If all the packs are in midx, we don't need to run the second loop
below, do we? Otherwise I don't see why we waste cycles on finding
abbrev length on midx at all.

> for (p = get_packed_git(the_repository); p; p = p->next)
> find_abbrev_len_for_pack(p, mad);
>  }
-- 
Duy


Re: [PATCH v4 15/23] attr: remove index from git_attr_set_direction()

2018-06-09 Thread Elijah Newren
On Wed, Jun 6, 2018 at 10:02 AM, Nguyễn Thái Ngọc Duy  wrote:
> Since attr checking API now take the index, there's no need to set an
> index in advance with this call. Most call sites are straightforward
> because they either pass the_index or NULL (which defaults back to
> the_index previously). There's only one suspicious call site in
> unpack-trees.c where it sets a different index.
>
> This code in unpack-trees is about to checking out entries from the

minor nit: s/checking/check/ ?

...

> diff --git a/attr.c b/attr.c
> index 863fad3bd1..98e4953f6e 100644
> --- a/attr.c
> +++ b/attr.c
> @@ -708,10 +708,8 @@ static struct attr_stack *read_attr_from_array(const 
> char **list)
>   * another thread could potentially be calling into the attribute system.
>   */
>  static enum git_attr_direction direction;
> -static const struct index_state *use_index;
>
> -void git_attr_set_direction(enum git_attr_direction new_direction,
> -   const struct index_state *istate)
> +void git_attr_set_direction(enum git_attr_direction new_direction)
>  {
> if (is_bare_repository() && new_direction != GIT_ATTR_INDEX)
> BUG("non-INDEX attr direction in a bare repo");
> @@ -720,7 +718,6 @@ void git_attr_set_direction(enum git_attr_direction 
> new_direction,
> drop_all_attr_stacks();
>
> direction = new_direction;
> -   use_index = istate;
>  }
>
>  static struct attr_stack *read_attr_from_file(const char *path, int macro_ok)
> @@ -750,17 +747,11 @@ static struct attr_stack *read_attr_from_index(const 
> struct index_state *istate,
> struct attr_stack *res;
> char *buf, *sp;
> int lineno = 0;
> -   const struct index_state *to_read_from;
>
> -   /*
> -* Temporary workaround for c24f3abace (apply: file commited
> -* with CRLF should roundtrip diff and apply - 2017-08-19)
> -*/
> -   to_read_from = use_index ? use_index : istate;
> -   if (!to_read_from)
> +   if (!istate)
> return NULL;
>
> -   buf = read_blob_data_from_index(to_read_from, path, NULL);
> +   buf = read_blob_data_from_index(istate, path, NULL);

This code is much clearer and easier to reason about than stashing off
use_index.  I had to dig for a bit through history to try to
understand the old code and why it was written that way, but the new
code just makes sense.


Re: [PATCH 17/23] midx: read objects from multi-pack-index

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 6:55 PM Derrick Stolee  wrote:
>
> Signed-off-by: Derrick Stolee 
> ---
>  midx.c | 96 --
>  midx.h |  2 ++
>  object-store.h |  1 +
>  packfile.c |  8 -
>  4 files changed, 104 insertions(+), 3 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 5e9290ca8f..6eca8f1b12 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -3,6 +3,7 @@
>  #include "dir.h"
>  #include "csum-file.h"
>  #include "lockfile.h"
> +#include "sha1-lookup.h"
>  #include "object-store.h"
>  #include "packfile.h"
>  #include "midx.h"
> @@ -64,7 +65,7 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>
> m = xcalloc(1, sizeof(*m) + strlen(object_dir) + 1);
> strcpy(m->object_dir, object_dir);
> -   m->data = midx_map;
> +   m->data = (const unsigned char*)midx_map;

Hmm? Why is this typecast only needed now? Or is it not really needed at all?

>
> m->signature = get_be32(m->data);
> if (m->signature != MIDX_SIGNATURE) {
> @@ -145,7 +146,9 @@ struct midxed_git *load_midxed_git(const char *object_dir)
>
> m->num_objects = ntohl(m->chunk_oid_fanout[255]);
>
> -   m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> +   m->packs = xcalloc(m->num_packs, sizeof(*m->packs));
> +
> +   ALLOC_ARRAY(m->pack_names, m->num_packs);

Please make this ALLOC_ARRAY change in the patch that adds
xcalloc(m->num_packs).

> for (i = 0; i < m->num_packs; i++) {
> if (i) {
> if (ntohl(m->chunk_pack_lookup[i]) <= 
> ntohl(m->chunk_pack_lookup[i - 1])) {
> @@ -175,6 +178,95 @@ struct midxed_git *load_midxed_git(const char 
> *object_dir)
> exit(1);
>  }
>
> +static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
> +{
> +   struct strbuf pack_name = STRBUF_INIT;
> +
> +   if (pack_int_id >= m->num_packs)
> +   BUG("bad pack-int-id");
> +
> +   if (m->packs[pack_int_id])
> +   return 0;
> +
> +   strbuf_addstr(&pack_name, m->object_dir);
> +   strbuf_addstr(&pack_name, "/pack/");
> +   strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);

Just use strbuf_addf()

> +
> +   m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 
> 1);
> +   strbuf_release(&pack_name);
> +   return !m->packs[pack_int_id];

This is a weird return value convention. Normally we go zero/negative
or non-zero/zero for success/failure.

> +}
-- 
Duy


Re: [PATCH 16/23] midx: prepare midxed_git struct

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 7:02 PM Derrick Stolee  wrote:
>
> Signed-off-by: Derrick Stolee 
> ---
>  midx.c | 22 ++
>  midx.h |  2 ++
>  object-store.h |  7 +++
>  packfile.c |  6 +-
>  4 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/midx.c b/midx.c
> index a49300bf75..5e9290ca8f 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -175,6 +175,28 @@ struct midxed_git *load_midxed_git(const char 
> *object_dir)
> exit(1);
>  }
>
> +int prepare_midxed_git_one(struct repository *r, const char *object_dir)
> +{
> +   struct midxed_git *m = r->objects->midxed_git;
> +   struct midxed_git *m_search;
> +
> +   if (!core_midx)
> +   return 0;
> +
> +   for (m_search = m; m_search; m_search = m_search->next)
> +   if (!strcmp(object_dir, m_search->object_dir))
> +   return 1;
> +
> +   r->objects->midxed_git = load_midxed_git(object_dir);
> +
> +   if (r->objects->midxed_git) {
> +   r->objects->midxed_git->next = m;
> +   return 1;
> +   }
> +
> +   return 0;
> +}
> +
>  static size_t write_midx_header(struct hashfile *f,
> unsigned char num_chunks,
> uint32_t num_packs)
> diff --git a/midx.h b/midx.h
> index a1d18ed991..793203fc4a 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -5,8 +5,10 @@
>  #include "cache.h"
>  #include "object-store.h"
>  #include "packfile.h"
> +#include "repository.h"
>
>  struct midxed_git *load_midxed_git(const char *object_dir);
> +int prepare_midxed_git_one(struct repository *r, const char *object_dir);
>
>  int write_midx_file(const char *object_dir);
>
> diff --git a/object-store.h b/object-store.h
> index 9b671f1b0a..7908d46e34 100644
> --- a/object-store.h
> +++ b/object-store.h
> @@ -130,6 +130,13 @@ struct raw_object_store {
>  */
> struct oidmap *replace_map;
>
> +   /*
> +* private data
> +*
> +* should only be accessed directly by packfile.c and midx.c
> +*/
> +   struct midxed_git *midxed_git;
> +
> /*
>  * private data
>  *
> diff --git a/packfile.c b/packfile.c
> index 1a714fbde9..b91ca9b9f5 100644
> --- a/packfile.c
> +++ b/packfile.c
> @@ -15,6 +15,7 @@
>  #include "tree-walk.h"
>  #include "tree.h"
>  #include "object-store.h"
> +#include "midx.h"
>
>  char *odb_pack_name(struct strbuf *buf,
> const unsigned char *sha1,
> @@ -893,10 +894,13 @@ static void prepare_packed_git(struct repository *r)
>
> if (r->objects->packed_git_initialized)
> return;
> +   prepare_midxed_git_one(r, r->objects->objectdir);
> prepare_packed_git_one(r, r->objects->objectdir, 1);
> prepare_alt_odb(r);
> -   for (alt = r->objects->alt_odb_list; alt; alt = alt->next)
> +   for (alt = r->objects->alt_odb_list; alt; alt = alt->next) {
> +   prepare_midxed_git_one(r, alt->path);
> prepare_packed_git_one(r, alt->path, 0);
> +   }

Ah, so the object path and the linked list in midxed_git are for
alternates. Makes sense. It would have saved me some trouble if you had
only introduced those fields now, when they are actually used (and
become self-explanatory).

> rearrange_packed_git(r);
> prepare_packed_git_mru(r);
> r->objects->packed_git_initialized = 1;
> --
> 2.18.0.rc1
>
-- 
Duy


Re: [PATCH 14/23] midx: write object offsets

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 7:02 PM Derrick Stolee  wrote:
> +static size_t write_midx_large_offsets(struct hashfile *f, uint32_t 
> nr_large_offset,
> +  struct pack_midx_entry *objects, 
> uint32_t nr_objects)
> +{
> +   struct pack_midx_entry *list = objects;
> +   size_t written = 0;
> +
> +   while (nr_large_offset) {
> +   struct pack_midx_entry *obj = list++;
> +   uint64_t offset = obj->offset;
> +
> +   if (!(offset >> 31))
> +   continue;
> +
> +   hashwrite_be32(f, offset >> 32);
> +   hashwrite_be32(f, offset & 0xffffffff);

Not sure if you need UL suffix or something here on 32-bit platform.

> +   written += 2 * sizeof(uint32_t);
> +
> +   nr_large_offset--;
> +   }
> +
> +   return written;
> +}
> +
-- 
Duy


Re: [PATCH 13/23] midx: write object id fanout chunk

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee  wrote:
> @@ -117,9 +123,13 @@ struct midxed_git *load_midxed_git(const char 
> *object_dir)
> die("MIDX missing required pack lookup chunk");
> if (!m->chunk_pack_names)
> die("MIDX missing required pack-name chunk");
> +   if (!m->chunk_oid_fanout)
> +   die("MIDX missing required OID fanout chunk");

_()

> @@ -501,9 +540,13 @@ int write_midx_file(const char *object_dir)
> chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_packs * 
> sizeof(uint32_t);
>
> cur_chunk++;
> -   chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;
> +   chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDFANOUT;

Err.. mistake?

> chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + 
> pack_name_concat_len;
>
> +   cur_chunk++;
> +   chunk_ids[cur_chunk] = MIDX_CHUNKID_OIDLOOKUP;

Same here.

> +   chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + 
> MIDX_CHUNK_FANOUT_SIZE;
> +
> cur_chunk++;
> chunk_ids[cur_chunk] = 0;
> chunk_offsets[cur_chunk] = chunk_offsets[cur_chunk - 1] + nr_entries 
> * MIDX_HASH_LEN;
>
-- 
Duy


Re: [PATCH 12/23] midx: write object ids in a chunk

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:07 PM Derrick Stolee  wrote:
>
> Signed-off-by: Derrick Stolee 
> ---
>  Documentation/technical/pack-format.txt |  4 ++
>  builtin/midx.c  |  2 +
>  midx.c  | 50 +++--
>  object-store.h  |  1 +
>  t/t5319-midx.sh |  4 +-
>  5 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt 
> b/Documentation/technical/pack-format.txt
> index 29bf87283a..de9ac778b6 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -307,6 +307,10 @@ CHUNK DATA:
> name. This is the only chunk not guaranteed to be a multiple of 
> four
> bytes in length, so should be the last chunk for alignment 
> reasons.
>
> +   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)

So N is the number of objects and H is hash size? Please don't let me guess.

> +   The OIDs for all objects in the MIDX are stored in lexicographic
> +   order in this chunk.

The reason we keep all hashes together, tightly packed, is to reduce
the cache footprint. Another observation is that it usually takes just
12 bytes or less to uniquely identify an object, which means we could
pack even tighter if we split the object hash into two chunks: do a
bsearch in the first chunk over just the leading bytes, then verify
that the remaining bytes match in the second chunk. This may matter
more when we move to larger hashes. The split would of course be
configurable, since different projects may have different optimal
values, but the default could be 10/10 bytes.
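A minimal sketch of the two-chunk lookup described above, assuming for brevity that the stored prefixes are unique (a real implementation would have to scan neighbouring entries when two objects share a prefix); the split sizes and all names are illustrative only, not a proposed on-disk format:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20
#define PREFIX_LEN 8
#define SUFFIX_LEN (HASH_LEN - PREFIX_LEN)

/*
 * Binary search only the hot, tightly packed prefix chunk; touch the
 * cold suffix chunk once, to confirm the candidate. Returns the index
 * of the hash in the table, or -1 if it is absent.
 */
long split_lookup(const uint8_t *prefixes, const uint8_t *suffixes,
		  size_t nr, const uint8_t hash[HASH_LEN])
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(hash, prefixes + mid * PREFIX_LEN, PREFIX_LEN);

		if (cmp < 0)
			hi = mid;
		else if (cmp > 0)
			lo = mid + 1;
		else if (!memcmp(hash + PREFIX_LEN,
				 suffixes + mid * SUFFIX_LEN, SUFFIX_LEN))
			return (long)mid;	/* prefix and suffix match */
		else
			return -1;	/* same prefix, different object */
	}
	return -1;
}
```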

> +
> (This section intentionally left incomplete.)
>
>  TRAILER:
> diff --git a/builtin/midx.c b/builtin/midx.c
> index 3a261e9bbf..86edd30174 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -35,6 +35,8 @@ static int read_midx_file(const char *object_dir)
> printf(" pack_lookup");
> if (m->chunk_pack_names)
> printf(" pack_names");
> +   if (m->chunk_oid_lookup)
> +   printf(" oid_lookup");
>
> printf("\n");
>
> diff --git a/midx.c b/midx.c
> index b20d52713c..d06bc6876a 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -14,10 +14,11 @@
>  #define MIDX_HASH_LEN 20
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>
> -#define MIDX_MAX_CHUNKS 2
> +#define MIDX_MAX_CHUNKS 3
>  #define MIDX_CHUNK_ALIGNMENT 4
>  #define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
> +#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>
>  static char *get_midx_filename(const char *object_dir)
> @@ -95,6 +96,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
> m->chunk_pack_names = m->data + chunk_offset;
> break;
>
> +   case MIDX_CHUNKID_OIDLOOKUP:
> +   m->chunk_oid_lookup = m->data + chunk_offset;
> +   break;


I just now realized: how do you protect against duplicate chunks? From
this patch, it looks like you could accept two oidlookup chunks just
fine and then silently ignore the first one.
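A sketch of the kind of guard being asked for: check the chunk ids seen while scanning the lookup table and reject the file on a repeat, instead of silently letting the last duplicate win. The function name and error convention below are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Returns 0 if every chunk id in the lookup table is distinct, -1 if
 * any id occurs twice. The quadratic scan is fine here: nr is bounded
 * by a small MIDX_MAX_CHUNKS.
 */
int check_unique_chunk_ids(const uint32_t *ids, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++)
		for (j = 0; j < i; j++)
			if (ids[i] == ids[j])
				return -1;	/* duplicate: refuse the file */
	return 0;
}
```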

> case 0:
> die("terminating MIDX chunk id appears 
> earlier than expected");
> break;
> @@ -112,6 +117,8 @@ struct midxed_git *load_midxed_git(const char *object_dir)
> die("MIDX missing required pack lookup chunk");
> if (!m->chunk_pack_names)
> die("MIDX missing required pack-name chunk");
> +   if (!m->chunk_oid_lookup)
> +   die("MIDX missing required OID lookup chunk");

_()

>
> m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> for (i = 0; i < m->num_packs; i++) {
> @@ -370,6 +377,32 @@ static size_t write_midx_pack_names(struct hashfile *f,
> return written;
>  }
>
> +static size_t write_midx_oid_lookup(struct hashfile *f, unsigned char 
> hash_len,
> +   struct pack_midx_entry *objects,
> +   uint32_t nr_objects)
> +{
> +   struct pack_midx_entry *list = objects;
> +   uint32_t i;
> +   size_t written = 0;
> +
> +   for (i = 0; i < nr_objects; i++) {
> +   struct pack_midx_entry *obj = list++;
> +
> +   if (i < nr_objects - 1) {
> +   struct pack_midx_entry *next = list;
> +   if (oidcmp(&obj->oid, &next->oid) >= 0)
> +   BUG("OIDs not in order: %s >= %s",
> +   oid_to_hex(&obj->oid),
> +   oid_to_hex(&next->oid));

Indentation. I almost thought oid_to_hex() was a separate statement.

> +   }
> +
> +  

Re: [PATCH 11/23] midx: sort and deduplicate objects from packfiles

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 4:06 PM Derrick Stolee  wrote:
>
> Before writing a list of objects and their offsets to a multi-pack-index
> (MIDX), we need to collect the list of objects contained in the
> packfiles. There may be multiple copies of some objects, so this list
> must be deduplicated.

Can you just do merge-sort with a slight modification to ignore duplicates?
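The merge-sort variant suggested here could look roughly like this, shown on plain ints for brevity (real code would merge k pack iterators keyed on oidcmp() and pack mtime; this is a sketch of the idea, not git's implementation):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Merge two already-sorted arrays into out, dropping duplicates as we
 * go, so no big combined array has to be materialized and sorted
 * first. Returns the number of entries written.
 */
size_t merge_dedup(const int *a, size_t na, const int *b, size_t nb, int *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na || j < nb) {
		int v;

		if (j >= nb || (i < na && a[i] <= b[j]))
			v = a[i++];
		else
			v = b[j++];
		if (!n || out[n - 1] != v)	/* skip duplicate of previous */
			out[n++] = v;
	}
	return n;
}
```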

>
> It is possible to artificially get into a state where there are many
> duplicate copies of objects. That can create high memory pressure if we
> are to create a list of all objects before de-duplication. To reduce
> this memory pressure without a significant performance drop,
> automatically group objects by the first byte of their object id. Use
> the IDX fanout tables to group the data, copy to a local array, then
> sort.
>
> Copy only the de-duplicated entries. Select the duplicate based on the
> most-recent modified time of a packfile containing the object.
>
> Signed-off-by: Derrick Stolee 
> ---
>  midx.c | 138 +
>  1 file changed, 138 insertions(+)
>
> diff --git a/midx.c b/midx.c
> index 923acda72e..b20d52713c 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -4,6 +4,7 @@
>  #include "csum-file.h"
>  #include "lockfile.h"
>  #include "object-store.h"
> +#include "packfile.h"
>  #include "midx.h"
>
>  #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
> @@ -190,6 +191,140 @@ static void sort_packs_by_name(char **pack_names, 
> uint32_t nr_packs, uint32_t *p
> }
>  }
>
> +static uint32_t get_pack_fanout(struct packed_git *p, uint32_t value)
> +{
> +   const uint32_t *level1_ofs = p->index_data;
> +
> +   if (!level1_ofs) {
> +   if (open_pack_index(p))
> +   return 0;
> +   level1_ofs = p->index_data;
> +   }
> +
> +   if (p->index_version > 1) {
> +   level1_ofs += 2;
> +   }
> +
> +   return ntohl(level1_ofs[value]);
> +}

Maybe keep this in packfile.c; refactor the fanout code in there if
necessary, and keep .idx file format knowledge in that file instead of
spreading it out more.

> +
> +struct pack_midx_entry {
> +   struct object_id oid;
> +   uint32_t pack_int_id;
> +   time_t pack_mtime;
> +   uint64_t offset;
> +};
> +
> +static int midx_oid_compare(const void *_a, const void *_b)
> +{
> +   struct pack_midx_entry *a = (struct pack_midx_entry *)_a;
> +   struct pack_midx_entry *b = (struct pack_midx_entry *)_b;

Try not to lose "const" while typecasting.

> +   int cmp = oidcmp(&a->oid, &b->oid);
> +
> +   if (cmp)
> +   return cmp;
> +
> +   if (a->pack_mtime > b->pack_mtime)
> +   return -1;
> +   else if (a->pack_mtime < b->pack_mtime)
> +   return 1;
> +
> +   return a->pack_int_id - b->pack_int_id;
> +}
> +
> +static void fill_pack_entry(uint32_t pack_int_id,
> +   struct packed_git *p,
> +   uint32_t cur_object,
> +   struct pack_midx_entry *entry)
> +{
> +   if (!nth_packed_object_oid(&entry->oid, p, cur_object))
> +   die("failed to located object %d in packfile", cur_object);

_()

> +
> +   entry->pack_int_id = pack_int_id;
> +   entry->pack_mtime = p->mtime;
> +
> +   entry->offset = nth_packed_object_offset(p, cur_object);
> +}
> +
> +/*
> + * It is possible to artificially get into a state where there are many
> + * duplicate copies of objects. That can create high memory pressure if
> + * we are to create a list of all objects before de-duplication. To reduce
> + * this memory pressure without a significant performance drop, automatically
> + * group objects by the first byte of their object id. Use the IDX fanout
> + * tables to group the data, copy to a local array, then sort.
> + *
> + * Copy only the de-duplicated entries (selected by most-recent modified time
> + * of a packfile containing the object).
> + */
> +static struct pack_midx_entry *get_sorted_entries(struct packed_git **p,
> + uint32_t *perm,
> + uint32_t nr_packs,
> + uint32_t *nr_objects)
> +{
> +   uint32_t cur_fanout, cur_pack, cur_object;
> +   uint32_t nr_fanout, alloc_fanout, alloc_objects, total_objects = 0;
> +   struct pack_midx_entry *entries_by_fanout = NULL;
> +   struct pack_midx_entry *deduplicated_entries = NULL;
> +
> +   for (cur_pack = 0; cur_pack < nr_packs; cur_pack++) {
> +   if (open_pack_index(p[cur_pack]))
> +   continue;

Is it a big problem if you fail to open the .idx for a certain pack?
Should we error out and abort instead of continuing on? Later, in the
second pack loop, when get_pack_fanout() returns zero (failure), you
don't seem to catch it and skip the pack.

> +
> +   total_objects += p[cur_pack]->num_objects;
> +   }
> +
> +

Re: [PATCH 10/23] midx: write a lookup into the pack names chunk

2018-06-09 Thread Duy Nguyen
On Thu, Jun 7, 2018 at 7:01 PM Derrick Stolee  wrote:
>
> Signed-off-by: Derrick Stolee 
> ---
>  Documentation/technical/pack-format.txt |  5 +++
>  builtin/midx.c  |  7 
>  midx.c  | 56 +++--
>  object-store.h  |  2 +
>  t/t5319-midx.sh | 11 +++--
>  5 files changed, 75 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/pack-format.txt 
> b/Documentation/technical/pack-format.txt
> index 2b37be7b33..29bf87283a 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -296,6 +296,11 @@ CHUNK LOOKUP:
>
>  CHUNK DATA:
>
> +   Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
> +   P * 4 bytes storing the offset in the packfile name chunk for
> +   the null-terminated string containing the filename for the
> +   ith packfile.
> +

Commit message is too light on this one. Why does this need to be
stored? Isn't the cost of rebuilding this in-core cheap?

Adding this chunk on disk in my opinion only adds more burden. Now you
have to verify that these offsets actually point to the right place.

> Packfile Names (ID: {'P', 'N', 'A', 'M'})
> Stores the packfile names as concatenated, null-terminated 
> strings.
> Packfiles must be listed in lexicographic order for fast lookups 
> by
> diff --git a/builtin/midx.c b/builtin/midx.c
> index fe56560853..3a261e9bbf 100644
> --- a/builtin/midx.c
> +++ b/builtin/midx.c
> @@ -16,6 +16,7 @@ static struct opts_midx {
>
>  static int read_midx_file(const char *object_dir)
>  {
> +   uint32_t i;
> struct midxed_git *m = load_midxed_git(object_dir);
>
> if (!m)
> @@ -30,11 +31,17 @@ static int read_midx_file(const char *object_dir)
>
> printf("chunks:");
>
> +   if (m->chunk_pack_lookup)
> +   printf(" pack_lookup");
> if (m->chunk_pack_names)
> printf(" pack_names");
>
> printf("\n");
>
> +   printf("packs:\n");
> +   for (i = 0; i < m->num_packs; i++)
> +   printf("%s\n", m->pack_names[i]);
> +
> printf("object_dir: %s\n", m->object_dir);
>
> return 0;
> diff --git a/midx.c b/midx.c
> index d4f4a01a51..923acda72e 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -13,8 +13,9 @@
>  #define MIDX_HASH_LEN 20
>  #define MIDX_MIN_SIZE (MIDX_HEADER_SIZE + MIDX_HASH_LEN)
>
> -#define MIDX_MAX_CHUNKS 1
> +#define MIDX_MAX_CHUNKS 2
>  #define MIDX_CHUNK_ALIGNMENT 4
> +#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
>  #define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
>  #define MIDX_CHUNKLOOKUP_WIDTH (sizeof(uint32_t) + sizeof(uint64_t))
>
> @@ -85,6 +86,10 @@ struct midxed_git *load_midxed_git(const char *object_dir)
> uint64_t chunk_offset = get_be64(m->data + 16 + 
> MIDX_CHUNKLOOKUP_WIDTH * i);
>
> switch (chunk_id) {
> +   case MIDX_CHUNKID_PACKLOOKUP:
> +   m->chunk_pack_lookup = (uint32_t *)(m->data + 
> chunk_offset);
> +   break;
> +
> case MIDX_CHUNKID_PACKNAMES:
> m->chunk_pack_names = m->data + chunk_offset;
> break;
> @@ -102,9 +107,32 @@ struct midxed_git *load_midxed_git(const char 
> *object_dir)
> }
> }
>
> +   if (!m->chunk_pack_lookup)
> +   die("MIDX missing required pack lookup chunk");
> if (!m->chunk_pack_names)
> die("MIDX missing required pack-name chunk");
>
> +   m->pack_names = xcalloc(m->num_packs, sizeof(const char *));
> +   for (i = 0; i < m->num_packs; i++) {
> +   if (i) {
> +   if (ntohl(m->chunk_pack_lookup[i]) <= 
> ntohl(m->chunk_pack_lookup[i - 1])) {
> +   error("MIDX pack lookup value %d before %d",
> + ntohl(m->chunk_pack_lookup[i - 1]),
> + ntohl(m->chunk_pack_lookup[i]));
> +   goto cleanup_fail;
> +   }
> +   }
> +
> +   m->pack_names[i] = (const char *)(m->chunk_pack_names + 
> ntohl(m->chunk_pack_lookup[i]));
> +
> +   if (i && strcmp(m->pack_names[i], m->pack_names[i - 1]) <= 0) 
> {
> +   error("MIDX pack names out of order: '%s' before 
> '%s'",
> + m->pack_names[i - 1],
> + m->pack_names[i]);
> +   goto cleanup_fail;
> +   }
> +   }
> +
> return m;
>
>  cleanup_fail:
> @@ -162,6 +190,20 @@ static void sort_packs_by_name(char **pack_names, 
> uint32_t nr_packs, uint32_t *p
> }
>  }
>
> +static size_t write_midx_pack_lookup(struct hashfile *f,
> +   

Re: [RFC PATCH 3/6] commit-graph: enable replace-object and grafts

2018-06-09 Thread Jakub Narebski
Derrick Stolee  writes:

First, this commit seems to try to do two things at once, and in its
current form it should be split into adding destroy_commit_graph() and
into grafts / replacements check.  Those are separate and unconnected
features, and should be in separate patches, in my opinion.

> Create destroy_commit_graph() method to delete the commit-graph file
> when history is altered by a replace-object call. If the commit-graph
> is rebuilt after that, we will load the correct object while reading
> the commit-graph.

I'm not sure if this is the best solution.  It would work both for
invalidation, and for turning off the feature, but in my opinion there
are better solutions for both:
 - in case of invalidation, wouldn't it be better to re-create the file?
 - in case of turning off, wouldn't it be better to simply set
   core_commit_graph to false?

Nevertheless I think destroy_commit_graph(), however it ends up being
named, would be good to have.

> When parsing a commit, first check if the commit was grafted. If so,
> then ignore the commit-graph for that commit and insted use the parents
> loaded by parsing the commit buffer and comparing against the graft
> file.

Unless you turn off using generation numbers and other such indices,
this is not enough to prevent generating incorrect results if the
current commit-graph file doesn't agree with the current altered view
of the project history.

Take a simple example of joining two histories using replacements
(git-replace) or grafts.  Numbers denote generation numbers:

Original history:

A<---B<---C <===  master
1    2    3


a<---b<---c<---d<---e   <===  historical
1    2    3    4    5

Altered view of history:

a<---b<---c<---d<---e<---[A]<---B<---C   <===  master

In the new view of history, 'e' is reachable from 'C'.  But going by
the pre-alteration generation numbers, gen_{pre}(e) = 5  >  gen_{pre}(C) = 3.
This means that we never even arrive at '[A]', the commit whose parents
the replacement object or graft changed compared to what's loaded from
the commit-graph file, doesn't it?
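The counterexample can be made concrete with a toy model: a reachability check that short-circuits on the stale, pre-graft generation numbers wrongly reports that 'e' is not an ancestor of 'C'. Everything below (the parent table, the generation values) is made up to mirror the diagram above and is not git code:

```c
#include <assert.h>

/*
 * Commits: a=0, b=1, c=2, d=3, e=4, A=5, B=6, C=7.
 * parent[x] is the single parent's index, or -1 for a root.
 * After the graft, A's parent becomes e.
 */
static const int parent[8] = { -1, 0, 1, 2, 3, 4 /* graft */, 5, 6 };

/* Stale generation numbers, computed before the graft. */
static const int gen[8] = { 1, 2, 3, 4, 5, 1, 2, 3 };

/* Is 'anc' an ancestor of 'start'? Naive walk, no pruning. */
int reachable(int start, int anc)
{
	int c;

	for (c = start; c != -1; c = parent[c])
		if (c == anc)
			return 1;
	return 0;
}

/* Same question, but short-circuiting on the stale generation numbers. */
int reachable_gen(int start, int anc)
{
	if (gen[anc] > gen[start])
		return 0;	/* "impossible" by the (now wrong) invariant */
	return reachable(start, anc);
}
```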

>
> Signed-off-by: Derrick Stolee 
> ---
>  builtin/replace.c |  3 +++
>  commit-graph.c| 20 +++-
>  commit-graph.h|  9 +
>  commit.c  |  5 +
>  4 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/replace.c b/builtin/replace.c
> index 9f01f3fc7f..d553aadcdc 100644
> --- a/builtin/replace.c
> +++ b/builtin/replace.c
> @@ -15,6 +15,7 @@
>  #include "parse-options.h"
>  #include "run-command.h"
>  #include "tag.h"
> +#include "commit-graph.h"
>  
>  static const char * const git_replace_usage[] = {
>   N_("git replace [-f] <object> <replacement>"),
> @@ -468,6 +469,8 @@ int cmd_replace(int argc, const char **argv, const char 
> *prefix)
>   usage_msg_opt("--raw only makes sense with --edit",
> git_replace_usage, options);
>  
> + destroy_commit_graph(get_object_directory());
> +
>   switch (cmdmode) {
>   case MODE_DELETE:
>   if (argc < 1)
> diff --git a/commit-graph.c b/commit-graph.c
> index e9195dfb17..95af4ed519 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -6,6 +6,7 @@
>  #include "pack.h"
>  #include "packfile.h"
>  #include "commit.h"
> +#include "dir.h"
>  #include "object.h"
>  #include "refs.h"
>  #include "revision.h"
> @@ -240,15 +241,22 @@ static struct commit_list **insert_parent_or_die(struct 
> commit_graph *g,
>  {
>   struct commit *c;
>   struct object_id oid;
> + const unsigned char *real;
>  
>   if (pos >= g->num_commits)
>   die("invalid parent position %"PRIu64, pos);
>  
>   hashcpy(oid.hash, g->chunk_oid_lookup + g->hash_len * pos);
> +
> + real = lookup_replace_object(oid.hash);

So this function returns either original SHA-1 / OID of object, or if
object should be replaced then it returns the replacement object's name
(replaced recursively, if necessary).

This deals with replacements, but not with grafts (from graft file)!

> +
> + c = lookup_commit(&oid);
>   if (!c)
> + die("could not find commit %s", oid_to_hex(&oid));
> - c->graph_pos = pos;
> +
> + if (!hashcmp(real, oid.hash))
> + c->graph_pos = pos;

I don't quite understand this reasoning here.  The serialized
commit-graph file can either follow the current state of altered
history, or not.  In the latter case the commit-graph can either follow
the original history, or an altered but not current view of the history.

What does it matter whether the object was replaced or not when marking
the object as coming from the graph at position 'pos'?  If the commit is
replaced, it still comes from the commit-graph at position 'pos', doesn't it?

Also, what if the starting commit was replaced?  Then this logic
wouldn't fire, would it?


I don't see how it relates to what's written in the commit message:

DS> When parsing a commit, first check if the commit was grafted. If so,
DS> then ignore the commit-graph for that commit and 

Re: [PATCH 20/20] abbrev: add a core.validateAbbrev setting

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:

> Instead of trying really hard to find an unambiguous SHA-1 we can with
> core.validateAbbrev=false, and preferably combined with the new
> support for relative core.abbrev values (such as +4) unconditionally
> print a short SHA-1 without doing any disambiguation check. I.e. it

This first paragraph read weirdly the first time. On the second attempt
I knew how to structure it and got it right. It might be easier to read
if the part about core.abbrev=+4 were in a separate second sentence.

That last "it" is "the combination of these two configs", right?

> allows for picking a trade-off between performance, and the odds that
> future or remote (or current and local) short SHA-1 will be ambiguous.

> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index abf07be7b6..df31d1351f 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -925,6 +925,49 @@ means to add or subtract N characters from the SHA-1 
> that Git would
>  otherwise print, this allows for producing more future-proof SHA-1s
>  for use within a given project, while adjusting the value for the
>  current approximate number of objects.
> ++
> +This is especially useful in combination with the
> +`core.validateAbbrev` setting, or to get more future-proof hashes to
> +reference in the future in a repository whose number of objects is
> +expected to grow.

Maybe s/validateAbbrev/validateAbbrev = false/?

> +
> +core.validateAbbrev::
> +   If set to false (true by default) don't do any validation when
> +   printing abbreviated object names to see if they're really
> +   unique. This makes printing objects more performant at the
> +   cost of potentially printing object names that aren't unique
> +   within the current repository.

Good. I understand why I'd want to use it, and why not.

> ++
> +When printing abbreviated object names Git needs to look through the
> +local object store. This is an `O(log N)` operation assuming all the
> +objects are in a single pack file, but `X * O(log N)` given `X` pack
> +files, which can get expensive on some larger repositories.

This might be very close to too much information.

> ++
> +This setting changes that to `O(1)`, but with the trade-off that
> +depending on the value of `core.abbrev` we may be printing abbreviated
> +hashes that collide. Too see how likely this is, try running:
> ++
> +---
> +git log --all --pretty=format:%h --abbrev=4 | perl -nE 'chomp; say length' | sort | uniq -c | sort -nr
> +---
> ++
> +This shows how many commits were found at each abbreviation length. On
> +linux.git in June 2018 this shows a bit more than 750,000 commits,
> +with just 4 needing 11 characters to be fully abbreviated, and the
> +default heuristic picks a length of 12.

These last few paragraphs seem like too much to me.

> ++
> +Even without `core.validateAbbrev=false` the results abbreviation
> +already a bit of a probability game. They're guaranteed at the moment
> +of generation, but as more objects are added, ambiguities may be
> +introduced. Likewise, what's unambiguous for you may not be for
> +somebody else you're communicating with, if they have their own clone.

This seems more useful.

> ++
> +Therefore the default of `core.validateAbbrev=true` may not save you
> +in practice if you're sharing the SHA-1 or noting it now to use after
> +a `git fetch`. You may be better off setting `core.abbrev` to
> +e.g. `+2` to add 2 extra characters to the SHA-1, and possibly combine
> +that with `core.validateAbbrev=false` to get a reasonable trade-off
> +between safety and performance.

Makes sense. As before, I'd suggest s/SHA-1/object ID/.

I do wonder if this documentation could be a bit less verbose without
sacrificing too much correctness and clarity. No brilliant suggestions
on how to do that, sorry.

Martin


Re: [PATCH 19/20] abbrev: support relative abbrev values

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index ab641bf5a9..abf07be7b6 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -919,6 +919,12 @@ core.abbrev::
> in your repository, which hopefully is enough for
> abbreviated object names to stay unique for some time.
> The minimum length is 4.
> ++
> +This can also be set to relative values such as `+2` or `-2`, which
> +means to add or subtract N characters from the SHA-1 that Git would
> +otherwise print, this allows for producing more future-proof SHA-1s
> +for use within a given project, while adjusting the value for the
> +current approximate number of objects.

How about s/, this/. This/ to break it up a little? Also, you write "+2"
and "-2" but then "N". Unify it?

Also, I'd suggest s/SHA-1/object ID/ to be future-proof.
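A minimal sketch of the proposed semantics may make the `+2`/`-2` wording easier to evaluate. This is an illustration only, assuming relative values simply shift the length Git's heuristic would otherwise pick; the function name and constants are made up, not taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>

#define HEXSZ 40		/* SHA-1; a future hash may use another size */
#define MINIMUM_ABBREV 4

/*
 * Illustrative only: apply a relative core.abbrev value ("+2"/"-2") to
 * the length the abbreviation heuristic already picked, clamping the
 * result into the valid range.  Returns -1 on invalid input so the
 * caller can die() with its own message, as the patch does.
 */
static int apply_relative_abbrev(const char *value, int picked)
{
	char *end;
	long v = strtol(value, &end, 10);

	if (*end)
		return -1;	/* trailing garbage */
	if (v == 0)
		return -1;	/* "relative values must be non-zero" */
	if (labs(v) > HEXSZ)
		return -1;	/* "impossibly out of range" */
	picked += (int)v;
	if (picked < MINIMUM_ABBREV)
		picked = MINIMUM_ABBREV;
	if (picked > HEXSZ)
		picked = HEXSZ;
	return picked;
}
```

So with the heuristic picking 12 on linux.git, `+2` would print 14-character abbreviations and `-2` ten-character ones.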

> diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
> index f466600972..f1114a7b8d 100644
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -384,6 +384,9 @@ endif::git-format-patch[]
> independent of the `--full-index` option above, which controls
> the diff-patch output format.  Non default number of
> digits can be specified with `--abbrev=`.
> ++
> +Can also be set to a relative value, see `core.abbrev` in
> +linkgit:git-diff[1].

Good. You then add this paragraph to lots of other places...

> diff --git a/config.c b/config.c
> index 12f762ad92..cd95c6bdfb 100644
> --- a/config.c
> +++ b/config.c
> @@ -1151,6 +1151,17 @@ static int git_default_core_config(const char *var, const char *value)
> return config_error_nonbool(var);
> if (!strcasecmp(value, "auto")) {
> default_abbrev = -1;
> +   } else if (*value == '+' || *value == '-') {
> +   int relative = git_config_int(var, value);
> +   if (relative == 0)
> +   die(_("bad core.abbrev value %s. "

Trailing period? Same below.

> + "relative values must be non-zero"),
> +   value);
> +   if (abs(relative) > GIT_SHA1_HEXSZ)
> +   die(_("bad core.abbrev value %s. "
> + "impossibly out of range"),
> +   value);
> +   default_abbrev_relative = relative;
> } else {
> int abbrev = git_config_int(var, value);
> if (abbrev < minimum_abbrev || abbrev > 40)
> diff --git a/diff.c b/diff.c
> index e0141cfbc0..f7861b8472 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -4801,16 +4801,28 @@ int diff_opt_parse(struct diff_options *options,
> else if (!strcmp(arg, "--abbrev"))
> options->abbrev = DEFAULT_ABBREV;
> else if (skip_prefix(arg, "--abbrev=", &arg)) {
> +   int v;
> char *end;
> if (!strcmp(arg, ""))
> die("--abbrev expects a value, got '%s'", arg);
> -   options->abbrev = strtoul(arg, &end, 10);
> +   v = strtoul(arg, &end, 10);
> if (*end)
> die("--abbrev expects a numerical value, got '%s'", arg);
> -   if (options->abbrev < MINIMUM_ABBREV) {
> +   if (*arg == '+' || *arg == '-') {
> +   if (v == 0) {
> +   die("relative abbrev must be non-zero");
> +   } else if (abs(v) > the_hash_algo->hexsz) {
> +   die("relative abbrev impossibly out of range");
> +   } else {
> +   default_abbrev_relative = v;
> +   options->abbrev = DEFAULT_ABBREV;
> +   }
> +   } else if (v < MINIMUM_ABBREV) {
> options->abbrev = MINIMUM_ABBREV;
> -   } else if (the_hash_algo->hexsz < options->abbrev) {
> +   } else if (the_hash_algo->hexsz < v) {
> options->abbrev = the_hash_algo->hexsz;
> +   } else {
> +   options->abbrev = v;

I've cut out a few instances of more-or-less repeated code. Any
possibility of extracting this into a helper? Maybe after you've done
the preparatory work of unifying these sites. Or as part of it, i.e.,
"let's switch this spot to use the helper; that makes it stricter in
this-and-that sense".

These can't all be entirely unified, I guess, but maybe "mostly"?
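For what it's worth, the helper could look something like this rough sketch, under the assumption that all call sites share the parse/validate/clamp steps and differ only in how they report errors; none of these names come from the series.

```c
#include <assert.h>
#include <stdlib.h>

#define MINIMUM_ABBREV 4
#define HEXSZ 40	/* stands in for the_hash_algo->hexsz */

/*
 * Hypothetical helper unifying the repeated --abbrev/core.abbrev
 * handling: reject empty values and trailing garbage (return -1 so each
 * caller can emit its own die()/opterror() message), and clamp numeric
 * values into [MINIMUM_ABBREV, HEXSZ] as the existing call sites do.
 */
static int parse_abbrev_len(const char *arg)
{
	char *end;
	long v;

	if (!*arg)
		return -1;		/* "--abbrev expects a value" */
	v = strtol(arg, &end, 10);
	if (*end)
		return -1;		/* e.g. "--abbrev=7a" */
	if (v < MINIMUM_ABBREV)
		return MINIMUM_ABBREV;
	if (v > HEXSZ)
		return HEXSZ;
	return (int)v;
}
```

The relative `+N`/`-N` case would need an extra out-parameter or a second helper, which may be why the sites can only "mostly" be unified.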

Martin


[PATCH] gitworkflows: fix grammar in 'Merge upwards' rule

2018-06-09 Thread Kyle Meyer
Signed-off-by: Kyle Meyer 
---
 Documentation/gitworkflows.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/gitworkflows.txt b/Documentation/gitworkflows.txt
index 926e044d0..ca11c7bda 100644
--- a/Documentation/gitworkflows.txt
+++ b/Documentation/gitworkflows.txt
@@ -107,7 +107,7 @@ the unstable branch into the stable one.  Hence the following:
 .Merge upwards
 [caption="Rule: "]
 =
-Always commit your fixes to the oldest supported branch that require
+Always commit your fixes to the oldest supported branch that requires
 them.  Then (periodically) merge the integration branches upwards into each
 other.
 =
-- 
2.11.0



Re: [PATCH 17/20] abbrev: unify the handling of empty values

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 16:24, Martin Ågren  wrote:
> On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:
>> For no good reason the --abbrev= command-line option was less strict
>> than the core.abbrev config option, which came down to the latter
>> using git_config_int() which rejects an empty string, but the rest of
>> the parsing using strtoul() which will convert it to 0.
>
> It will still be less strict in that it accepts trailing garbage, e.g.,
> `--abbrev=7a`. Probably ok to leave it at that in this series, but
> possibly useful to mention here that this only makes these options "less
> differently strict".

Hmpf, please ignore. That's what I get for looking at a few patches,
taking a break, picking it up again and completely forgetting what's
going on...


Re: [PATCH 17/20] abbrev: unify the handling of empty values

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:
> For no good reason the --abbrev= command-line option was less strict
> than the core.abbrev config option, which came down to the latter
> using git_config_int() which rejects an empty string, but the rest of
> the parsing using strtoul() which will convert it to 0.

It will still be less strict in that it accepts trailing garbage, e.g.,
`--abbrev=7a`. Probably ok to leave it at that in this series, but
possibly useful to mention here that this only makes these options "less
differently strict".

I also notice that the documentation of `--abbrev` starts with "Instead
of showing the full 40-byte hexadecimal object name" which doesn't seem
right. I get much shorter IDs and I can't see that I'd have any
configuration causing this. Anyway, that might not be anything this
series needs to do anything about.

> +   if (!strcmp(arg, ""))
> +   die("--abbrev expects a value, got '%s'", arg);

> +   if (!strcmp(arg, ""))
> +   return opterror(opt, "expects a value", 0);

> +   if (!strcmp(optarg, ""))
> +   die("--abbrev expects a value, got '%s'", optarg);

> +   test_must_fail git branch -v --abbrev= 2>stderr &&
> +   test_i18ngrep "expects a value" stderr &&

> +   test_must_fail git log --abbrev= -1 --pretty=format:%h 2>stderr &&
> +   test_i18ngrep "expects a value" stderr &&

> +   test_must_fail git diff --raw --abbrev= HEAD~ 2>stderr &&
> +   test_i18ngrep "expects a value" stderr &&

Mismatch re i18n-ed or not between implementations and tests?

Martin


Re: [PATCH 09/20] abbrev tests: test for "git-log" behavior

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 11:56, Ævar Arnfjörð Bjarmason  wrote:
>
> On Sat, Jun 09 2018, Martin Ågren wrote:
>
>> On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:
>>> The "log" family of commands does its own parsing for --abbrev in
>>> revision.c, so having dedicated tests for it makes sense.
>>
>>> +for i in $(test_seq 4 40)
>>
>> I've just been skimming so might have missed something, but I see
>> several instances of this construct, and I wonder what this brute-force
>> approach really buys us. An alternative would be, e.g., "for i in 4 23
>> 40". That is, min/max and some arbitrary number in between (odd because
>> the others are even).
>>
>> Of course, we might have a bug which magically happens for the number 9,
>> but I'd expect us to test for that only if we have some reason to
>> believe that number 9 is indeed magical.
>
> Good point, I'll change this in v2, or at least guard it with
> EXPENSIVE. I hacked it up like this while exhaustively testing things
> during development, and discovered some edge cases (e.g. "0" is special
> sometimes).

Ah, "useful during hacking" explains why you did it like this. Of your
two approaches, I'd probably favour "make it cheaper" over "mark it as
EXPENSIVE". Nothing I feel strongly about.
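Concretely, the cheaper variant could look like this sketch; it is a hypothetical test fragment, with the `git log` invocation stubbed out by an `echo` so it stands alone, and `hexsz` hard-coded here rather than taken from a (not yet existing) test_translate:

```shell
# Probe only the minimum, the maximum and one odd mid-range length,
# instead of every value in test_seq 4 40.
hexsz=40
for i in 4 23 $hexsz
do
	# the real test would run something like:
	#   git log -1 --pretty=format:%h --abbrev=$i
	echo "abbrev=$i"
done
```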

>> Also, 40 is of course tied to SHA-1. You could perhaps define a variable
>> at the top of this file to simplify a future generalization. (Same for
>> 39/41 which are related to 40.)
>
> I forgot to note this in the commit message, but I intentionally didn't
> guard this test with the SHA1 prereq, there's nothing per-se specific to
> SHA-1 here, it's not a given that whatever our NewHash is that we won't
> use 40 characters, and the rest of the magic constants like 4 and 7 is
> something we're likely to retain with NewHash.

I'd tend to agree about not marking this SHA1.

> Although maybe we should expose GIT_SHA1_HEXSZ to the test suite.

It seems like brian's "test_translate"-approach [1] would be a good
choice of tool for this. That is, you'd just define something at the top
of this file for now, then once that tool is in place, a one-line change
could get "hexsz" from `test_translate` instead.

[1] 
https://public-inbox.org/git/20180604235229.279814-2-sand...@crustytoothpaste.net/

Martin


Re: [PATCH 2/2] builtin/blame: highlight recently changed lines

2018-06-09 Thread René Scharfe
Am 17.04.2018 um 23:30 schrieb Stefan Beller:
> +static void parse_color_fields(const char *s)
> +{
> + struct string_list l = STRING_LIST_INIT_DUP;
> + struct string_list_item *item;
> + enum { EXPECT_DATE, EXPECT_COLOR } next = EXPECT_COLOR;
> +
> + colorfield_nr = 0;
> +
> + /* Ideally this would be stripped and split at the same time? */

Why?  Both approxidate() and color_parse() handle spaces.

> + string_list_split(&l, s, ',', -1);
> + ALLOC_GROW(colorfield, colorfield_nr + 1, colorfield_alloc);
> +
> + for_each_string_list_item(item, &l) {
> + switch (next) {
> + case EXPECT_DATE:
> + colorfield[colorfield_nr].hop = approxidate(item->string);
> + next = EXPECT_COLOR;
> + colorfield_nr++;
> + ALLOC_GROW(colorfield, colorfield_nr + 1, colorfield_alloc);
> + break;
> + case EXPECT_COLOR:
> + if (color_parse(item->string, colorfield[colorfield_nr].col))
> + die(_("expecting a color: %s"), item->string);
> + next = EXPECT_DATE;
> + break;
> + }
> + }
> +
> + if (next == EXPECT_COLOR)
> + die (_("must end with a color"));
> +
> + colorfield[colorfield_nr].hop = TIME_MAX;
> +}

This adds a minor memory leak; fix below.

-- >8 --
Subject: [PATCH] blame: release string_list after use in parse_color_fields()

Signed-off-by: Rene Scharfe 
---
 builtin/blame.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/builtin/blame.c b/builtin/blame.c
index 4202584f97..3295718841 100644
--- a/builtin/blame.c
+++ b/builtin/blame.c
@@ -411,6 +411,7 @@ static void parse_color_fields(const char *s)
die (_("must end with a color"));
 
colorfield[colorfield_nr].hop = TIME_MAX;
+   string_list_clear(&l, 0);
 }
 
 static void setup_default_color_by_age(void)
-- 
2.17.1


Re: BUG: submodule code prints '(null)'

2018-06-09 Thread Duy Nguyen
On Tue, Jun 05, 2018 at 05:31:41PM +0200, Duy Nguyen wrote:
> I do not know how to reproduce this (and didn't bother to look deeply
> into it after I found it was not a trivial fix) but one of my "git
> fetch" showed
> 
> warning: Submodule in commit be2db96a6c506464525f588da59cade0cedddb5e
> at path: '(null)' collides with a submodule named the same. Skipping
> it.

The problem is default_name_or_path() can return NULL when a submodule
is not populated. The fix could simply be printing path instead of
name (because we are talking about path in the commit message), like
below.

But I don't really understand c68f837576 (implement fetching of moved
submodules - 2017-10-16), the commit that made this change, and not
sure if we should be reporting name here or path. Heiko?

diff --git a/submodule.c b/submodule.c
index 939d6870ec..61c2177755 100644
--- a/submodule.c
+++ b/submodule.c
@@ -745,7 +745,7 @@ static void collect_changed_submodules_cb(struct diff_queue_struct *q,
warning("Submodule in commit %s at path: "
"'%s' collides with a submodule named "
"the same. Skipping it.",
-   oid_to_hex(commit_oid), name);
+   oid_to_hex(commit_oid), p->two->path);
name = NULL;
}
}



> 
> I think it's reported that some libc implementation will not be able
> to gracefully handle NULL strings like glibc and may crash instead of
> printing '(null)' here. I'll leave it to submodule people to fix this
> :)
> -- 
> Duy


Re: [PATCH 09/20] abbrev tests: test for "git-log" behavior

2018-06-09 Thread Ævar Arnfjörð Bjarmason


On Sat, Jun 09 2018, Martin Ågren wrote:

> On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:
>> The "log" family of commands does its own parsing for --abbrev in
>> revision.c, so having dedicated tests for it makes sense.
>
>> +for i in $(test_seq 4 40)
>
> I've just been skimming so might have missed something, but I see
> several instances of this construct, and I wonder what this brute-force
> approach really buys us. An alternative would be, e.g., "for i in 4 23
> 40". That is, min/max and some arbitrary number in between (odd because
> the others are even).
>
> Of course, we might have a bug which magically happens for the number 9,
> but I'd expect us to test for that only if we have some reason to
> believe that number 9 is indeed magical.

Good point, I'll change this in v2, or at least guard it with
EXPENSIVE. I hacked it up like this while exhaustively testing things
during development, and discovered some edge cases (e.g. "0" is special
sometimes).

> Also, 40 is of course tied to SHA-1. You could perhaps define a variable
> at the top of this file to simplify a future generalization. (Same for
> 39/41 which are related to 40.)

I forgot to note this in the commit message, but I intentionally didn't
guard this test with the SHA1 prereq, there's nothing per-se specific to
SHA-1 here, it's not a given that whatever our NewHash is that we won't
use 40 characters, and the rest of the magic constants like 4 and 7 is
something we're likely to retain with NewHash.

Although maybe we should expose GIT_SHA1_HEXSZ to the test suite.


Re: [PATCH v2 2/2] fsck: avoid looking at NULL blob->object

2018-06-09 Thread SZEDER Gábor
> +test_expect_success 'fsck detects non-blob .gitmodules' '
> + git init non-blob &&
> + (
> + cd non-blob &&
> +
> + # As above, make the funny tree directly to avoid index
> + # restrictions.
> + mkdir subdir &&
> + cp ../.gitmodules subdir/file &&
> + git add subdir/file &&
> + git commit -m ok &&
> + tree=$(git ls-tree HEAD | sed s/subdir/.gitmodules/ | git mktree) &&

And $tree won't be used here, either.

> +
> + test_must_fail git fsck 2>output &&
> + grep gitmodulesBlob output
> + )
> +'
> +
>  test_done
> -- 
> 2.18.0.rc1.446.g4486251e51
> 


Re: [PATCH v2 1/2] t7415: don't bother creating commit for symlink test

2018-06-09 Thread SZEDER Gábor
> As a result, there's no need to create a commit in our
> tests. Let's drop it in the name of simplicity.
> 
> Signed-off-by: Jeff King 
> ---
>  t/t7415-submodule-names.sh | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/t/t7415-submodule-names.sh b/t/t7415-submodule-names.sh
> index a770d92a55..e2aae587ae 100755
> --- a/t/t7415-submodule-names.sh
> +++ b/t/t7415-submodule-names.sh
> @@ -141,7 +141,6 @@ test_expect_success 'fsck detects symlinked .gitmodules file' '

I add a few more lines of context here:

tree=$(
{
printf "100644 blob $content\t$tricky\n" &&

>   printf "120000 blob $target\t.gitmodules\n"
>   } | git mktree
>   ) &&
> - commit=$(git commit-tree $tree) &&

This was the only case where that $tree variable was used, so perhaps
that can go away as well, in the name of even more simplicity?

>  
>   # Check not only that we fail, but that it is due to the
>   # symlink detector; this grep string comes from the config
> -- 
> 2.18.0.rc1.446.g4486251e51
> 
> 


[PATCH v2 2/2] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Jeff King
Commit 159e7b080b (fsck: detect gitmodules files,
2018-05-02) taught fsck to look at the content of
.gitmodules files. If the object turns out not to be a blob
at all, we just complain and punt on checking the content.
And since this was such an obvious and trivial code path, I
didn't even bother to add a test.

Except it _does_ do one non-trivial thing, which is call the
report() function, which wants us to pass a pointer to a
"struct object". Which we don't have (we have only a "struct
object_id"). So we erroneously pass a NULL object to
report(), which gets dereferenced and causes a segfault.

It seems like we could refactor report() to just take the
object_id itself. But we pass the object pointer along to
a callback function, and indeed this ends up in
builtin/fsck.c's objreport() which does want to look at
other parts of the object (like the type).

So instead, let's just use lookup_unknown_object() to get
the real "struct object", and pass that.

Signed-off-by: Jeff King 
---
 fsck.c |  3 ++-
 t/t7415-submodule-names.sh | 18 ++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fsck.c b/fsck.c
index bcae2c30e6..48e7e36869 100644
--- a/fsck.c
+++ b/fsck.c
@@ -1036,7 +1036,8 @@ int fsck_finish(struct fsck_options *options)
 
blob = lookup_blob(oid);
if (!blob) {
-   ret |= report(options, &blob->object,
+   struct object *obj = lookup_unknown_object(oid->hash);
+   ret |= report(options, obj,
  FSCK_MSG_GITMODULES_BLOB,
  "non-blob found at .gitmodules");
continue;
diff --git a/t/t7415-submodule-names.sh b/t/t7415-submodule-names.sh
index e2aae587ae..d7868c11fa 100755
--- a/t/t7415-submodule-names.sh
+++ b/t/t7415-submodule-names.sh
@@ -150,4 +150,22 @@ test_expect_success 'fsck detects symlinked .gitmodules file' '
)
 '
 
+test_expect_success 'fsck detects non-blob .gitmodules' '
+   git init non-blob &&
+   (
+   cd non-blob &&
+
+   # As above, make the funny tree directly to avoid index
+   # restrictions.
+   mkdir subdir &&
+   cp ../.gitmodules subdir/file &&
+   git add subdir/file &&
+   git commit -m ok &&
> +   tree=$(git ls-tree HEAD | sed s/subdir/.gitmodules/ | git mktree) &&
+
+   test_must_fail git fsck 2>output &&
+   grep gitmodulesBlob output
+   )
+'
+
 test_done
-- 
2.18.0.rc1.446.g4486251e51


[PATCH v2 1/2] t7415: don't bother creating commit for symlink test

2018-06-09 Thread Jeff King
Early versions of the fsck .gitmodules detection code
actually required a tree to be at the root of a commit for
it to be checked for .gitmodules. What we ended up with in
159e7b080b (fsck: detect gitmodules files, 2018-05-02),
though, finds a .gitmodules file in _any_ tree (see that
commit for more discussion).

As a result, there's no need to create a commit in our
tests. Let's drop it in the name of simplicity.

Signed-off-by: Jeff King 
---
 t/t7415-submodule-names.sh | 1 -
 1 file changed, 1 deletion(-)

diff --git a/t/t7415-submodule-names.sh b/t/t7415-submodule-names.sh
index a770d92a55..e2aae587ae 100755
--- a/t/t7415-submodule-names.sh
+++ b/t/t7415-submodule-names.sh
@@ -141,7 +141,6 @@ test_expect_success 'fsck detects symlinked .gitmodules file' '
printf "120000 blob $target\t.gitmodules\n"
} | git mktree
) &&
-   commit=$(git commit-tree $tree) &&
 
# Check not only that we fail, but that it is due to the
# symlink detector; this grep string comes from the config
-- 
2.18.0.rc1.446.g4486251e51



[PATCH v2 0/2] .gitmodules fsck cleanups

2018-06-09 Thread Jeff King
Here's a v2 which fixes the comment typo and drops the unnecessary
commit creation in the tests. I've added a preparatory patch to do the
same cleanup in the existing test for consistency.

As before, these should apply on 'maint'.

  [1/2]: t7415: don't bother creating commit for symlink test
  [2/2]: fsck: avoid looking at NULL blob->object

 fsck.c |  3 ++-
 t/t7415-submodule-names.sh | 19 ++-
 2 files changed, 20 insertions(+), 2 deletions(-)

-Peff


Re: [PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 10:50:36AM +0200, Martin Ågren wrote:

> On 9 June 2018 at 10:32, Jeff King  wrote:
> > Except it _does_ do one non-trivial thing, which is call the
> > report() function, which wants us to pass a pointer to a
> > "struct object". Which we don't have (we have only a "struct
> > object_id"). So we erroneously passed the NULL object, which
> 
> s/passed/dereferenced/? Probably doesn't affect the fix though.

Well, we passed it, and then that function dereferenced it. :)

I'm going to re-roll for the minor bits that Eric pointed out, so I'll
try to word this better.

-Peff


Re: [PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 10:45:15AM +0200, Duy Nguyen wrote:

> > It seems like we could refactor report() to just take the
> > object_id itself. But we pass the object pointer along to
> > a callback function, and indeed this ends up in
> > builtin/fsck.c's objreport() which does want to look at
> > other parts of the object (like the type).
> 
> And objreport() can handle OBJ_NONE well, which is the type given by
> lookup_unknown_object(). So yeah this looks good.

Actually, we know that we have some "real" type, because otherwise
lookup_blob() would not have returned NULL. In fact, we could actually
use lookup_object() here to find that object, knowing that it exists in
the hash. But IMHO that would be depending too much on how lookup_blob()
works. :)

-Peff


Re: [PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 04:38:54AM -0400, Eric Sunshine wrote:

> On Sat, Jun 9, 2018 at 4:32 AM, Jeff King  wrote:
> > Commit 159e7b080b (fsck: detect gitmodules files,
> > 2018-05-02) taught fsck to look at the content of
> > .gitmodules files. If the object turns out not to be a blob
> > at all, we just complain and punt on checking the content.
> > And since this was such an obvious and trivial code path, I
> > didn't even bother to add a test.
> > [...]
> > Signed-off-by: Jeff King 
> > ---
> > diff --git a/t/t7415-submodule-names.sh b/t/t7415-submodule-names.sh
> > @@ -151,4 +151,22 @@ test_expect_success 'fsck detects symlinked .gitmodules file' '
> > +test_expect_success 'fsck detects non-blob .gitmodules' '
> > +   git init non-blob &&
> > +   (
> > +   cd non-blob &&
> > +
> > +   # As above, make the funny directly to avoid index restrictions.
> 
> Is there a word missing after "funny"?

Oops, should be "funny tree" (that's what I get for trying to wordsmith
it at the last minute).

> > +   mkdir subdir &&
> > +   cp ../.gitmodules subdir/file &&
> > +   git add subdir/file &&
> > +   git commit -m ok &&
> > +   tree=$(git ls-tree HEAD | sed s/subdir/.gitmodules/ | git mktree) &&
> > +   commit=$(git commit-tree $tree) &&
> 
> I see that this is just mirroring the preceding test, but do you need
> to assign to variable 'commit' which is never consulted by anything
> later in the test?

No (nor above). I think originally I had planned to points refs to these
commits, but it isn't necessary for fsck. In fact, in the final form of
the patches, we do not even need the commit at all, since we will
complain about .gitmodules in _any_ tree (early versions actually did an
extra pass to find which trees were root trees).

-Peff


Re: [PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 10:32, Jeff King  wrote:
> Except it _does_ do one non-trivial thing, which is call the
> report() function, which wants us to pass a pointer to a
> "struct object". Which we don't have (we have only a "struct
> object_id"). So we erroneously passed the NULL object, which

s/passed/dereferenced/? Probably doesn't affect the fix though.

> ends up segfaulting.

> blob = lookup_blob(oid);
> if (!blob) {
> -   ret |= report(options, &blob->object,
> +   struct object *obj = lookup_unknown_object(oid->hash);
> +   ret |= report(options, obj,
>   FSCK_MSG_GITMODULES_BLOB,
>   "non-blob found at .gitmodules");


Re: [PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Duy Nguyen
On Sat, Jun 9, 2018 at 10:34 AM Jeff King  wrote:
>
> Commit 159e7b080b (fsck: detect gitmodules files,
> 2018-05-02) taught fsck to look at the content of
> .gitmodules files. If the object turns out not to be a blob
> at all, we just complain and punt on checking the content.
> And since this was such an obvious and trivial code path, I
> didn't even bother to add a test.
>
> Except it _does_ do one non-trivial thing, which is call the
> report() function, which wants us to pass a pointer to a
> "struct object". Which we don't have (we have only a "struct
> object_id"). So we erroneously passed the NULL object, which
> ends up segfaulting.
>
> It seems like we could refactor report() to just take the
> object_id itself. But we pass the object pointer along to
> a callback function, and indeed this ends up in
> builtin/fsck.c's objreport() which does want to look at
> other parts of the object (like the type).

And objreport() can handle OBJ_NONE well, which is the type given by
lookup_unknown_object(). So yeah this looks good.

> So instead, let's just use lookup_unknown_object() to get
> the real "struct object", and pass that.
-- 
Duy


Re: [PATCH 09/20] abbrev tests: test for "git-log" behavior

2018-06-09 Thread Martin Ågren
On 9 June 2018 at 00:41, Ævar Arnfjörð Bjarmason  wrote:
> The "log" family of commands does its own parsing for --abbrev in
> revision.c, so having dedicated tests for it makes sense.

> +for i in $(test_seq 4 40)

I've just been skimming so might have missed something, but I see
several instances of this construct, and I wonder what this brute-force
approach really buys us. An alternative would be, e.g., "for i in 4 23
40". That is, min/max and some arbitrary number in between (odd because
the others are even).

Of course, we might have a bug which magically happens for the number 9,
but I'd expect us to test for that only if we have some reason to
believe that number 9 is indeed magical.

Also, 40 is of course tied to SHA-1. You could perhaps define a variable
at the top of this file to simplify a future generalization. (Same for
39/41 which are related to 40.)

Martin


Re: [PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Eric Sunshine
On Sat, Jun 9, 2018 at 4:32 AM, Jeff King  wrote:
> Commit 159e7b080b (fsck: detect gitmodules files,
> 2018-05-02) taught fsck to look at the content of
> .gitmodules files. If the object turns out not to be a blob
> at all, we just complain and punt on checking the content.
> And since this was such an obvious and trivial code path, I
> didn't even bother to add a test.
> [...]
> Signed-off-by: Jeff King 
> ---
> diff --git a/t/t7415-submodule-names.sh b/t/t7415-submodule-names.sh
> @@ -151,4 +151,22 @@ test_expect_success 'fsck detects symlinked .gitmodules file' '
> +test_expect_success 'fsck detects non-blob .gitmodules' '
> +   git init non-blob &&
> +   (
> +   cd non-blob &&
> +
> +   # As above, make the funny directly to avoid index restrictions.

Is there a word missing after "funny"?

> +   mkdir subdir &&
> +   cp ../.gitmodules subdir/file &&
> +   git add subdir/file &&
> +   git commit -m ok &&
> +   tree=$(git ls-tree HEAD | sed s/subdir/.gitmodules/ | git mktree) &&
> +   commit=$(git commit-tree $tree) &&

I see that this is just mirroring the preceding test, but do you need
to assign to variable 'commit' which is never consulted by anything
later in the test?

> +   test_must_fail git fsck 2>output &&
> +   grep gitmodulesBlob output
> +   )
> +'


[PATCH] fsck: avoid looking at NULL blob->object

2018-06-09 Thread Jeff King
Commit 159e7b080b (fsck: detect gitmodules files,
2018-05-02) taught fsck to look at the content of
.gitmodules files. If the object turns out not to be a blob
at all, we just complain and punt on checking the content.
And since this was such an obvious and trivial code path, I
didn't even bother to add a test.

Except it _does_ do one non-trivial thing, which is call the
report() function, which wants us to pass a pointer to a
"struct object". Which we don't have (we have only a "struct
object_id"). So we erroneously passed the NULL object, which
ends up segfaulting.

It seems like we could refactor report() to just take the
object_id itself. But we pass the object pointer along to
a callback function, and indeed this ends up in
builtin/fsck.c's objreport() which does want to look at
other parts of the object (like the type).

So instead, let's just use lookup_unknown_object() to get
the real "struct object", and pass that.

Signed-off-by: Jeff King 
---
The problem is in v2.17.1, and of course the v2.18 release candidates.

 fsck.c |  3 ++-
 t/t7415-submodule-names.sh | 18 ++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fsck.c b/fsck.c
index bcae2c30e6..48e7e36869 100644
--- a/fsck.c
+++ b/fsck.c
@@ -1036,7 +1036,8 @@ int fsck_finish(struct fsck_options *options)
 
blob = lookup_blob(oid);
if (!blob) {
-   ret |= report(options, &blob->object,
+   struct object *obj = lookup_unknown_object(oid->hash);
+   ret |= report(options, obj,
  FSCK_MSG_GITMODULES_BLOB,
  "non-blob found at .gitmodules");
continue;
diff --git a/t/t7415-submodule-names.sh b/t/t7415-submodule-names.sh
index a770d92a55..63c767bf99 100755
--- a/t/t7415-submodule-names.sh
+++ b/t/t7415-submodule-names.sh
@@ -151,4 +151,22 @@ test_expect_success 'fsck detects symlinked .gitmodules file' '
)
 '
 
+test_expect_success 'fsck detects non-blob .gitmodules' '
+   git init non-blob &&
+   (
+   cd non-blob &&
+
+   # As above, make the funny directly to avoid index restrictions.
+   mkdir subdir &&
+   cp ../.gitmodules subdir/file &&
+   git add subdir/file &&
+   git commit -m ok &&
+   tree=$(git ls-tree HEAD | sed s/subdir/.gitmodules/ | git mktree) &&
+   commit=$(git commit-tree $tree) &&
+
+   test_must_fail git fsck 2>output &&
+   grep gitmodulesBlob output
+   )
+'
+
 test_done
-- 
2.18.0.rc1.446.g4486251e51


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Duy Nguyen
On Sat, Jun 9, 2018 at 8:32 AM Ævar Arnfjörð Bjarmason  wrote:
> Let's say you're in a corporate environment with Linux, OSX and Windows
> boxes, but all of whom have some shared mounts provisioned & ability to
> ship an /etc/gitconfig (wherever that lives on Windows).
>
> It's much easier to just do that than figure out how to build a custom
> Git on all three platforms.

Let's say I don't care about making it easier on corporate
environment. You have resources to do that.

> > Not making it a compile-time option could force [1] linux distro
> > to carry this function to everybody even if they don't use it (and
> > it's kinda dangerous to misuse if you don't anonymize the data
> > properly). I also prefer this a compile time option.
>
> Setting GIT_TRACE to a filename that you published is also similarly
> dangerous, so would setting up a trivial 4-line shell alias to wrap
> "git" and log what it's doing.
>
> > [1] Of course many distros can choose to patch it out. But it's the
> > same argument as bringing this option in in the first place: you guys
> > already have that code in private and now want to put it in stock git
> > to reduce maintenance cost, why add extra cost on linux distro
> > maintenance?
>
> Because:
>
> 1) I really don't see the basis for this argument that they'd need to
>patch it out, they're not patching out e.g. GIT_TRACE now, which has
>all the same sort of concerns, it's just a format that's more of a
>hassle to parse than this proposed format.
>
> 2) I think you and Johannes are just seeing the "telemetry" part of
>this, but if you look past that all this *really* is is "GIT_TRACE
>facility that doesn't suck to parse".

If it's simply an improvement over GIT_TRACE, then I have no problem
with that. That is:

- there is no new, permanent way to turn this on, like config keys

- there's no plugin support or whatever for keeping the output. Keep it to
either a file or a file descriptor, like we do with GIT_TRACE.

>There's a lot of use-cases for that which have nothing to do with
>what this facility is originally written for, for example, a novice
>git user could turn it on and have it log in ~ somewhere, and then
>run some contrib script which analyzes his git usage and spews out
>suggestions ("you use this command/option, but not this related
>useful command/option").
>
>Users asking for help on the #git IRC channel or on this mailing list
>could turn this on if they have a problem, and paste it into some
>tool they could show to others to see exactly what they're doing /
>where it went wrong.

And they can use GIT_TRACE now. If GIT_TRACE does not give enough info
(e.g. too few trace points), add them. This new proposal is more about
automation. With GIT_TRACE, which is more human-friendly, I could skim
over the output and remove sensitive info before I send it out for
help. Machine-friendly format (even json) tends to be a lot more
complex and harder to read/filter this way.
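Duy's "either a file or a file descriptor" refers to GIT_TRACE's documented output targets, which can be sketched like this:

```shell
# GIT_TRACE's documented targets, for reference: "1", "2" or "true" send
# trace output to stderr; an integer 3..9 is treated as an already-open
# file descriptor; an absolute path appends to that file.
GIT_TRACE=1 git version                        # trace lines on stderr
GIT_TRACE=/tmp/git-trace.log git version       # trace appended to a file
test -s /tmp/git-trace.log && echo "trace captured"
GIT_TRACE=9 git version 9>>/tmp/git-trace.log  # trace via fd 9
```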
-- 
Duy


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 09:04:16AM +0200, Johannes Sixt wrote:

> > AFAICT this telemetry data is the same thing, but for performance. When
> > somebody says "why does this command take so long", wouldn't it be nice
> > for us to be able to tell them to flip a switch that will collect data
> > on which operations are taking a long time?
> 
> Why do we need long-term surveillance to answer this question and why can it
> not be made a mode of GIT_TRACE?

I guess I don't see how this isn't simply a mode of GIT_TRACE.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Sixt

Am 09.06.2018 um 08:51 schrieb Jeff King:

I actually think this could be useful for normal users, too. We have
GIT_TRACE for dumping debug output, and we sometimes ask users to turn
it on to help diagnose a problem (not to mention that we use it
ourselves).


The BIG difference is in "we ask the users". Asking for a trace is a 
one-shot thing; this telemetry thing is obviously intended for long-term 
surveillance.



AFAICT this telemetry data is the same thing, but for performance. When
somebody says "why does this command take so long", wouldn't it be nice
for us to be able to tell them to flip a switch that will collect data
on which operations are taking a long time?


Why do we need long-term surveillance to answer this question and why 
can it not be made a mode of GIT_TRACE?


-- Hannes


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 08:31:58AM +0200, Ævar Arnfjörð Bjarmason wrote:

> 1) I really don't see the basis for this argument that they'd need to
>patch it out, they're not patching out e.g. GIT_TRACE now, which has
>all the same sort of concerns, it's just a format that's more of a
>hassle to parse than this proposed format.
> 
> 2) I think you and Johannes are just seeing the "telemetry" part of
>this, but if you look past that all this *really* is is "GIT_TRACE
>facility that doesn't suck to parse".

So that actually takes me to another question, which is: instead of
introducing a new "telemetry" feature, could we make GIT_TRACE suck less
to parse?

E.g., could we have a flag or environment variable to have the existing
traces output JSON? I guess right now they're inherently free-form via
trace_printf, so it would involve adding some structured interface
calls. Which is more or less what I guess JeffH's proposed feature
would look like.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Johannes Sixt

Am 09.06.2018 um 00:20 schrieb Ævar Arnfjörð Bjarmason:


On Fri, Jun 08 2018, Johannes Sixt wrote:


Am 08.06.2018 um 18:00 schrieb Thomas Braun:

I for my part would much rather prefer that to be a compile time
option so that I don't need to check on every git update on windows
if  this is now enabled or not.


This is exactly my concern, too! A compile-time option may make it a good
deal less worrisome.


Can you elaborate on how someone who can inject malicious code
into your git package + config would be thwarted by this being some
compile-time option, wouldn't they just compile it in?


Of course they can. But would we, the Git community do that?

From the design document:

> The goal of the telemetry feature is to be able to gather usage data
> across a group of production users to identify real-world performance
> problems in production.  Additionally, it might help identify common
> user errors and guide future user training.

The goal to gather usage data may be valid for a small subset of Git 
installations. But it is wrong to put this into the software itself, in 
particular when the implementation includes scary things like loading 
unspecified dynamic libraries:


> If the config setting "telemetry.plugin" contains the pathname to a
> shared library, the library will be dynamically loaded during start up
> and events will be sent to it using the plugin API.

When you want usage data, ask your users for feedback. Look over their 
shoulders. But do not ask the software itself to gather usage data. It 
will be abused.


Do not offer open source software that has a "call-home" method built-in.

If you want to peek into the workplaces of YOUR users, then monkey-patch 
surveillance into YOUR version of Git. But please do not burden the rest 
of us.


-- Hannes


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Jeff King
On Sat, Jun 09, 2018 at 07:03:53AM +0200, Duy Nguyen wrote:

> > > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
> > >> I for my part would much rather prefer that to be a compile time
> > >> option so that I don't need to check on every git update on windows
> > >> if  this is now enabled or not.
> > >
> > > This is exactly my concern, too! A compile-time option may make it a good
> > > deal less worrisome.
> >
> > Can you elaborate on how someone who can inject malicious code
> > into your git package + config would be thwarted by this being some
> > compile-time option, wouldn't they just compile it in?
> 
> Look at this from a different angle. This is driven by the needs to
> collect telemetry in _controlled_ environment (mostly server side, I
> guess) and it should be no problem to make custom builds there for
> you. Not making it a compile-time option could force [1] linux distro
> to carry this function to everybody even if they don't use it (and
> it's kinda dangerous to misuse if you don't anonymize the data
> properly). I also prefer this be a compile-time option.

I actually think this could be useful for normal users, too. We have
GIT_TRACE for dumping debug output, and we sometimes ask users to turn
it on to help diagnose a problem (not to mention that we use it
ourselves).

AFAICT this telemetry data is the same thing, but for performance. When
somebody says "why does this command take so long", wouldn't it be nice
for us to be able to tell them to flip a switch that will collect data
on which operations are taking a long time?

> [1] Of course many distros can choose to patch it out. But it's the
> same argument as bringing this option in in the first place: you guys
> already have that code in private and now want to put it in stock git
> to reduce maintenance cost, why add extra cost on linux distro
> maintenance?

We don't do performance telemetry like this at GitHub, but we do collect
a few variables that we dump to a custom daemon over a Unix socket (e.g., our
upload-pack records clone vs fetch and shallow vs non-shallow for each
request).

In my experience the maintenance burden is not in the "connect to a
socket" part, but the fact that you have to sprinkle the entry points
throughout the code (e.g., "set 'cloning' flag to 1" or "I'm entering
the pack generation phase of the fetch"). So I'd love to see us do that
sprinkling _once_ where everyone can benefit, whether they want better
tracing for debugging, telemetry across their installed base, or
whatever.

The mechanism to handle those calls is much easier to plug in and out,
then. And I don't have a huge preference for compile-time versus
run-time, or whether bits are in-tree or out-of-tree. Obviously we'd
have some basic "write to stderr or a file" consumer in-tree.

For myself, I'm happy with compile-time (I'm instrumenting the server
side, so I really only care about my specific build) and out-of-tree
(the protocol to our custom daemon consumer is not something anybody
else would care about anyway).

For people collecting from clients, I imagine it's more convenient to be
able to let them use official builds and just tweak a knob, even if they
_could_ build a custom Git and push it out to everybody. I don't know
anything about Windows event tracing, but if it's a standard facility
then I doubt we're talking about a huge maintenance burden to have it
in-tree as a configurable option.

So IMHO that really leaves the "is it too scary to have a config knob
that lets tracing go to this event facility versus to a file"? I guess I
just don't see it, as long as the user has to enable it explicitly. That
seems like complaining that GIT_TRACE could go to syslog. If you don't
want to store it, then don't turn on the feature.

-Peff


Re: [RFC PATCH v1] telemetry design overview (part 1)

2018-06-09 Thread Ævar Arnfjörð Bjarmason


On Sat, Jun 09 2018, Duy Nguyen wrote:

> On Sat, Jun 9, 2018 at 12:22 AM Ævar Arnfjörð Bjarmason
>  wrote:
>>
>>
>> On Fri, Jun 08 2018, Johannes Sixt wrote:
>>
>> > Am 08.06.2018 um 18:00 schrieb Thomas Braun:
>> >> I for my part would much rather prefer that to be a compile time
>> >> option so that I don't need to check on every git update on windows
>> >> if  this is now enabled or not.
>> >
>> > This is exactly my concern, too! A compile-time option may make it a good
>> > deal less worrisome.
>>
>> Can you elaborate on how someone who can inject malicious code
>> into your git package + config would be thwarted by this being some
>> compile-time option, wouldn't they just compile it in?
>
>
> Look at this from a different angle. This is driven by the needs to
> collect telemetry in _controlled_ environment (mostly server side, I
> guess) and it should be no problem to make custom builds there for
> you.

Let's say you're in a corporate environment with Linux, OSX and Windows
boxes, but all of which have some shared mounts provisioned & ability to
ship an /etc/gitconfig (wherever that lives on Windows).

It's much easier to just do that than figure out how to build a custom
Git on all three platforms.

I guess you might make the argument that "that's good", because in
practice that'll mean that it's such a hassle that fewer administrators
will turn this on.

But I think that would be a loss, because that's taking the default view
that people with the rights (i.e. managed config access) to turn on
something like this by default have nefarious motives, and we should do
what we can to stop them.

I don't think that's true, e.g. what I intend to use this for is:

 a) Getting aggregate data on what commands/switches are used, for
purposes of training and prioritizing my upstream contributions.

 b) Aggregate performance data to figure out what hotspots to tackle in
the code.

That's things that'll both benefit the users I'm responsible for, and
the wider git community.

> Not making it a compile-time option could force [1] linux distro
> to carry this function to everybody even if they don't use it (and
> it's kinda dangerous to misuse if you don't anonymize the data
> properly). I also prefer this be a compile-time option.

Setting GIT_TRACE to a filename that you published is also similarly
dangerous, so would setting up a trivial 4-line shell alias to wrap
"git" and log what it's doing.

> [1] Of course many distros can choose to patch it out. But it's the
> same argument as bringing this option in in the first place: you guys
> already have that code in private and now want to put it in stock git
> to reduce maintenance cost, why add extra cost on linux distro
> maintenance?

Because:

1) I really don't see the basis for this argument that they'd need to
   patch it out, they're not patching out e.g. GIT_TRACE now, which has
   all the same sort of concerns, it's just a format that's more of a
   hassle to parse than this proposed format.

2) I think you and Johannes are just seeing the "telemetry" part of
   this, but if you look past that all this *really* is is "GIT_TRACE
   facility that doesn't suck to parse".

   There's a lot of use-cases for that which have nothing to do with
   what this facility is originally written for, for example, a novice
   git user could turn it on and have it log in ~ somewhere, and then
   run some contrib script which analyzes his git usage and spews out
   suggestions ("you use this command/option, but not this related
   useful command/option").

   Users asking for help on the #git IRC channel or on this mailing list
   could turn this on if they have a problem, and paste it into some
   tool they could show to others to see exactly what they're doing /
   where it went wrong.


Re: Truncating file names with Unicode characters

2018-06-09 Thread Jeff King
On Fri, Jun 08, 2018 at 09:25:29PM +0300, Vitaly Potyarkin wrote:

> # Truncating file names with Unicode characters
> 
> When shortening file names that contain Unicode characters, git performs
> truncation without awareness of two-byte characters. That often leads to
> splitting a character in half and displaying a garbage byte that's left.
> 
> Unawareness of Unicode also means that filename length is calculated 
> incorrectly
> and some output gets misaligned.

Yeah, I agree it would be nice for this to be more aware of multi-byte
glyphs.

Unfortunately, we don't actually know what encoding the paths are in
(and they might not even all be in the same encoding). But hopefully we
could at least make the common case work a bit better. If we know that
core.quotepath is off, then the user is expecting paths in their
terminal encoding. So possibly we could just rely on
i18n.logOutputEncoding at that point (which of course defaults to utf8).

If anyone is interested in working on it, the relevant code is in
diff.c:show_stats(). Looks like we do a straight strlen() to get the
width, which should become utf8_strwidth(). That would solve the
problem where we use the wrong width.

To deal with the broken code points, I think you'd need to copy the name
into a strbuf and use strbuf_utf8_replace() to truncate (it doesn't seem
to be well documented, but you can grep for a few other callers).

-Peff


Re: [PATCH 2/2] pack-bitmap: add free function

2018-06-09 Thread Jeff King
On Thu, Jun 07, 2018 at 12:04:14PM -0700, Jonathan Tan wrote:

> Add a function to free struct bitmap_index instances, and use it where
> needed (except when rebuild_existing_bitmaps() is used, since it creates
> references to the bitmaps within the struct bitmap_index passed to it).
> 
> Note that the hashes field in struct bitmap_index is not freed because
> it points to another field within the same struct. The documentation for
> that field has been updated to clarify that.

Hrm, this one fixes the leaks I noticed in the previous patch. I'm OK
with doing it separately if it breaking it down this way makes the code
easier to reason about. But we should probably say something in the
commit message about it.

-Peff


Re: [PATCH 1/2] pack-bitmap: remove bitmap_git global variable

2018-06-09 Thread Jeff King
On Thu, Jun 07, 2018 at 12:04:13PM -0700, Jonathan Tan wrote:

> Remove the bitmap_git global variable. Instead, generate on demand an
> instance of struct bitmap_index for code that needs to access it.
> 
> This allows us significant control over the lifetime of instances of
> struct bitmap_index. In particular, packs can now be closed without
> worrying if an unnecessarily long-lived "pack" field in struct
> bitmap_index still points to it.
> 
> The bitmap API is also clearer in that we need to first obtain a struct
> bitmap_index, then we use it.

I think this is the right direction, and overall it looks pretty good.
There's one call that gave me pause:

> -int prepare_bitmap_git(void)
> +struct bitmap_index *prepare_bitmap_git(void)
>  {
> - if (bitmap_git.loaded)
> - return 0;
> + struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));

This function used to be idempotent, so any code which wanted to use the
global bitmap_git could call it "just in case". After your patch, it's
not. I think it's probably OK, since such functions would generally now
take a bitmap_git argument and use that (e.g., rebuild_existing_bitmaps
works that way after your patch).

> - if (!open_pack_bitmap())
> - return load_pack_bitmap();
> + if (!open_pack_bitmap(bitmap_git) && !load_pack_bitmap(bitmap_git))
> + return bitmap_git;
>  
> - return -1;
> + return NULL;
>  }

We probably need to free(bitmap_git) before returning NULL here (this is
still in prepare_bitmap_git()).

> @@ -662,12 +686,11 @@ int prepare_bitmap_walk(struct rev_info *revs)
>   struct bitmap *wants_bitmap = NULL;
>   struct bitmap *haves_bitmap = NULL;
>  
> - if (!bitmap_git.loaded) {
> - /* try to open a bitmapped pack, but don't parse it yet
> -  * because we may not need to use it */
> - if (open_pack_bitmap() < 0)
> - return -1;
> - }
> + struct bitmap_index *bitmap_git = xcalloc(1, sizeof(*bitmap_git));
> + /* try to open a bitmapped pack, but don't parse it yet
> +  * because we may not need to use it */
> + if (open_pack_bitmap(bitmap_git) < 0)
> + return NULL;

Ditto here (and probably other error returns lower in the function, but
I didn't go through it carefully).

-Peff