Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
Derrick Stolee writes: > diff --git a/commit-graph.h b/commit-graph.h > new file mode 100644 > index 000..dc8c73a > --- /dev/null > +++ b/commit-graph.h > @@ -0,0 +1,7 @@ > +#ifndef COMMIT_GRAPH_H > +#define COMMIT_GRAPH_H > + > +extern char *write_commit_graph(const char *obj_dir); > + > +#endif > + Trailing blank line at the end of the file. Remove it. t5318 has the same issue. Thanks.
Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
On Mon, Feb 19, 2018 at 7:53 PM, Derrick Stolee wrote: > +static int if_packed_commit_add_to_list(const struct object_id *oid, > + struct packed_git *pack, > + uint32_t pos, > + void *data) > +{ > + struct packed_oid_list *list = (struct packed_oid_list*)data; > + enum object_type type; > + unsigned long size; > + void *inner_data; > + off_t offset = nth_packed_object_offset(pack, pos); > + inner_data = unpack_entry(pack, offset, &type, &size); > + > + if (inner_data) > + free(inner_data); The condition is unnecessary, free() can handle a NULL argument just fine. (Suggested by Coccinelle and 'contrib/coccinelle/free.cocci.patch'.)
Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
On 2/23/2018 2:30 PM, Junio C Hamano wrote: Derrick Stolee writes: jt/binsearch-with-fanout introduces one when there is a 256-entry fanout table (not the case here). The bsearch() method in search.h (and used in pack-write.c:need_large_offset) does not return the _position_ of a found element. Neither of these suit my needs, but I could just be searching for the wrong strings. Also, I could divert my energies in this area to create a generic search in the style of jt/binsearch-with-fanout. ... me goes and digs ... What I had in mind was the one in sha1-lookup.c, actually. Having said that, hand-rolling another one is not too huge a deal; eventually people will notice and consolidate code after the series stabilizes anyway ;-) Ah, sha1_pos(). That definitely satisfies my use case. Thanks! My local branch has this replacement. + num_large_edges++; + parent = parent->next; + } while (parent); It feels somewhat wasteful to traverse the commit's parents list only to count, without populating the octopus table (which I understand is assumed to be minority case under this design). Since we are writing the commit graph file in-order, we cannot write the octopus table until after the chunk lengths are known. Oh, my "minority case" comment was me wondering "since we expect there are only a few, why not keep them in memory while we discover them here, so that the writing phase that come later do not have to go through all commits again counting their parents? would it be more performant and a better trade-off?" We can certainly do such an optimization later (iow, it is not all that crucial issue and certainly I didn't mention the above as something that needs to be "fixed"--there is nothing broken). store the octopus table in memory and then dump it into the file later, but walking the parents is quite fast after all the commits are loaded. I'm not sure the time optimization merits the extra complexity here. (I'm happy to revisit this if we do see this performance lacking.) P.S. I really like the name "octopus table" and will use that for informal discussions of this format. I actually do not mind that name used as the official term. I find it far more descriptive and understandable than "long edge" / "large edge" at least to a Git person. I will consider using this in the format and design documents. You're right that int_id isn't great, and your more-specific "oid_table_pos" shows an extra reason why it isn't great: when we add the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value is NOT a table position. Perhaps I am somewhat biased, but it is quite natural for our codebase and internal API to say something like this: x_pos(table, key) function's return value is the non-negative position for the key in the table when the key is there; when it returns a negative value, it is (-1 - position) where the "position" is the position in the table they key would have been found if it was in there. and store the return value from such a function in a variable called "pos". Surely, sometimes "pos" does not have _the_ position, but that does not make it a bad name. Saying "MISSING is a special value that denotes 'nothing is here'" and allowing it to be set to a variable that meant to hold the position is not such a bad thing, though. After all, that is how you use NULL as a special value for a pointer variable ;-). Same for using the high bit to mean something else. Taking these together you would explain "low 31-bit of pos holds the position for the item in the table. MISSING is a special value that you can use to denote there is nothing. The LAST_EDGE bit denotes that one group of positions ends there", or something like that. I think the current name makes the following call very clear: It is still a strange name nevertheless. +char *write_commit_graph(const char *obj_dir) +{ + struct packed_oid_list oids; + struct packed_commit_list commits; + struct sha1file *f; + int i, count_distinct = 0; + DIR *info_dir; + struct strbuf tmp_file = STRBUF_INIT; + struct strbuf graph_file = STRBUF_INIT; + unsigned char final_hash[GIT_MAX_RAWSZ]; + char *graph_name; + int fd; + uint32_t chunk_ids[5]; + uint64_t chunk_offsets[5]; + int num_chunks; + int num_long_edges; + struct commit_list *parent; + + oids.nr = 0; + oids.alloc = (int)(0.15 * approximate_object_count()); Heh, traditionalist would probably avoid unnecessary use of float and use something like 1/4 or 1/8 ;-) After all, it is merely a ballpark guestimate. + num_long_edges = 0; This again is about naming, but I find it a bit unnatural to call the edge between a chind and its octopus parents "long". Individual edges are not long--the only thing that is long is your "list of edges". Some other
Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
Junio C Hamano writes: >> I think the current name makes the following call very clear: > > It is still a strange name nevertheless. Sorry for simply repeating "strange" without spelling out why in the previous message. This certainly is subjective and depends on your cultural background, but in our codebase, I tried to name functions after "what" they do and "why", rather than "how" they do so. In a way, it's the same kind of uneasiness I feel when I see variables named in hungarian notation. You would inspect the object and treat 'data' as a list and add to the object if it is a commit, and if_packed_commit_add_to_list() certainly is a name that describes all of that well, but does it give readers a good answer when they wonder why the function is doing so? You described with the name of the function how it collects commits that are in the pack, without explicitly saying that you want to collect packed commits and that is why you are inspecting for type and doing so only for commit (i.e. "if_packed_commit" part of the name) and why you are adding to a list.
Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
Derrick Stolee writes: > jt/binsearch-with-fanout introduces one when there is a 256-entry > fanout table (not the case here). > > The bsearch() method in search.h (and used in > pack-write.c:need_large_offset) does not return the _position_ of a > found element. > > Neither of these suit my needs, but I could just be searching for the > wrong strings. Also, I could divert my energies in this area to create > a generic search in the style of jt/binsearch-with-fanout. ... me goes and digs ... What I had in mind was the one in sha1-lookup.c, actually. Having said that, hand-rolling another one is not too huge a deal; eventually people will notice and consolidate code after the series stabilizes anyway ;-) >>> + num_large_edges++; >>> + parent = parent->next; >>> + } while (parent); >> It feels somewhat wasteful to traverse the commit's parents list >> only to count, without populating the octopus table (which I >> understand is assumed to be minority case under this design). > > Since we are writing the commit graph file in-order, we cannot write > the octopus table until after the chunk lengths are known. Oh, my "minority case" comment was me wondering "since we expect there are only a few, why not keep them in memory while we discover them here, so that the writing phase that come later do not have to go through all commits again counting their parents? would it be more performant and a better trade-off?" We can certainly do such an optimization later (iow, it is not all that crucial issue and certainly I didn't mention the above as something that needs to be "fixed"--there is nothing broken). > store the octopus table in memory and then dump it into the file > later, but walking the parents is quite fast after all the commits are > loaded. I'm not sure the time optimization merits the extra complexity > here. (I'm happy to revisit this if we do see this performance > lacking.) > > P.S. I really like the name "octopus table" and will use that for > informal discussions of this format. I actually do not mind that name used as the official term. I find it far more descriptive and understandable than "long edge" / "large edge" at least to a Git person. > You're right that int_id isn't great, and your more-specific > "oid_table_pos" shows an extra reason why it isn't great: when we add > the GRAPH_LAST_EDGE bit or set it to GRAPH_PARENT_MISSING, the value > is NOT a table position. Perhaps I am somewhat biased, but it is quite natural for our codebase and internal API to say something like this: x_pos(table, key) function's return value is the non-negative position for the key in the table when the key is there; when it returns a negative value, it is (-1 - position) where the "position" is the position in the table they key would have been found if it was in there. and store the return value from such a function in a variable called "pos". Surely, sometimes "pos" does not have _the_ position, but that does not make it a bad name. Saying "MISSING is a special value that denotes 'nothing is here'" and allowing it to be set to a variable that meant to hold the position is not such a bad thing, though. After all, that is how you use NULL as a special value for a pointer variable ;-). Same for using the high bit to mean something else. Taking these together you would explain "low 31-bit of pos holds the position for the item in the table. MISSING is a special value that you can use to denote there is nothing. The LAST_EDGE bit denotes that one group of positions ends there", or something like that. > I think the current name makes the following call very clear: It is still a strange name nevertheless. >>> +char *write_commit_graph(const char *obj_dir) >>> +{ >>> + struct packed_oid_list oids; >>> + struct packed_commit_list commits; >>> + struct sha1file *f; >>> + int i, count_distinct = 0; >>> + DIR *info_dir; >>> + struct strbuf tmp_file = STRBUF_INIT; >>> + struct strbuf graph_file = STRBUF_INIT; >>> + unsigned char final_hash[GIT_MAX_RAWSZ]; >>> + char *graph_name; >>> + int fd; >>> + uint32_t chunk_ids[5]; >>> + uint64_t chunk_offsets[5]; >>> + int num_chunks; >>> + int num_long_edges; >>> + struct commit_list *parent; >>> + >>> + oids.nr = 0; >>> + oids.alloc = (int)(0.15 * approximate_object_count()); >> Heh, traditionalist would probably avoid unnecessary use of float >> and use something like 1/4 or 1/8 ;-) After all, it is merely a >> ballpark guestimate. >> >>> + num_long_edges = 0; >> This again is about naming, but I find it a bit unnatural to call >> the edge between a chind and its octopus parents "long". Individual >> edges are not long--the only thing that is long is your "list of >> edges". Some other codepaths in this patch seems to call the same >> concept with s/long/large/, which I found somewhat puzzling. > > How about "num_extra_edges"...
Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
On 2/20/2018 5:57 PM, Junio C Hamano wrote: Derrick Stolee writes: +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 20 This hardcoded 20 on the right hand side of this #define is probably problematic. Unless you are planning to possibly store truncated hash value for some future hash algorithm, GRAPH_OID_LEN_$HASH should always be the same as GIT_$HASH_RAWSZ, I would think. IOW #define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ perhaps? Yes. +static void write_graph_chunk_fanout(struct sha1file *f, +struct commit **commits, +int nr_commits) +{ + uint32_t i, count = 0; + struct commit **list = commits; + struct commit **last = commits + nr_commits; + + /* +* Write the first-level table (the list is sorted, +* but we use a 256-entry lookup to be able to avoid +* having to do eight extra binary search iterations). +*/ + for (i = 0; i < 256; i++) { + while (list < last) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } If count and list are always incremented in unison, perhaps you do not need an extra variable "last". If typeof(nr_commits) is wider than typeof(count), this loop and the next write-be32 is screwed anyway ;-) This comment probably applies equally to some other uses of the same "compute last pointer to compare with running pointer for termination" pattern in this patch. Yes. Also turning i and count into int to match nr_commits. + sha1write_be32(f, count); + } +} +static int commit_pos(struct commit **commits, int nr_commits, + const struct object_id *oid, uint32_t *pos) +{ It is a bit unusual to see something_pos() that returns an integer that does *NOT* return the position as its return value. Dropping the *pos parameter, and returning "mid" when commits[mid] is what we wanted to see, and otherwise returning "-1 - first" to signal the position at which we _would_ have found the object, if it were in the table, would make it more consistent with the usual convention. I can make this change. I found the boolean return to make the consumer's logic simpler, but it isn't that much simpler. Don't we even have such a generalized binary search helper already somewhere in the system? jt/binsearch-with-fanout introduces one when there is a 256-entry fanout table (not the case here). The bsearch() method in search.h (and used in pack-write.c:need_large_offset) does not return the _position_ of a found element. Neither of these suit my needs, but I could just be searching for the wrong strings. Also, I could divert my energies in this area to create a generic search in the style of jt/binsearch-with-fanout. +static void write_graph_chunk_data(struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_large_edges = 0; + + while (list < last) { + struct commit_list *parent; + uint32_t int_id; + uint32_t packedDate[2]; + +... + if (!parent) + int_id = GRAPH_PARENT_NONE; + else if (parent->next) + int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges; + else if (!commit_pos(commits, nr_commits, + &(parent->item->object.oid), &int_id)) + int_id = GRAPH_PARENT_MISSING; + + sha1write_be32(f, int_id); + + if (parent && parent->next) { This is equivalent to checking "int_id & GRAPH_LARGE_EDGES_NEEDED", right? Not suggesting to use the other form of checks, but trying to see what's the best way to express it in the most readable way. You're right, we already set the bit above, so let's make use of that check. Important to note that GRAPH_LARGE_EDGES_NEEDED & GRAPH_PARENT_MISSING == 0. + do { + num_large_edges++; + parent = parent->next; + } while (parent); It feels somewhat wasteful to traverse the commit's parents list only to count, without populating the octopus table (which I understand is assumed to be minority case under this design). Since we are writing the commit graph file in-order, we cannot write the octopus table until after the chunk lengths are known. We could store the octopus table in memory and then dump it into the file later, but walking the parents is quite fast after all the commits are loaded. I'm not sure the time optimization merits the extra complexity here. (I'm happy to revisit this if we do see this performance lac
Re: [PATCH v4 04/13] commit-graph: implement write_commit_graph()
Derrick Stolee writes: > +#define GRAPH_OID_VERSION_SHA1 1 > +#define GRAPH_OID_LEN_SHA1 20 This hardcoded 20 on the right hand side of this #define is probably problematic. Unless you are planning to possibly store truncated hash value for some future hash algorithm, GRAPH_OID_LEN_$HASH should always be the same as GIT_$HASH_RAWSZ, I would think. IOW #define GRAPH_OID_LEN_SHA1 GIT_SHA1_RAWSZ perhaps? > +static void write_graph_chunk_fanout(struct sha1file *f, > + struct commit **commits, > + int nr_commits) > +{ > + uint32_t i, count = 0; > + struct commit **list = commits; > + struct commit **last = commits + nr_commits; > + > + /* > + * Write the first-level table (the list is sorted, > + * but we use a 256-entry lookup to be able to avoid > + * having to do eight extra binary search iterations). > + */ > + for (i = 0; i < 256; i++) { > + while (list < last) { > + if ((*list)->object.oid.hash[0] != i) > + break; > + count++; > + list++; > + } If count and list are always incremented in unison, perhaps you do not need an extra variable "last". If typeof(nr_commits) is wider than typeof(count), this loop and the next write-be32 is screwed anyway ;-) This comment probably applies equally to some other uses of the same "compute last pointer to compare with running pointer for termination" pattern in this patch. > + sha1write_be32(f, count); > + } > +} > +static int commit_pos(struct commit **commits, int nr_commits, > + const struct object_id *oid, uint32_t *pos) > +{ It is a bit unusual to see something_pos() that returns an integer that does *NOT* return the position as its return value. Dropping the *pos parameter, and returning "mid" when commits[mid] is what we wanted to see, and otherwise returning "-1 - first" to signal the position at which we _would_ have found the object, if it were in the table, would make it more consistent with the usual convention. Don't we even have such a generalized binary search helper already somewhere in the system? > +static void write_graph_chunk_data(struct sha1file *f, int hash_len, > +struct commit **commits, int nr_commits) > +{ > + struct commit **list = commits; > + struct commit **last = commits + nr_commits; > + uint32_t num_large_edges = 0; > + > + while (list < last) { > + struct commit_list *parent; > + uint32_t int_id; > + uint32_t packedDate[2]; > + > +... > + if (!parent) > + int_id = GRAPH_PARENT_NONE; > + else if (parent->next) > + int_id = GRAPH_LARGE_EDGES_NEEDED | num_large_edges; > + else if (!commit_pos(commits, nr_commits, > + &(parent->item->object.oid), &int_id)) > + int_id = GRAPH_PARENT_MISSING; > + > + sha1write_be32(f, int_id); > + > + if (parent && parent->next) { This is equivalent to checking "int_id & GRAPH_LARGE_EDGES_NEEDED", right? Not suggesting to use the other form of checks, but trying to see what's the best way to express it in the most readable way. > + do { > + num_large_edges++; > + parent = parent->next; > + } while (parent); It feels somewhat wasteful to traverse the commit's parents list only to count, without populating the octopus table (which I understand is assumed to be minority case under this design). > + } > + > + if (sizeof((*list)->date) > 4) > + packedDate[0] = htonl(((*list)->date >> 32) & 0x3); > + else > + packedDate[0] = 0; OK, the undefined pattern in the previous round is now gone ;-) Good. > + packedDate[1] = htonl((*list)->date); > + sha1write(f, packedDate, 8); > + > + list++; > + } > +} > + > +static void write_graph_chunk_large_edges(struct sha1file *f, > + struct commit **commits, > + int nr_commits) > +{ > + struct commit **list = commits; > + struct commit **last = commits + nr_commits; > + struct commit_list *parent; > + > + while (list < last) { > + int num_parents = 0; > + for (parent = (*list)->parents; num_parents < 3 && parent; > + parent = parent->next) > + num_parents++; > + > + if (num_parents <= 2) { > + list++; > + continue; > + } > + > + /* Since num_parents > 2, this initializer is safe. */ > + for (parent = (*list)->parents->next; pa
[PATCH v4 04/13] commit-graph: implement write_commit_graph()
Teach Git to write a commit graph file by checking all packed objects to see if they are commits, then store the file in the given object directory. Signed-off-by: Derrick Stolee --- Makefile | 1 + commit-graph.c | 370 + commit-graph.h | 7 ++ 3 files changed, 378 insertions(+) create mode 100644 commit-graph.c create mode 100644 commit-graph.h diff --git a/Makefile b/Makefile index fc40b81..eeaeb6a 100644 --- a/Makefile +++ b/Makefile @@ -761,6 +761,7 @@ LIB_OBJS += color.o LIB_OBJS += column.o LIB_OBJS += combine-diff.o LIB_OBJS += commit.o +LIB_OBJS += commit-graph.o LIB_OBJS += compat/obstack.o LIB_OBJS += compat/terminal.o LIB_OBJS += config.o diff --git a/commit-graph.c b/commit-graph.c new file mode 100644 index 000..f9e39b0 --- /dev/null +++ b/commit-graph.c @@ -0,0 +1,370 @@ +#include "cache.h" +#include "config.h" +#include "git-compat-util.h" +#include "pack.h" +#include "packfile.h" +#include "commit.h" +#include "object.h" +#include "revision.h" +#include "sha1-lookup.h" +#include "commit-graph.h" + +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */ +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */ +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */ +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */ +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */ + +#define GRAPH_DATA_WIDTH 36 + +#define GRAPH_VERSION_1 0x1 +#define GRAPH_VERSION GRAPH_VERSION_1 + +#define GRAPH_OID_VERSION_SHA1 1 +#define GRAPH_OID_LEN_SHA1 20 +#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1 +#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1 + +#define GRAPH_LARGE_EDGES_NEEDED 0x8000 +#define GRAPH_PARENT_MISSING 0x7fff +#define GRAPH_EDGE_LAST_MASK 0x7fff +#define GRAPH_PARENT_NONE 0x7000 + +#define GRAPH_LAST_EDGE 0x8000 + +#define GRAPH_FANOUT_SIZE (4 * 256) +#define GRAPH_CHUNKLOOKUP_WIDTH 12 +#define GRAPH_CHUNKLOOKUP_SIZE (5 * GRAPH_CHUNKLOOKUP_WIDTH) +#define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \ + GRAPH_OID_LEN + 8) + +static void write_graph_chunk_fanout(struct sha1file *f, +struct commit **commits, +int nr_commits) +{ + uint32_t i, count = 0; + struct commit **list = commits; + struct commit **last = commits + nr_commits; + + /* +* Write the first-level table (the list is sorted, +* but we use a 256-entry lookup to be able to avoid +* having to do eight extra binary search iterations). +*/ + for (i = 0; i < 256; i++) { + while (list < last) { + if ((*list)->object.oid.hash[0] != i) + break; + count++; + list++; + } + + sha1write_be32(f, count); + } +} + +static void write_graph_chunk_oids(struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list, **last = commits + nr_commits; + for (list = commits; list < last; list++) + sha1write(f, (*list)->object.oid.hash, (int)hash_len); +} + +static int commit_pos(struct commit **commits, int nr_commits, + const struct object_id *oid, uint32_t *pos) +{ + uint32_t first = 0, last = nr_commits; + + while (first < last) { + uint32_t mid = first + (last - first) / 2; + struct object_id *current; + int cmp; + + current = &(commits[mid]->object.oid); + cmp = oidcmp(oid, current); + if (!cmp) { + *pos = mid; + return 1; + } + if (cmp > 0) { + first = mid + 1; + continue; + } + last = mid; + } + + *pos = first; + return 0; +} + +static void write_graph_chunk_data(struct sha1file *f, int hash_len, + struct commit **commits, int nr_commits) +{ + struct commit **list = commits; + struct commit **last = commits + nr_commits; + uint32_t num_large_edges = 0; + + while (list < last) { + struct commit_list *parent; + uint32_t int_id; + uint32_t packedDate[2]; + + parse_commit(*list); + sha1write(f, (*list)->tree->object.oid.hash, hash_len); + + parent = (*list)->parents; + + if (!parent) + int_id = GRAPH_PARENT_NONE; + else if (!commit_pos(commits, nr_commits, +&(parent->item->object.oid), &int_id)) + int_id = GRAPH_PARENT_MISSING; + + sha1write_be32(f, int_id); + + if (parent) + parent = pa