Niedriges prozentuales Kreditangebot
Sehr geehrter Herr / Frau Mein Name ist Catherine Graf von CREDIT SUISSE, Bedarfskredit oder Finanzierung fr Investitionen, Projektentwicklungsfinanzierung, Geschftserweiterung, Schuldenfinanzierung Suche ist hier nicht der richtige Ort, um Geld zu bekommen. Der Zinssatz betrgt 1%. Sollten Sie an einer unserer finanziellen Mglichkeiten interessiert sein, fllen Sie bitte das folgende Formular aus. Vor-und Nachname: Der gewnschte Kreditbetrag: Sexuell: Zeitraum: Freundliche Gre, Katharina Graf Direktor E-Mail: creditsuisselo...@outlook.com
Git Merge contributor summit notes
It was great to meet some of you in person! Some notes from the Contributor Summit at Git Merge are below. Taken in haste, so my apologies if there are any mis-statements. - Alex "Does anyone think there's a compelling reason for git to exist?" - peff Partial clone (Jeff Hostetler / Jonathan Tan) - - Request that the server not send everything - Motivated by getting Windows into git - Also by not having to fetch large blobs that are in-tree - Allows client to request a clone that excludes some set of objects, with incomplete packfiles - Decoration on objects that include promise for later on-demand backfill - In `master`, have a way of: - omitting all blobs - omitting large blobs - sparse checkout specification stored on server - Hook in read_object to fetch objects in bulk - Future work: - A way to fetch blobsizes for virtual checkouts - Give me new blobs that this tree references relative to now - Omit some subset of trees - Modify other commits to exclude omitted blobs - Protocol v2 may have better verbs for sparse specification, etc Questions: - Reference server implementation? - In git itself - VSTS does not support - What happens if a commit becomes unreachable? Does promise still apply? - Probably yes? - If the promise is broken, probably crashes - Can differentiate between promise that was made, and one that wasn't => Demanding commitment from server to never GC seems like a strong promise - Interactions with external object db - promises include bulk fetches, as opposed to external db, which is one-at-a-time - dry-run semantics to determine which objects will be needed - very important for small objects, like commits/trees (which is not in `master`, only blobs) - perhaps for protocol V2 - server has to promise more, requires some level of online operation - annotate that only some refs are forever? - requires enabling the "fetch any SHA" flags - rebasing might require now-missing objects? - No, to build on them you must have fetched them - Well, building on someone else's work may mean you don't have all of them - server is less aggressive about GC'ing by keeping "weak references" when there are promises? - hosting requires that you be able to forcibly remove information - being able to know where a reference came from? - as being able to know why an object was needed, for more advanced logic - Does `git grep` attempt to fetch blobs that are deferred? - will always attempt to fetch - one fetch per object, even! - might not be true for sparse checkouts - Maybe limit to skipping "binary files"? - Currently sparse checkout grep "works" because grep defaults to looking at the index, not the commit - Does the above behavior for grepping revisions - Don't yet have a flag to exclude grep on non-fetched objects - Should `git grep -L` die if it can't fetch the file? - Need a config option for "should we die, or try to move on"? - What's the endgame? Only a few codepaths that are aware, or threaded through everywhere? - Fallback to fetch on demand means there's an almost-reasonable fallback - Better prediction with bulk fetching - Are most commands going to _need_ to be sensitive to it? - GVFS has a caching server in the building - A few git commands have been disabled (see recent mail from Stolee); those are likely candidates for code that needs to be aware of de-hydrated objects - Is there an API to know what objects are actually local? - No external API - GVFS has a REST API - Some way to later ask about files? - "virtualized filesystem"? - hook to say "focus on this world of files" - GVFS writes out your index currently - Will this always require turning off reachability checks? - Possibly - Shallow clones, instead of partial? - Don't download the history, just the objects - More of a protocol V2 property - Having all of the trees/commits make this reasonable - GVFS vs this? - GVFS was a first pass - Now trying to mainstream productize that - Goal is to remove features from GVFS and replace with this Protocol V2 (Brandon) - Main problem is that forward compatibility negotiation wasn't possible - Found a way to sneak in the V2 negotiation via side-channel in all transports - "environment variable" GIT_PROTOCOL which server can detect - Ability to transmit and ignore, or not transmit, means forward/backward compat - HTTP header / environment variable - ...s now what? - Keep as similar as possible, but more layout changes to remove bad characteristics - Like fixing flush semantics - Remove ref advertisement (250M of refs every fetch from Android!) - Capabilities are currently in first packet, 1K limit - First response is capabilities from the server,
Re: [PATCH/RFC v3 07/12] pack-objects: move in_pack out of struct object_entry
Nguyễn Thái Ngọc Duywrites: > Instead of using 8 bytes (on 64 bit arch) to store a pointer to a > pack. Use an index isntead since the number of packs should be > relatively small. > > This limits the number of packs we can handle to 16k. For now if you hit > 16k pack files limit, pack-objects will simply fail [1]. > > This technically saves 7 bytes. But we don't see any of that in > practice due to padding. The saving becomes real when we pack this > struct tighter later. Somehow 7 and 16k do not add up. We use 8 bytes in the original code, and a solution that potentially saves 7 bytes would use only 1 byte instead of the original 8, which would allow us to index/identify 1<<8 == 256 packs, but for some reason we can handle up to 16k. > [1] The escape hatch is .keep file to limit the non-kept pack files > below 16k limit. Then you can go for another pack-objects run to > combine another 16k pack files. Repeat until you're satisfied. ;-) > +static inline unsigned int oe_add_pack(struct packing_data *pack, > +struct packed_git *p) > +{ > + if (pack->in_pack_count >= (1 << OE_IN_PACK_BITS)) > + die(_("too many packs to handle in one go. " > + "Please add .keep files to exclude\n" > + "some pack files and keep the number " > + "of non-kept files below %d."), > + 1 << OE_IN_PACK_BITS); OK.
Re: [PATCH/RFC v3 04/12] pack-objects: use bitfield for object_entry::depth
Nguyễn Thái Ngọc Duywrites: > + Maximum value is 4095. > ... > + if (depth > (1 << OE_DEPTH_BITS)) > + die(_("delta chain depth %d is greater than maximum limit %d"), > + depth, (1 << OE_DEPTH_BITS)); > + Do I see an off-by-one here? Ie. if ((1 << OE_DEPATH_BITS) <= depth) die("depth that is 4096 or deeper is too much");
Re: Is context/directory sensitive config possible or even a god idea?
Eddy Petrișorwrites: > When I saw he include.path feature ... Do you mean includeIf..path?
Re: [PATCH/RFC v3 02/12] pack-objects: turn type and in_pack_type to bitfields
Nguyễn Thái Ngọc Duywrites: > @@ -1570,7 +1576,7 @@ static void drop_reused_delta(struct object_entry > *entry) > entry->depth = 0; > > oi.sizep = >size; > - oi.typep = >type; > + oi.typep = > if (packed_object_info(entry->in_pack, entry->in_pack_offset, ) < 0) > { > /* >* We failed to get the info from this pack for some reason; > @@ -1580,6 +1586,10 @@ static void drop_reused_delta(struct object_entry > *entry) >*/ > entry->type = sha1_object_info(entry->idx.oid.hash, > >size); The comment immediately before this pre-context reads as such: /* * We failed to get the info from this pack for some reason; * fall back to sha1_object_info, which may find another copy. * And if that fails, the error will be recorded in entry->type * and dealt with in prepare_pack(). */ The rest of the code relies on the ability of entry->type to record the error by storing an invalid (negative) type; otherwise, it cannot detect an error where (1) the entry in _this_ pack was corrupt, and (2) we wished to find another copy of the object elsewhere (which would overwrite the negative entry->type we assign here), but we didn't find any. How should we propagate the error we found here down the control flow in this new code? > + } else { > + if (type < 0) > + die("BUG: invalid type %d", type); > + entry->type = type; The BUG() on this side is sensible, as packed_object_info() shouldn't report success when it stored negative result in *oi.typep anyway. > unsigned char in_pack_header_size; > + unsigned type:TYPE_BITS; > + unsigned in_pack_type:TYPE_BITS; /* could be delta */
Re: [PATCH/RFC v3 01/12] pack-objects: a bit of document about struct object_entry
Nguyễn Thái Ngọc Duywrites: > The role of this comment block becomes more important after we shuffle > fields around to shrink this struct. It will be much harder to see what > field is related to what. This also documents the holes in this struct > according to pahole. > > A couple of notes on shrinking the struct: > > 1) The reader may notice one thing from this document and the shrinking > business. If "delta" is NULL, all other delta-related fields should be > irrelevant. We could group all these in a separate struct and replace > them all with a pointer to this struct (allocated separately). > > This does not help much though since 85% of objects are deltified > (source: linux-2.6.git). The gain is only from non-delta objects, which > is not that significant. OK. > 2) The field in_pack_offset and idx.offset could be merged. But we need > to be very careful. Up until the very last phase (object writing), > idx.offset is not used and can hold in_pack_offset. Then idx.offset will > be updated with _destination pack's_ offset, not source's. But since we > always write delta's bases first, and we only use in_pack_offset in > writing phase when we reuse objects, we should be ok? By separating the processing in strict phases, I do think the result would be OK, but at the same time, that does smell like an invitation for future bugs. > +/* > + * basic object info > + * - > + * idx.oid is filled up before delta searching starts. idx.crc32 and > + * is only valid after the object is written down and will be used for > + * generating the index. idx.offset will be both gradually set and > + * used in writing phase (base objects get offset first, then deltas > + * refer to them) Here, I'd feel that "written out" somehow would sound more natural than "written down", but that is perhaps because I've seen it used elsewhere and I am confusing familiarlity with naturalness. In any case, if we mean "written to the resulting packdata stream", saying that to be more explicit is probably a good idea. We compute crc32 and learn the offset for each object as we write them to the result. > + * If a delta is cached in memory and is compressed, "delta" points to > + * the data and z_delta_size contains the compressed size. If it's Isn't it "delta_data" (aot "delta") that points at the cached delta data? > + * uncompressed [1], z_delta_size must be zero. delta_size is always > + * the uncompressed size and must be valid even if the delta is not > + * cached. Delta recreation technically only depends on "delta" > + * pointer, but delta_size is still used to verify it's the same as > + * before. > + * > + * [1] during try_delta phase we don't bother with compressing because > + * the delta could be quickly replaced with a better one. > + */ > struct object_entry { > struct pack_idx_entry idx; > unsigned long size; /* uncompressed size */ > @@ -28,6 +74,7 @@ struct object_entry { > unsigned tagged:1; /* near the very tip of refs */ > unsigned filled:1; /* assigned write-order */ > > + /* XXX 28 bits hole, try to pack */ > /* >* State flags for depth-first search used for analyzing delta cycles. >* > @@ -40,6 +87,7 @@ struct object_entry { > DFS_DONE > } dfs_state; > int depth; > + /* size: 136, padding: 4 */ > }; > > struct packing_data {
Is context/directory sensitive config possible or even a god idea?
Hello, In some situations one person could be involved in various multi-repository projects and need to use different user.email information, depending on the project. A simple example can be using the project specific email identity for all repositories part of Android, while using the personal email for personal projects on github. The developer has currently only 2 options, AFAICT: a) use --global identity for the personal email and set the Android identity in all project repos b) use the reverse, store Android identity in --global gitconfig, and make sure to set the identity at repo level for personal projects. Both solutions are suboptimal, especially if both the project in question has many repositories and the number github projects is high, because the user must remember to set the non-global user identity either in all github repos or in each Android repos, after checkout. In my view, there should be a middle ground, having a context/directory dependent setting could be a solution. My idea is somehting like "for all repos cloned in subdirs of directory ~/src/android, use identity johm@android.org", so the only thing to remember to make this work is to make sure all android checkouts are done under ~/src/android, which most people already do because they need a clean home dir. When I saw he include.path feature I thought this was exactly that, but when testing it did not work as I would have liked, the inluded config is simply added to the global config, if I add this to the ~/.gitconfig file and the target file exists. I understand the use case I am thinking of is not the one that was addressed by the include.file feature, so I am thinking git would need some other setting like 'contextsensitive.include=/home/johndoe/src/android/.gitconfig' which could mean include the settings from this config is we're in a dir on the same level a the file, or a directory which has that dir as parent or ancestor. What I see already: 1) it would be a perfomance hit if the test for "do we need to apply any context sensitive setting" must be done on every git command; for the identity usecase might be less of a problem if only the subcomands that need user.* would compare the prefix of the current dir with all all contextsensitive.inludes settings and applying only the longest one 2) what if the contextsensitive.inluclude file also includes another contextsensitive.include, could that lead to ambiguous situations? 3) having the feature could allow the user to set other project specific settings. Nice usecases: if Android uses --no-ff merge.ff=false in ~/src/andoid/.gitconfig contextsensitive.inlclude would ensure the expected strategy is used, and rebase is the preferred behaviour for pull, instead of merge) -- Eddy Petrișor
Re: git stash push -u always warns "pathspec '...' did not match any files"
Marc Strapetzwrites: > Thanks, I can confirm that the misleading warning message is fixed. > > What I've noticed now is that when using -u option, Git won't warn if > the pathspec is actually not matching a file. Also, an empty stash may > be created. S..., does it mean that the patch Thomas posted and you confirmed trades one issue with another issue with a similar graveness? Is this something we want to "fix" without adding another breakage?
Re: [PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'
On Fri, Mar 9, 2018 at 3:22 PM, Junio C Hamanowrote: > Eric Sunshine writes: >> On Fri, Mar 9, 2018 at 12:35 PM, wrote: >>> /* Don't encode to the default encoding */ >>> - if (!strcasecmp(value, default_encoding)) >>> + if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding)) >>> return NULL; >> >> This change belongs in 6/10, not 10/10, methinks. > > It is actually worse than that, no? When default_encoding is > (somehow) configured not to be UTF-8, e.g. "Shift_JIS", we used to > avoid converting from Shift_JIS to Shift_JIS, but the optimization > no longer happens with this code. > > In any case, I think same_encoding() is probably a good thing to use > here at step 6/10, so the point is moot, I guess. Agreed, same_encoding() would be superior.
Re: [PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'
Eric Sunshinewrites: > On Fri, Mar 9, 2018 at 12:35 PM, wrote: >> [...] >> Add 'core.checkRoundtripEncoding', which contains a comma separated >> list of encodings, to define for what encodings Git should check the >> conversion round trip if they are used in the 'working-tree-encoding' >> attribute. >> [...] >> Signed-off-by: Lars Schneider >> --- >> diff --git a/convert.c b/convert.c >> @@ -1150,7 +1227,7 @@ static const char *git_path_check_encoding(struct >> attr_check_item *check) >> /* Don't encode to the default encoding */ >> - if (!strcasecmp(value, default_encoding)) >> + if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding)) >> return NULL; > > This change belongs in 6/10, not 10/10, methinks. It is actually worse than that, no? When default_encoding is (somehow) configured not to be UTF-8, e.g. "Shift_JIS", we used to avoid converting from Shift_JIS to Shift_JIS, but the optimization no longer happens with this code. In any case, I think same_encoding() is probably a good thing to use here at step 6/10, so the point is moot, I guess.
Re: [PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'
On Fri, Mar 9, 2018 at 12:35 PM,wrote: > [...] > Add 'core.checkRoundtripEncoding', which contains a comma separated > list of encodings, to define for what encodings Git should check the > conversion round trip if they are used in the 'working-tree-encoding' > attribute. > [...] > Signed-off-by: Lars Schneider > --- > diff --git a/convert.c b/convert.c > @@ -1150,7 +1227,7 @@ static const char *git_path_check_encoding(struct > attr_check_item *check) > /* Don't encode to the default encoding */ > - if (!strcasecmp(value, default_encoding)) > + if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding)) > return NULL; This change belongs in 6/10, not 10/10, methinks.
Re: [PATCH v3] fetch-pack.c: use oidset to check existence of loose object
Takuto Ikutawrites: > In repository having large number of remote refs, because to check Isn't this "When fetching from a repository with large number of refs,"? The number of refs (whether it is local or remote-tracking) the local side has has nothing to do with the issue you are addressing, no? > existence of each refs in local repository, 'git fetch' ends up doing a > lot of lstat(2) calls to see if it exists in loose form, which makes it > slow. Other than that, the above description reads much better and makes the result easier to understand. > This patch enumerates loose objects in hashmap beforehand and uses it to > check existence instead of using lstat(2) to improve performance of > fetch-pack for repositories having large number of remote refs compared > to the number of loose objects. We'd rather write this paragraph as if giving an order to the codebase "to be like so", e.g. Instead of making as many lstat(2) calls as the refs the remote side advertised to see if these objects exist in the loose form, first enumerate all the existing loose objects in hashmap beforehand and use it to check existence of them... > I took time stat of `git fetch` disabling quickfetch, so that fetch-pack I still do not know if a benchmark with quickfetch disabled gives relevant numbers, for reasons I gave earlier. The relative numbers between Linux and MacBook look quite convincing, as they illustrate differences of lstat(2) performance on these platforms. > for (ref = *refs; ref; ref = ref->next) { > struct object *o; > + unsigned int flags = OBJECT_INFO_QUICK; > > - if (!has_object_file_with_flags(>old_oid, > - OBJECT_INFO_QUICK)) > - continue; > + if (!oidset_contains(_oid_set, >old_oid)) { > + /* I know this does not exist in the loose form, > + * so check if it exists in a non-loose form. > + */ /* * Our multi-line comment looks like this, * with opening slash-asterisk and closing * asterisk-slash on their own lines. */ Thanks.
Re: [PATCH] fetch-pack.c: use oidset to check existence of loose object
Junio C Hamanowrites: >> Yes. But I think the default limit for the number of loose objects, 7000, >> gives us small overhead when we do enumeration of all objects. > > Hmph, I didn't see the code that does the estimation of loose object > count before starting to enumerate, though. Another thing the code could do to avoid negative consequences on projects that look quite different from yours (e.g. the other side does not have insane number of refs, but there are locally quite a few loose objects) is to count how many entries are on *refs list before we decide to enumerate all loose objects. When the refs list is relatively shorter than the estimated number of loose objects (you can actually do the estimation based on sampling, or just rely on your assumed 7k), it may be a win _not_ to trigger the new code you are adding to this codepath with this patch. I would imagine that the simplest implementaion may just count for (ref = *refs, count = 0; ref && count++ < LIMIT; ref = ref->next) ; use_oidset_optim = (LIMIT <= count); assuming your "up to 7k loose objects" and then experimenting to determine the LIMIT which is a rough number of refs that makes the oidset optimization worthwhile. We need a bit better/descriptive name for the LIMIT, if we go that route, though. Thanks.
Re: [PATCH v11 08/10] convert: advise canonical UTF encoding names
lars.schnei...@autodesk.com writes: > From: Lars Schneider> > The canonical name of an UTF encoding has the format UTF, dash, number, > and an optionally byte order in upper case (e.g. UTF-8 or UTF-16BE). > Some iconv versions support alternative names without a dash or with > lower case characters. > > To avoid problems between different iconv version always suggest the > canonical UTF names in advise messages. > > Signed-off-by: Lars Schneider > --- I think it is probably better to squash this to earlier step, i.e. jumping straight to the endgame solution. > diff --git a/convert.c b/convert.c > index b80d666a6b..9a3ae7cce1 100644 > --- a/convert.c > +++ b/convert.c > @@ -279,12 +279,20 @@ static int validate_encoding(const char *path, const > char *enc, > "BOM is prohibited in '%s' if encoded as %s"); > /* >* This advice is shown for UTF-??BE and UTF-??LE > encodings. > + * We cut off the last two characters of the encoding > name > + # to generate the encoding name suitable for BOMs. >*/ I somehow thought that I saw "s/#/*/" in somebody's response during the previous round? > const char *advise_msg = _( > "The file '%s' contains a byte order " > - "mark (BOM). Please use %.6s as " > + "mark (BOM). Please use UTF-%s as " > "working-tree-encoding."); > - advise(advise_msg, path, enc); > + const char *stripped = ""; > + char *upper = xstrdup_toupper(enc); > + upper[strlen(upper)-2] = '\0'; > + if (!skip_prefix(upper, "UTF-", )) > + skip_prefix(stripped, "UTF", ); > + advise(advise_msg, path, stripped); > + free(upper); If this codepath is ever entered with "enc" that does not begin with "UTF" (e.g. "Shift_JIS", which is impossible in the current code, but I'll talk about future-proofing here), then neither of these skip_prefix will trigger, and then you'd end up suggesting to use "UTF-" that is nonsense. Perhaps initialize stripped to NULL and force advise to segv to catch such a programmer error?
Re: [PATCH v11 07/10] convert: check for detectable errors in UTF encodings
lars.schnei...@autodesk.com writes: > + const char *advise_msg = _( > + "The file '%s' contains a byte order " > + "mark (BOM). Please use %.6s as " > + "working-tree-encoding."); I know that this will go away in a later step, but why ".6"? > + advise(advise_msg, path, enc);
Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
lars.schnei...@autodesk.com writes: > +static const char *default_encoding = "UTF-8"; > + > ... > +static const char *git_path_check_encoding(struct attr_check_item *check) > +{ > + const char *value = check->value; > + > + if (ATTR_UNSET(value) || !strlen(value)) > + return NULL; > + > + if (ATTR_TRUE(value) || ATTR_FALSE(value)) { > + error(_("working-tree-encoding attribute requires a value")); > + return NULL; > + } Hmph, so we decide to be loud but otherwise ignore an undefined configuration? Shouldn't we rather die instead to avoid touching the user data in unexpected ways? > + > + /* Don't encode to the default encoding */ > + if (!strcasecmp(value, default_encoding)) > + return NULL; Is this an optimization to avoid "recode one encoding to the same encoding" no-op overhead? We already have the optimization in the same spirit in may existing codepaths that has nothing to do with w-t-e, and I think we should share the code. Two pieces of thought comes to mind. One is a lot smaller in scale: Is same_encoding() sufficient for this callsite instead of strcasecmp()? The other one is a lot bigger: Looking at all the existing callers of same_encoding() that call reencode_string() when it returns false, would it make sense to drop same_encoding() and move the optimization to reencode_string() instead? I suspect that the answer to the smaller one is "yes, and even if not, it should be easy to enhance/extend same_encoding() to make it do what we want it to, and such a change will benefit even existing callers." The answer to the larger one is likely "the optimization is not about skipping only reencode_string() call but other things are subtly different among callers of same_encoding(), so such a refactoring would not be all that useful." The above still holds for the code after 10/10 touches this part.
Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
On Fri, Mar 09 2018, Junio C. Hamano jotted: > Lars Schneiderwrites: > >> I think following the boost lib makes most sense. Therefore, >> I would like to go with "istarts_with". OK with you? > > I don't care too deeply; if we took starts_with() from there, where > what we now want is defined as istarts_with(), then that sounds like > a good thing to do. I don't care either, but just a note that we had this exact discussion around this time last year when I added a starts_with_icase() [1][2] which eventually got dropped. 1. https://public-inbox.org/git/xmqqpohao2hw@gitster.mtv.corp.google.com/ 2. https://public-inbox.org/git/20170326121654.22035-4-ava...@gmail.com/
Re: [PATCH v11 07/10] convert: check for detectable errors in UTF encodings
> On 09 Mar 2018, at 20:00, Junio C Hamanowrote: > > lars.schnei...@autodesk.com writes: > >> +const char *advise_msg = _( >> +"The file '%s' contains a byte order " >> +"mark (BOM). Please use %.6s as " >> +"working-tree-encoding."); > > I know that this will go away in a later step, but why ".6"? I deleted the original comment in the rebase, sorry: /* * This advice is shown for UTF-??BE and UTF-??LE * encodings. We truncate the encoding name to 6 * chars with %.6s to cut off the last two "byte * order" characters. */ - Lars
Re: [PATCH v11 07/10] convert: check for detectable errors in UTF encodings
lars.schnei...@autodesk.com writes: > + const char *advise_msg = _( > + "The file '%s' contains a byte order " > + "mark (BOM). Please use %.6s as " > + "working-tree-encoding."); I know that this will go away in a later step, but why ".6"? > + advise(advise_msg, path, enc);
Re: [PATCH] fetch-pack.c: use oidset to check existence of loose object
Takuto Ikutawrites: > Yes, I just wanted to say 'git fetch' invokes fetch-pack. > fetch-pack is skipped when running git fetch repeatedly while > remote has no update by quickfetch. So I disabled it to see the > performance of fetch-pack. In chromium repository, master branch > is updated several times in an hour, so git fetch invokes fetch-pack > in such frequency. I do understand that if you run "git fetch" against the same place in quick succession three times, the first one may cost a lot (not just that it has to do the everything_local() computation that you care about, it also has to actually do the network transfer), while the second and third ones that are done before anything new happens on the other end will not involve everything_local() overhead thanks to quickfetch. A "fetch" that is run against a remote that has nothing new, but still triggers everything_local() only because quickfetch is disabled, is an artificial case that has no relevance to the real world, I suspect, because the quickfetch optimization is to solve the "there is nothing to be done, still do_fetch_pack() spends so much cycles only to realize that it does not have anything to do, why?" issue. Isn't running the "git fetch" command with the "--dry-run" option many times in quick succession a lot closer to what you really want to measure, I wonder? That way, your first fetch won't be touching the state of the local side to affect your second and subsequent fetches. >> In any case, do_fetch_pack() tries to see if all of the tip commits >> we are going to fetch exist locally, so when you are trying a fetch >> that grabs huge number of refs (by the way, it means that the first >> sentence of the proposed log message is not quite true---it is "When >> fetching a large number of refs", as it does not matter how many >> refs _we_ have, no?), everything_local() ends up making repeated >> calls to has_object_file_with_flags() to all of the refs. > > I fixed description by changing 'refs' to 'remote refs'. In my understanding, > git tries to check existence of remote refs even if we won't fetch such refs. During fetch, everything_local() tries to mark common part by walking the refs the other side advertised upon the first contact, so it is correct that the number of checks is not reduced in a fetch that does not fetch many refs, but the number of remote-tracking refs you have has no effect, so I doubt such a rephrasing would make the description more understandable. "When fetching from a repository with large number of refs" is probably what you want to say, no? > Yes. But I think the default limit for the number of loose objects, 7000, > gives us small overhead when we do enumeration of all objects. Hmph, I didn't see the code that does the estimation of loose object count before starting to enumerate, though. >> Hmm, OBJECT_INFO_QUICK optimization was added in dfdd4afc >> ("sha1_file: teach sha1_object_info_extended more flags", >> 2017-06-21), but since 8b4c0103 ("sha1_file: support lazily fetching >> missing objects", 2017-12-08) it appears that passing >> OBJECT_INFO_QUICK down the codepath does not do anything >> interesting. Jonathan (cc'ed), are all remaining hits from "git >> grep OBJECT_INFO_QUICK" all dead no-ops these days? > > Yes the flag is no-op now, but let me untouched the flag in this patch. Yeah, I do not want you to be touching that in this change. It is an independent/orthogonal clean-up. Thanks.
Re: [PATCH] Support long format for log-based submodule diff
Stefan Bellerwrites: >> $ git diff --submodule=log --submodule-log-detail=(long|short) >> >> I'm not sure what makes sense here. I welcome thoughts/discussion and >> will provide follow-up patches. > > The case of merges is usually configured with --[no-]merges, or > --min-parents=. But that is a knob that controls an irrelevant aspect of the detail in the context of this discussion, isn't it? This code is about "to what degree the things that happened between two submodule commits in an adjacent pair of commits in the superproject are summarized?" and the current one unilaterally decides that something similar to what you would see in the output from "log --oneline --first-parent --left-right" is sufficient, which is a position to heavily favour projects whose histories are very clean by either being: (1) totally linear, each individual commit appearing on the first-parent chain; or (2) totally topic-branch based, everything appearing as merges of a topic branch to the trunk The hack Robert illustrates below is to change it to stop favouring such projects with "clean" histories, and show "log --oneline --no-merges --left-right". When presented that way, clean histories of topic-branch based projects will suffer by losing conciseness, but clean histories of totally linear projects will still be shown the same way, and messy history that sometimes merges, sometimes merges mergy histories, and sometimes directly builds on the trunk will be shown as an enumeration of individual commits in a flat way by ignoring merges and not restricting the traversal to the first parent chains, which would appear more uniform than what the current code shows. I do not see a point in introducing --min/max-parents as a knob to control how the history is summarized. This is a strongly related tangent, but I wonder if we can and/or want to share more code with the codepath that prepares the log message for a merge. It summarizes what happened on the side branch since it forked from the history it is joining back to (I think it is merge.c::shortlog() that computes this) and it is quite similar to what Robert wants to use for submodules here. On the other hand, in a project _without_ submodule, if you are pulling history made by your lieutenant whose history is full of linear merges of topic branches to the mainline, it may not be a bad idea to allow fmt-merge-msg to alternatively show something similar to the "diff --submodule=log" gives us, i.e. summarize the history of the side branch being merged by just listing the commits on the first-parent chain. So I sense some opportunity for cross pollination here.
[PATCH v11 04/10] utf8: add function to detect prohibited UTF-16/32 BOM
From: Lars SchneiderWhenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used [1]. The function returns true if this is the case. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#bom10 Signed-off-by: Lars Schneider --- utf8.c | 26 ++ utf8.h | 9 + 2 files changed, 35 insertions(+) diff --git a/utf8.c b/utf8.c index 2c27ce0137..e4b99580f0 100644 --- a/utf8.c +++ b/utf8.c @@ -538,6 +538,32 @@ char *reencode_string_len(const char *in, int insz, } #endif +static int has_bom_prefix(const char *data, size_t len, + const char *bom, size_t bom_len) +{ + return (len >= bom_len) && !memcmp(data, bom, bom_len); +} + +static const char utf16_be_bom[] = {0xFE, 0xFF}; +static const char utf16_le_bom[] = {0xFF, 0xFE}; +static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF}; +static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00}; + +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len) +{ + return ( + (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE") || + !strcasecmp(enc, "UTF16BE") || !strcasecmp(enc, "UTF16LE")) && + (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) || + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom))) + ) || ( + (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE") || + !strcasecmp(enc, "UTF32BE") || !strcasecmp(enc, "UTF32LE")) && + (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) || + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom))) + ); +} + /* * Returns first character length in bytes for multi-byte `text` according to * `encoding`. diff --git a/utf8.h b/utf8.h index 6bbcf31a83..0db1db4519 100644 --- a/utf8.h +++ b/utf8.h @@ -70,4 +70,13 @@ typedef enum { void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int width, const char *s); +/* + * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16 + * BOM must not be used [1]. The same applies for the UTF-32 equivalents. + * The function returns true if this rule is violated. + * + * [1] http://unicode.org/faq/utf_bom.html#bom10 + */ +int has_prohibited_utf_bom(const char *enc, const char *data, size_t len); + #endif -- 2.16.2
[PATCH v11 07/10] convert: check for detectable errors in UTF encodings
From: Lars SchneiderCheck that new content is valid with respect to the user defined 'working-tree-encoding' attribute. Signed-off-by: Lars Schneider --- convert.c| 48 ++ t/t0028-working-tree-encoding.sh | 56 2 files changed, 104 insertions(+) diff --git a/convert.c b/convert.c index aa59ecfe49..b80d666a6b 100644 --- a/convert.c +++ b/convert.c @@ -266,6 +266,51 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, } +static int validate_encoding(const char *path, const char *enc, + const char *data, size_t len, int die_on_error) +{ + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ + if (istarts_with(enc, "UTF")) { + /* +* Check for detectable errors in UTF encodings +*/ + if (has_prohibited_utf_bom(enc, data, len)) { + const char *error_msg = _( + "BOM is prohibited in '%s' if encoded as %s"); + /* +* This advice is shown for UTF-??BE and UTF-??LE encodings. +*/ + const char *advise_msg = _( + "The file '%s' contains a byte order " + "mark (BOM). Please use %.6s as " + "working-tree-encoding."); + advise(advise_msg, path, enc); + if (die_on_error) + die(error_msg, path, enc); + else { + return error(error_msg, path, enc); + } + + } else if (is_missing_required_utf_bom(enc, data, len)) { + const char *error_msg = _( + "BOM is required in '%s' if encoded as %s"); + const char *advise_msg = _( + "The file '%s' is missing a byte order " + "mark (BOM). Please use %sBE or %sLE " + "(depending on the byte order) as " + "working-tree-encoding."); + advise(advise_msg, path, enc, enc); + if (die_on_error) + die(error_msg, path, enc); + else { + return error(error_msg, path, enc); + } + } + + } + return 0; +} + static const char *default_encoding = "UTF-8"; static int encode_to_git(const char *path, const char *src, size_t src_len, @@ -291,6 +336,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len, if (!buf && !src) return 1; + if (validate_encoding(path, enc, src, src_len, die_on_error)) + return 0; + dst = reencode_string_len(src, src_len, default_encoding, enc, _len); if (!dst) { diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh index e492945a01..e8408dfe5c 100755 --- a/t/t0028-working-tree-encoding.sh +++ b/t/t0028-working-tree-encoding.sh @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' ' for i in 16 32 do + test_expect_success "check prohibited UTF-${i} BOM" ' + test_when_finished "git reset --hard HEAD" && + + echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes && + echo "*.utf${i}le text working-tree-encoding=utf-${i}LE" >>.gitattributes && + + # Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian) + # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32). + # In these cases the BOM is prohibited. + cp bebom.utf${i}be.raw bebom.utf${i}be && + test_must_fail git add bebom.utf${i}be 2>err.out && + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out && + + cp lebom.utf${i}le.raw lebom.utf${i}be && + test_must_fail git add lebom.utf${i}be 2>err.out && + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out && + + cp bebom.utf${i}be.raw bebom.utf${i}le && + test_must_fail git add bebom.utf${i}le 2>err.out && + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out && + + cp lebom.utf${i}le.raw lebom.utf${i}le && + test_must_fail git add lebom.utf${i}le 2>err.out && + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out + ' + + test_expect_success "check required UTF-${i} BOM" ' + test_when_finished "git
[PATCH v11 09/10] convert: add tracing for 'working-tree-encoding' attribute
From: Lars SchneiderAdd the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable tracing for content that is reencoded with the 'working-tree-encoding' attribute. This is useful to debug encoding issues. Signed-off-by: Lars Schneider --- convert.c| 25 + t/t0028-working-tree-encoding.sh | 2 ++ 2 files changed, 27 insertions(+) diff --git a/convert.c b/convert.c index 9a3ae7cce1..d739078016 100644 --- a/convert.c +++ b/convert.c @@ -324,6 +324,29 @@ static int validate_encoding(const char *path, const char *enc, return 0; } +static void trace_encoding(const char *context, const char *path, + const char *encoding, const char *buf, size_t len) +{ + static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING); + struct strbuf trace = STRBUF_INIT; + int i; + + strbuf_addf(, "%s (%s, considered %s):\n", context, path, encoding); + for (i = 0; i < len && buf; ++i) { + strbuf_addf( + ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c", + i, + (unsigned char) buf[i], + (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '), + ((i+1) % 8 && (i+1) < len ? ' ' : '\n') + ); + } + strbuf_addchars(, '\n', 1); + + trace_strbuf(, ); + strbuf_release(); +} + static const char *default_encoding = "UTF-8"; static int encode_to_git(const char *path, const char *src, size_t src_len, @@ -352,6 +375,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len, if (validate_encoding(path, enc, src, src_len, die_on_error)) return 0; + trace_encoding("source", path, enc, src, src_len); dst = reencode_string_len(src, src_len, default_encoding, enc, _len); if (!dst) { @@ -369,6 +393,7 @@ static int encode_to_git(const char *path, const char *src, size_t src_len, return 0; } } + trace_encoding("destination", path, default_encoding, dst, dst_len); strbuf_attach(buf, dst, dst_len, dst_len + 1); return 1; diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh index 1bee7b9f71..f68e282c5e 100755 --- a/t/t0028-working-tree-encoding.sh +++ b/t/t0028-working-tree-encoding.sh @@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via gitattributes' . ./test-lib.sh +GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING + test_expect_success 'setup test files' ' git config core.eol lf && -- 2.16.2
[PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
From: Lars SchneiderGit recognizes files encoded with ASCII or one of its supersets (e.g. UTF-8 or ISO-8859-1) as text files. All other encodings are usually interpreted as binary and consequently built-in Git text processing tools (e.g. 'git diff') as well as most Git web front ends do not visualize the content. Add an attribute to tell Git what encoding the user has defined for a given file. If the content is added to the index, then Git converts the content to a canonical UTF-8 representation. On checkout Git will reverse the conversion. Signed-off-by: Lars Schneider --- Documentation/gitattributes.txt | 80 ++ convert.c| 114 ++- convert.h| 1 + sha1_file.c | 2 +- t/t0028-working-tree-encoding.sh | 144 +++ 5 files changed, 339 insertions(+), 2 deletions(-) create mode 100755 t/t0028-working-tree-encoding.sh diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 30687de81a..31a4f92840 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -272,6 +272,86 @@ few exceptions. Even though... catch potential problems early, safety triggers. +`working-tree-encoding` +^^^ + +Git recognizes files encoded in ASCII or one of its supersets (e.g. +UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other +encodings (e.g. UTF-16) are interpreted as binary and consequently +built-in Git text processing tools (e.g. 'git diff') as well as most Git +web front ends do not visualize the contents of these files by default. + +In these cases you can tell Git the encoding of a file in the working +directory with the `working-tree-encoding` attribute. If a file with this +attribute is added to Git, then Git reencodes the content from the +specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded +content in its internal data structure (called "the index"). On checkout +the content is reencoded back to the specified encoding. + +Please note that using the `working-tree-encoding` attribute may have a +number of pitfalls: + +- Alternative Git implementations (e.g. JGit or libgit2) and older Git + versions (as of March 2018) do not support the `working-tree-encoding` + attribute. If you decide to use the `working-tree-encoding` attribute + in your repository, then it is strongly recommended to ensure that all + clients working with the repository support it. + + For example, Microsoft Visual Studio resources files (`*.rc`) or + PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16. + If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with + a `working-tree-encoding` enabled Git client, then `foo.ps1` will be + stored as UTF-8 internally. A client without `working-tree-encoding` + support will checkout `foo.ps1` as UTF-8 encoded file. This will + typically cause trouble for the users of this file. + + If a Git client, that does not support the `working-tree-encoding` + attribute, adds a new file `bar.ps1`, then `bar.ps1` will be + stored "as-is" internally (in this example probably as UTF-16). + A client with `working-tree-encoding` support will interpret the + internal contents as UTF-8 and try to convert it to UTF-16 on checkout. + That operation will fail and cause an error. + +- Reencoding content requires resources that might slow down certain + Git operations (e.g 'git checkout' or 'git add'). + +Use the `working-tree-encoding` attribute only if you cannot store a file +in UTF-8 encoding and if you want Git to be able to process the content +as text. + +As an example, use the following attributes if your '*.ps1' files are +UTF-16 encoded with byte order mark (BOM) and you want Git to perform +automatic line ending conversion based on your platform. + + +*.ps1 text working-tree-encoding=UTF-16 + + +Use the following attributes if your '*.ps1' files are UTF-16 little +endian encoded without BOM and you want Git to use Windows line endings +in the working directory. Please note, it is highly recommended to +explicitly define the line endings with `eol` if the `working-tree-encoding` +attribute is used to avoid ambiguity. + + +*.ps1 text working-tree-encoding=UTF-16LE eol=CRLF + + +You can get a list of all available encodings on your platform with the +following command: + + +iconv --list + + +If you do not know the encoding of a file, then you can use the `file` +command to guess the encoding: + + +file foo.ps1 + + + `ident` ^^^ diff --git a/convert.c b/convert.c index b976eb968c..aa59ecfe49 100644 --- a/convert.c +++ b/convert.c @@ -7,6 +7,7 @@
[PATCH v11 03/10] strbuf: add a case insensitive starts_with()
From: Lars SchneiderCheck in a case insensitive manner if one string is a prefix of another string. This function is used in a subsequent commit. Signed-off-by: Lars Schneider --- git-compat-util.h | 1 + strbuf.c | 9 + 2 files changed, 10 insertions(+) diff --git a/git-compat-util.h b/git-compat-util.h index 68b2ad531e..95c9b34832 100644 --- a/git-compat-util.h +++ b/git-compat-util.h @@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, va_list params); extern void set_die_is_recursing_routine(int (*routine)(void)); extern int starts_with(const char *str, const char *prefix); +extern int istarts_with(const char *str, const char *prefix); /* * If the string "str" begins with the string found in "prefix", return 1. diff --git a/strbuf.c b/strbuf.c index b635f0bdc4..99812b8488 100644 --- a/strbuf.c +++ b/strbuf.c @@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix) return 0; } +int istarts_with(const char *str, const char *prefix) +{ + for (; ; str++, prefix++) + if (!*prefix) + return 1; + else if (tolower(*str) != tolower(*prefix)) + return 0; +} + int skip_to_optional_arg_default(const char *str, const char *prefix, const char **arg, const char *def) { -- 2.16.2
[PATCH v11 05/10] utf8: add function to detect a missing UTF-16/32 BOM
From: Lars SchneiderIf the endianness is not defined in the encoding name, then let's be strict and require a BOM to avoid any encoding confusion. The is_missing_required_utf_bom() function returns true if a required BOM is missing. The Unicode standard instructs to assume big-endian if there in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used in HTML5 recommends to assume little-endian to "deal with deployed content" [3]. Strictly requiring a BOM seems to be the safest option for content in Git. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#gen6 [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf Section 3.10, D98, page 132 [3] https://encoding.spec.whatwg.org/#utf-16le Signed-off-by: Lars Schneider --- utf8.c | 13 + utf8.h | 19 +++ 2 files changed, 32 insertions(+) diff --git a/utf8.c b/utf8.c index e4b99580f0..81c6678df1 100644 --- a/utf8.c +++ b/utf8.c @@ -564,6 +564,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len) ); } +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len) +{ + return ( + (!strcasecmp(enc, "UTF-16") || !strcasecmp(enc, "UTF16")) && + !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) || +has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom))) + ) || ( + (!strcasecmp(enc, "UTF-32") || !strcasecmp(enc, "UTF32")) && + !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) || +has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom))) + ); +} + /* * Returns first character length in bytes for multi-byte `text` according to * `encoding`. diff --git a/utf8.h b/utf8.h index 0db1db4519..cce654a64a 100644 --- a/utf8.h +++ b/utf8.h @@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid */ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len); +/* + * If the endianness is not defined in the encoding name, then we + * require a BOM. The function returns true if a required BOM is missing. + * + * The Unicode standard instructs to assume big-endian if there in no + * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard + * used in HTML5 recommends to assume little-endian to "deal with + * deployed content" [3]. + * + * Therefore, strictly requiring a BOM seems to be the safest option for + * content in Git. + * + * [1] http://unicode.org/faq/utf_bom.html#gen6 + * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf + * Section 3.10, D98, page 132 + * [3] https://encoding.spec.whatwg.org/#utf-16le + */ +int is_missing_required_utf_bom(const char *enc, const char *data, size_t len); + #endif -- 2.16.2
[PATCH v11 01/10] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
From: Lars SchneiderSince 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we allocate the buffer for the lower case string with xmallocz(). This already ensures a NUL at the end of the allocated buffer. Remove the unnecessary assignment. Signed-off-by: Lars Schneider --- strbuf.c | 1 - 1 file changed, 1 deletion(-) diff --git a/strbuf.c b/strbuf.c index 1df674e919..55b7daeb35 100644 --- a/strbuf.c +++ b/strbuf.c @@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string) result = xmallocz(len); for (i = 0; i < len; i++) result[i] = tolower(string[i]); - result[i] = '\0'; return result; } -- 2.16.2
[PATCH v11 08/10] convert: advise canonical UTF encoding names
From: Lars SchneiderThe canonical name of an UTF encoding has the format UTF, dash, number, and an optionally byte order in upper case (e.g. UTF-8 or UTF-16BE). Some iconv versions support alternative names without a dash or with lower case characters. To avoid problems between different iconv version always suggest the canonical UTF names in advise messages. Signed-off-by: Lars Schneider --- convert.c| 21 + t/t0028-working-tree-encoding.sh | 10 -- 2 files changed, 25 insertions(+), 6 deletions(-) diff --git a/convert.c b/convert.c index b80d666a6b..9a3ae7cce1 100644 --- a/convert.c +++ b/convert.c @@ -279,12 +279,20 @@ static int validate_encoding(const char *path, const char *enc, "BOM is prohibited in '%s' if encoded as %s"); /* * This advice is shown for UTF-??BE and UTF-??LE encodings. +* We cut off the last two characters of the encoding name +# to generate the encoding name suitable for BOMs. */ const char *advise_msg = _( "The file '%s' contains a byte order " - "mark (BOM). Please use %.6s as " + "mark (BOM). Please use UTF-%s as " "working-tree-encoding."); - advise(advise_msg, path, enc); + const char *stripped = ""; + char *upper = xstrdup_toupper(enc); + upper[strlen(upper)-2] = '\0'; + if (!skip_prefix(upper, "UTF-", )) + skip_prefix(stripped, "UTF", ); + advise(advise_msg, path, stripped); + free(upper); if (die_on_error) die(error_msg, path, enc); else { @@ -296,10 +304,15 @@ static int validate_encoding(const char *path, const char *enc, "BOM is required in '%s' if encoded as %s"); const char *advise_msg = _( "The file '%s' is missing a byte order " - "mark (BOM). Please use %sBE or %sLE " + "mark (BOM). Please use UTF-%sBE or UTF-%sLE " "(depending on the byte order) as " "working-tree-encoding."); - advise(advise_msg, path, enc, enc); + const char *stripped = ""; + char *upper = xstrdup_toupper(enc); + if (!skip_prefix(upper, "UTF-", )) + skip_prefix(stripped, "UTF", ); + advise(advise_msg, path, stripped, stripped); + free(upper); if (die_on_error) die(error_msg, path, enc); else { diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh index e8408dfe5c..1bee7b9f71 100755 --- a/t/t0028-working-tree-encoding.sh +++ b/t/t0028-working-tree-encoding.sh @@ -74,18 +74,22 @@ do cp bebom.utf${i}be.raw bebom.utf${i}be && test_must_fail git add bebom.utf${i}be 2>err.out && test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out && + test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out && cp lebom.utf${i}le.raw lebom.utf${i}be && test_must_fail git add lebom.utf${i}be 2>err.out && test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out && + test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out && cp bebom.utf${i}be.raw bebom.utf${i}le && test_must_fail git add bebom.utf${i}le 2>err.out && test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out && + test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out && cp lebom.utf${i}le.raw lebom.utf${i}le && test_must_fail git add lebom.utf${i}le 2>err.out && - test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out && + test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out ' test_expect_success "check required UTF-${i} BOM" ' @@ -96,10 +100,12 @@ do cp nobom.utf${i}be.raw nobom.utf${i} && test_must_fail git add nobom.utf${i} 2>err.out && test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out && + test_i18ngrep "use UTF-${i}BE or UTF-${i}LE"
[PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'
From: Lars SchneiderUTF supports lossless conversion round tripping and conversions between UTF and other encodings are mostly round trip safe as Unicode aims to be a superset of all other character encodings. However, certain encodings (e.g. SHIFT-JIS) are known to have round trip issues [1]. Add 'core.checkRoundtripEncoding', which contains a comma separated list of encodings, to define for what encodings Git should check the conversion round trip if they are used in the 'working-tree-encoding' attribute. Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'. [1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode Signed-off-by: Lars Schneider --- Documentation/config.txt | 6 +++ Documentation/gitattributes.txt | 8 config.c | 5 +++ convert.c| 79 +++- convert.h| 1 + environment.c| 1 + t/t0028-working-tree-encoding.sh | 39 7 files changed, 138 insertions(+), 1 deletion(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 0e25b2c92b..7dcac9b540 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -530,6 +530,12 @@ core.autocrlf:: This variable can be set to 'input', in which case no output conversion is performed. +core.checkRoundtripEncoding:: + A comma and/or whitespace separated list of encodings that Git + performs UTF-8 round trip checks on if they are used in an + `working-tree-encoding` attribute (see linkgit:gitattributes[5]). + The default value is `SHIFT-JIS`. + core.symlinks:: If false, symbolic links are checked out as small plain files that contain the link text. linkgit:git-update-index[1] and diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index 31a4f92840..aa3deae392 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -312,6 +312,14 @@ number of pitfalls: internal contents as UTF-8 and try to convert it to UTF-16 on checkout. That operation will fail and cause an error. +- Reencoding content to non-UTF encodings can cause errors as the + conversion might not be UTF-8 round trip safe. If you suspect your + encoding to not be round trip safe, then add it to + `core.checkRoundtripEncoding` to make Git check the round trip + encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character + set) is known to have round trip issues with UTF-8 and is checked by + default. + - Reencoding content requires resources that might slow down certain Git operations (e.g 'git checkout' or 'git add'). diff --git a/config.c b/config.c index 1f003fbb90..d0ada9fcd4 100644 --- a/config.c +++ b/config.c @@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, const char *value) return 0; } + if (!strcmp(var, "core.checkroundtripencoding")) { + check_roundtrip_encoding = xstrdup(value); + return 0; + } + if (!strcmp(var, "core.notesref")) { notes_ref_name = xstrdup(value); return 0; diff --git a/convert.c b/convert.c index d739078016..c2d24882c1 100644 --- a/convert.c +++ b/convert.c @@ -347,6 +347,42 @@ static void trace_encoding(const char *context, const char *path, strbuf_release(); } +static int check_roundtrip(const char *enc_name) +{ + /* +* check_roundtrip_encoding contains a string of comma and/or +* space separated encodings (eg. "UTF-16, ASCII, CP1125"). +* Search for the given encoding in that string. +*/ + const char *found = strcasestr(check_roundtrip_encoding, enc_name); + const char *next; + int len; + if (!found) + return 0; + next = found + strlen(enc_name); + len = strlen(check_roundtrip_encoding); + return (found && ( + /* +* check that the found encoding is at the +* beginning of check_roundtrip_encoding or +* that it is prefixed with a space or comma +*/ + found == check_roundtrip_encoding || ( + (isspace(found[-1]) || found[-1] == ',') + ) + ) && ( + /* +* check that the found encoding is at the +* end of check_roundtrip_encoding or +* that it is suffixed with a space or comma +*/ + next == check_roundtrip_encoding + len || ( + next < check_roundtrip_encoding + len && + (isspace(next[0]) ||
[PATCH v11 02/10] strbuf: add xstrdup_toupper()
From: Lars SchneiderCreate a copy of an existing string and make all characters upper case. Similar xstrdup_tolower(). This function is used in a subsequent commit. Signed-off-by: Lars Schneider --- strbuf.c | 12 strbuf.h | 1 + 2 files changed, 13 insertions(+) diff --git a/strbuf.c b/strbuf.c index 55b7daeb35..b635f0bdc4 100644 --- a/strbuf.c +++ b/strbuf.c @@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string) return result; } +char *xstrdup_toupper(const char *string) +{ + char *result; + size_t len, i; + + len = strlen(string); + result = xmallocz(len); + for (i = 0; i < len; i++) + result[i] = toupper(string[i]); + return result; +} + char *xstrvfmt(const char *fmt, va_list ap) { struct strbuf buf = STRBUF_INIT; diff --git a/strbuf.h b/strbuf.h index 14c8c10d66..df7ced53ed 100644 --- a/strbuf.h +++ b/strbuf.h @@ -607,6 +607,7 @@ __attribute__((format (printf,2,3))) extern int fprintf_ln(FILE *fp, const char *fmt, ...); char *xstrdup_tolower(const char *); +char *xstrdup_toupper(const char *); /** * Create a newly allocated string using printf format. You can do this easily -- 2.16.2
[PATCH v11 00/10] convert: add support for different encodings
From: Lars SchneiderHi, Patches 1-5,9 are preparation and helper functions. Patch 6-8,10 are the actual change. Patch 8 is new. This series depends on Torsten's 8462ff43e4 (convert_to_git(): safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is already in master. Changes since v10: * rename startscase_with() to istarts_with() (Duy) * validate_encoding() advises the canonical form of the UTF encoding name to the user (Junio) --> I added it this as a separate commit that you could be dropped if desired by the reviewers. * fix documentation for roundtrip check (Junio) * use isspace() to check whitespace/tab delimiter in core.checkRoundtripEncoding (Junio) * remove dead code in roundtrip check (Junio) * fix invalid # in comment (Eric) * detect UTF8 and UTF-8 as default encoding (Eric) * make asterisk stick to the variable, not type (Junio) * print an error if "w-t-e" does not have a proper value (Junio) --> BTW: I noticed that the attribute is not set to "git_attr__false" even if I define "-working-tree-encoding". I haven't investigated further yet. Might that be a bug? If yes, then this should be addresses in a separate patch series. Thanks, Lars RFC: https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/ v1: https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/ v2: https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/ v3: https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/ v4: https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/ v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/ v6: https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/ v7: https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/ v8: https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/ v9: https://public-inbox.org/git/20180304201418.60958-1-lars.schnei...@autodesk.com/ v10: https://public-inbox.org/git/20180307173026.30058-1-lars.schnei...@autodesk.com/ Base Ref: Web-Diff: https://github.com/larsxschneider/git/commit/afc02ce2e0 Checkout: git fetch https://github.com/larsxschneider/git encoding-v11 && git checkout afc02ce2e0 ### Interdiff (v10..v11): diff --git a/Documentation/config.txt b/Documentation/config.txt index d7a56054a5..7dcac9b540 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -531,10 +531,10 @@ core.autocrlf:: in which case no output conversion is performed. core.checkRoundtripEncoding:: - A comma separated list of encodings that Git performs UTF-8 round - trip checks on if they are used in an `working-tree-encoding` - attribute (see linkgit:gitattributes[5]). The default value is - `SHIFT-JIS`. + A comma and/or whitespace separated list of encodings that Git + performs UTF-8 round trip checks on if they are used in an + `working-tree-encoding` attribute (see linkgit:gitattributes[5]). + The default value is `SHIFT-JIS`. core.symlinks:: If false, symbolic links are checked out as small plain files that diff --git a/convert.c b/convert.c index e861f1abbc..c2d24882c1 100644 --- a/convert.c +++ b/convert.c @@ -270,7 +270,7 @@ static int validate_encoding(const char *path, const char *enc, const char *data, size_t len, int die_on_error) { /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ - if (startscase_with(enc, "UTF")) { + if (istarts_with(enc, "UTF")) { /* * Check for detectable errors in UTF encodings */ @@ -284,12 +284,15 @@ static int validate_encoding(const char *path, const char *enc, */ const char *advise_msg = _( "The file '%s' contains a byte order " - "mark (BOM). Please use %s as " + "mark (BOM). Please use UTF-%s as " "working-tree-encoding."); - char *upper_enc = xstrdup_toupper(enc); - upper_enc[strlen(upper_enc)-2] = '\0'; - advise(advise_msg, path, upper_enc); - free(upper_enc); + const char *stripped = ""; + char *upper = xstrdup_toupper(enc); + upper[strlen(upper)-2] = '\0'; + if (!skip_prefix(upper, "UTF-", )) + skip_prefix(stripped, "UTF", ); + advise(advise_msg, path, stripped); + free(upper); if (die_on_error) die(error_msg, path, enc); else { @@ -301,12 +304,15 @@
Good day
Good day Blessings to you,am contacting you based on a mutual benefit inheritance transaction of ($10.5 million US dollars) that has to do with your last name contact me for more details. Contact email [ gracefollyt...@gmail.com ] Regards, Mrs Grace Folly
Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
Lars Schneiderwrites: > I think following the boost lib makes most sense. Therefore, > I would like to go with "istarts_with". OK with you? I don't care too deeply; if we took starts_with() from there, where what we now want is defined as istarts_with(), then that sounds like a good thing to do. Thanks.
Re: What's cooking in git.git (Mar 2018, #02; Tue, 6)
Martin Ågrenwrites: >> Is this ready for 'next'? > > I am not aware of any open questions or issues. You thought out loud > about how the series was structured, in particular about introducing a > successful test, then redefining it, as opposed to introducing it as a > failing test, then making it succeed. I hope I managed to motivate my > choice better in v2 (which is what you have picked up). > > Duy wondered if it was sane to use a pager when we know that we are > "--get"-ing at most one config item. In v2, I addressed this by turning > on paging for a more careful selection of "--get"-ters. Yeah, I am aware of these exchanges, and they are resolved nicely, I think. I was mostly asking if other people have concerns we haven't thought of yet. Let's merge this to 'next', then. Thanks.
Re: How to use filter-branch with --state-branch?
2018-03-09 14:23 GMT+01:00 Ian Campbell: > On Fri, 2018-03-09 at 14:04 +0100, Michele Locati wrote: >> Just a couple of questions: >> >> 1. it seems to me it's not possible to process all the branches in one >> go. Am I right? > > I'm not sure, I've never done such a thing, in fact I didn't know you > could. > > Really all this feature does is record the `.git/rewrite-map` (or > whatever the correct name is) at the end of the rewrite and reinstate > it again the next time, so it shouldn't really interact with many of > the other options. > > My method for storeing "last version processed" in a branch does > conflict I suppose (since that branch would be rewritten) but that's an > artefact of the surrounding scaffolding -- you could equally well keep > the record in some file on the local system or in a non-branch-ish ref > (I guess). > >> 2. Why do you have this line in filter.sh? >> `rm -f .git/refs/original/refs/heads/${UPSTREAM_REWRITTEN}` > > TBH I'm not really sure. I originally wrote this patch many years ago, > it's just recently that I got around to upstreaming, so my memory is > more fuzzy than might be expected. > > I think perhaps I was trying to avoid this error: > > A previous backup already exists in $orig_namespace > Force overwriting the backup with -f" > > which comes if there is an existing backup (a safety feature in the > non-incremental case). > > Note quite sure why I didn't use `-f` as the message says, but I guess > because it forces other things too which I didn't want to do? > > Perhaps what I should have done is make that check conditional on the > use of --state-branch. > > I wonder if you could use the `original/refs/...` as the "last version > processed"? Would be a lot less manual messing around than what I do! > > Ian. I managed to get a general script that seems to work: see https://github.com/mlocati/incremental-git-filter-branch Thanks again, Ian. -- Michele
Re: [RFC] Rebasing merges: a jorney to the ultimate solution(RoadClear)
Hi Junio, On Thu, 8 Mar 2018, Junio C Hamano wrote: > Johannes Schindelinwrites: > > >> If we are talking about a drastic change, a few more days may not be > >> sufficient, but we are not in a hurry, as this already sounds like a > >> 2.18 material anyway. > > > > It is not at all a drastic change. It will actually make the current patch > > series better (simplifying the "can we fast-forward?" check). > > > > I just want to make sure that I already have Phillip's strategy working, > > but it will be yet another topic branch on top of the topic branch that > > will add support for octopus merges *after* the current --recreate-merges > > topic branch ;-) > > Oh, if the "not redoing the merge afresh, but attempt to reuse the > previous merge" that was discussed is going to be done as an > update/addition to the "redo the merge afresh" you already had in > production forever (and I had in 'pu' for quite a while in various > polished-ness during iteration), then I do prefer merging down what > has already proven to be 'next' worthy without waiting for the > discussion and your local verification of Phillip's new thing, > especially given that you'll be giving an explicit control to the > users which variant of "merge" insn will be used and the addition > of the Phillip's thing won't be a backward-compatibility issue when > it comes later. I would like to stress that the `--recreate-merges` functionality has *not* beed in production forever. It is a reimplementation in pure C of the Unix shell script that has been in production for five years (unchanged only for the last half year, the last critical fix happened in January last year). So I do see the value here to use `next` as test bed, and once the patch series will hit `next`, I will also merge it into Git for Windows (as an EXPERIMENTAL feature). As such, I am quite comfortable with refactoring a bit here and there, for example how to handle the case where a `merge` can be fast-forwarded. I think the changes I made in the last few days were an actual improvement to readability, even if the only reason for those changes was that I wanted to accommodate the "rebase merge commits" thing. FWIW I am fine with bumping this down to 2.18. I got a little bit of valuable feedback at GitMerge, e.g. that --recreate-merges does not (yet) have a mode where it can update refs corresponding to the rebased commits. (This could actually turn out to be independent of --recreate-merges, even). I already pushed out the updated branch, and it can be inspected at https://github.com/git/git/pull/447 The reason I did not send this out yet is that I want to give it a final look-over myself, so that I do not waste people's time again (as I did with that monster interdiff that was pretty bogus, too). Ciao, Dscho
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
> On 07 Mar 2018, at 19:04, Eric Sunshinewrote: > > On Wed, Mar 7, 2018 at 12:30 PM, wrote: >> Check that new content is valid with respect to the user defined >> 'working-tree-encoding' attribute. >> >> Signed-off-by: Lars Schneider >> --- >> diff --git a/convert.c b/convert.c >> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct >> text_stat *stats, >> +static int validate_encoding(const char *path, const char *enc, >> + const char *data, size_t len, int die_on_error) >> +{ >> + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ >> + if (startscase_with(enc, "UTF")) { >> + /* >> +* Check for detectable errors in UTF encodings >> +*/ >> + if (has_prohibited_utf_bom(enc, data, len)) { >> + const char *error_msg = _( >> + "BOM is prohibited in '%s' if encoded as >> %s"); >> + /* >> +* This advice is shown for UTF-??BE and UTF-??LE >> encodings. >> +* We cut off the last two characters of the >> encoding name >> +# to generate the encoding name suitable for BOMs. > > s/#/*/ Of course! >> diff --git a/t/t0028-working-tree-encoding.sh >> b/t/t0028-working-tree-encoding.sh >> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes >> support' ' >> for i in 16 32 >> do >> + test_expect_success "check prohibited UTF-${i} BOM" ' >> + test_when_finished "git reset --hard HEAD" && >> + >> + echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >> >>.gitattributes && >> + echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >> >>.gitattributes && > > v10 is checking only hyphenated lowercase encoding name; earlier > versions checked uppercase. For better coverage, it would be nice to > check several combinations: all uppercase, all lowercase, mixed case, > hyphenated, not hyphenated. > > I'm not suggesting running all the tests repeatedly but rather just > varying the format of the encoding name in these tests you're adding. > For instance, the above could instead be: > >echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes > && >echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes && > > or something. The casing is a good idea - I will do that. I don't want to do "hyphenated, not hyphenated" as this would make the tests fail on macOS (and I believe on Windows). Thanks, Lars
Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()
> On 09 Mar 2018, at 00:12, Junio C Hamanowrote: > > Duy Nguyen writes: > >>> extern int starts_with(const char *str, const char *prefix); >>> +extern int startscase_with(const char *str, const char *prefix); >> >> This name is a bit hard to read. Boost [1] goes with istarts_with. I >> wonder if it's better. If not I guess either starts_with_case or >> starts_case_with will improve readability. > > starts_with_case() sounds quite strange even though > starts_with_icase() may make it clear that it is "a variant of > starts_with() function that ignores case". I dunno. > dir.c::cmp_icase() takes the _icase suffix for not quite the way > that is consistent with that understanding, though. I think following the boost lib makes most sense. Therefore, I would like to go with "istarts_with". OK with you? - Lars
git submodule update - reset instead of checkout?
Hi everyone, I've been reading up on the current state of git submodules (our customer's customers want to use them, causing a slight groan from our customer). The usability thing I wonder about - with git submodule update, is it necessary to detach the head to the current (or upstream) head, instead of resetting the current branch to that state? Primary interest in the question: Seeing 'detached head' scares almost everybody. To brainstorm: - as we can already use 'submodule update --remote' to update to the remote head of a branch, it would be logical to have that branch checked out locally (and unfortunately, potentially have that branch's name conflict with the remote branch setup). - when developers more or less accidentally commit on the detached head, all is not lost yet (I remember this being differently), but if the next 'submodule update' just resets again, the commit just made is still dropped, just as in the detached head state. - So, we'd need to have 'submodule update' act closer to the pull or rebase counterparts and refuse to just lose commits (or uncommitted changes). Having a checked-out branch in the submodule would have the advantage that I can 'just' push local commits. At the moment, doing that requires a relatively intricate dance, not at all similar to working in the base (parent) module. I'm working on hooks that automatically update the submodules after a commit change (merge, pull, checkout) in the parent module, but with the additional feature of (a) keeping a branch checked out and (b) avoid discarding local changes. Probably means I shouldn't invoke 'submodule update' at all, and deal with everyting myself. Any thoughs/comments/helpful hints? (Addional complexity: egit/jgit is in use as well, and the work model we will arrive at probabaly needs to work with the current egit.) - Andreas -- "Totally trivial. Famous last words." From: Linus TorvaldsDate: Fri, 22 Jan 2010 07:29:21 -0800
Re: [PATCH] fetch-pack.c: use oidset to check existence of loose object
2018-03-09 3:42 GMT+09:00 Junio C Hamano: > Takuto Ikuta writes: >> This patch stores existing loose objects in hashmap beforehand and use >> it to check existence instead of using lstat. >> >> With this patch, the number of lstat calls in `git fetch` is reduced >> from 411412 to 13794 for chromium repository. >> >> I took time stat of `git fetch` disabling quickfetch for chromium >> repository 3 time on linux with SSD. > > Now you drop a clue that would help to fill in the blanks above, but > I am not sure what the significance of your having to disable > quickfetch in order to take measurements---it makes it sound as if > it is an articificial problem that does not exist in real life > (i.e. when quickfetch is not disabled), but I am getting the feeling > that it is not what you wanted to say here. > Yes, I just wanted to say 'git fetch' invokes fetch-pack. fetch-pack is skipped when running git fetch repeatedly while remote has no update by quickfetch. So I disabled it to see the performance of fetch-pack. In chromium repository, master branch is updated several times in an hour, so git fetch invokes fetch-pack in such frequency. > In any case, do_fetch_pack() tries to see if all of the tip commits > we are going to fetch exist locally, so when you are trying a fetch > that grabs huge number of refs (by the way, it means that the first > sentence of the proposed log message is not quite true---it is "When > fetching a large number of refs", as it does not matter how many > refs _we_ have, no?), everything_local() ends up making repeated > calls to has_object_file_with_flags() to all of the refs. > I fixed description by changing 'refs' to 'remote refs'. In my understanding, git tries to check existence of remote refs even if we won't fetch such refs. > I like the idea---this turns "for each of these many things, check > if it exists with lstat(2)" into "enumerate what exists with > lstat(2), and then use that for the existence test"; if you need to > try N objects for existence, and you only have M objects loose where > N is vastly larger than M, it will be a huge win. If you have very > many loose objects and checking only a handful of objects for > existence check, you would lose big, though, no? > Yes. But I think the default limit for the number of loose objects, 7000, gives us small overhead when we do enumeration of all objects. >> diff --git a/fetch-pack.c b/fetch-pack.c >> index d97461296..1658487f7 100644 >> --- a/fetch-pack.c >> +++ b/fetch-pack.c >> @@ -711,6 +711,15 @@ static void mark_alternate_complete(struct object *obj) >> mark_complete(>oid); >> } >> >> +static int add_loose_objects_to_set(const struct object_id *oid, >> + const char *path, >> + void *data) >> +{ >> + struct oidset* set = (struct oidset*)(data); > > Style: in our codebase, asterisk does not stick to the type. > > struct oidset *set = (struct oidset *)(data); > >> @@ -719,16 +728,21 @@ static int everything_local(struct fetch_pack_args >> *args, >> int retval; >> int old_save_commit_buffer = save_commit_buffer; >> timestamp_t cutoff = 0; >> + struct oidset loose_oid_set = OIDSET_INIT; >> + >> + for_each_loose_object(add_loose_objects_to_set, _oid_set, 0); > > OK, so this is the "enumerate all loose objects" phase. > >> save_commit_buffer = 0; >> >> for (ref = *refs; ref; ref = ref->next) { >> struct object *o; >> + unsigned int flag = OBJECT_INFO_QUICK; > > Hmm, OBJECT_INFO_QUICK optimization was added in dfdd4afc > ("sha1_file: teach sha1_object_info_extended more flags", > 2017-06-21), but since 8b4c0103 ("sha1_file: support lazily fetching > missing objects", 2017-12-08) it appears that passing > OBJECT_INFO_QUICK down the codepath does not do anything > interesting. Jonathan (cc'ed), are all remaining hits from "git > grep OBJECT_INFO_QUICK" all dead no-ops these days? > Yes the flag is no-op now, but let me untouched the flag in this patch. >> diff --git a/sha1_file.c b/sha1_file.c >> index 1b94f39c4..c903cbcec 100644 >> --- a/sha1_file.c >> +++ b/sha1_file.c >> @@ -1262,6 +1262,9 @@ int sha1_object_info_extended(const unsigned char >> *sha1, struct object_info *oi, >> if (find_pack_entry(real, )) >> break; >> >> + if (flags & OBJECT_INFO_SKIP_LOOSE) >> + return -1; >> + > > I cannot quite convince myself that this is done at the right layer; > it smells to be at a bit too low a layer. This change makes sense > only to a caller that is interested in the existence test. If the > flag is named after what it does, i.e. "ignore loose object", then > it does sort-of make sense, though. > Couldn't come up with good alternative for this, I followed your flag name suggestion in patch v3. Takuto
Re: [PATCH] fetch-pack.c: use oidset to check existence of loose object
2018-03-09 2:19 GMT+09:00 René Scharfe: > Am 08.03.2018 um 13:06 schrieb Takuto Ikuta: >> +static int add_loose_objects_to_set(const struct object_id *oid, >> + const char *path, >> + void *data) >> +{ >> + struct oidset* set = (struct oidset*)(data); > > This cast is not needed (unlike in C++). And the asterisk should be stuck > to the variable, not the type (see Documentation/CodingGuidelines). > >> + oidset_insert(set, oid); > > In fact, you could just put "data" in here instead of "set" (without a > cast), with no loss in readability or safety. > Thank you for review, changed to use data directly in v2. >> + return 0; >> +} >> + >> static int everything_local(struct fetch_pack_args *args, >> struct ref **refs, >> struct ref **sought, int nr_sought) >> @@ -719,16 +728,21 @@ static int everything_local(struct fetch_pack_args >> *args, >> int retval; >> int old_save_commit_buffer = save_commit_buffer; >> timestamp_t cutoff = 0; >> + struct oidset loose_oid_set = OIDSET_INIT; >> + >> + for_each_loose_object(add_loose_objects_to_set, _oid_set, 0); >> >> save_commit_buffer = 0; >> >> for (ref = *refs; ref; ref = ref->next) { >> struct object *o; >> + unsigned int flag = OBJECT_INFO_QUICK; >> >> - if (!has_object_file_with_flags(>old_oid, >> - OBJECT_INFO_QUICK)) >> - continue; >> + if (!oidset_contains(_oid_set, >old_oid)) >> + flag |= OBJECT_INFO_SKIP_LOOSE; >> >> + if (!has_object_file_with_flags(>old_oid, flag)) >> + continue; >> o = parse_object(>old_oid); >> if (!o) >> continue; >> @@ -744,6 +758,8 @@ static int everything_local(struct fetch_pack_args *args, >> } >> } >> >> + oidset_clear(_oid_set); >> + > > This part looks fine to me. (Except perhaps call the variable "flags" > because you sometimes have two?) > Changed flag names. > Why not include packed objects as well? Probably because packs have > indexes which can queried quickly to determine object existence, and > because there are only few loose objects in typical repositories, > right? > Correct. In my target repository, chromium, fetch-pack's slowness comes from many lstat to non-existing loose objects for remote refs. I focus on to remove such lstat. > A similar cache was introduced by cc817ca3ef (sha1_name: cache > readdir(3) results in find_short_object_filename()) to speed up > finding unambiguous shorts object hashes. I wonder if it could be > used here as well, but I don't see an easy way. > >> if (!args->no_dependents) { >> if (!args->deepen) { >> for_each_ref(mark_complete_oid, NULL); >> diff --git a/sha1_file.c b/sha1_file.c >> index 1b94f39c4..c903cbcec 100644 >> --- a/sha1_file.c >> +++ b/sha1_file.c >> @@ -1262,6 +1262,9 @@ int sha1_object_info_extended(const unsigned char >> *sha1, struct object_info *oi, >> if (find_pack_entry(real, )) >> break; >> >> + if (flags & OBJECT_INFO_SKIP_LOOSE) >> + return -1; >> + >> /* Most likely it's a loose object. */ >> if (!sha1_loose_object_info(real, oi, flags)) >> return 0; >> > > This early return doesn't just skip checking loose objects. It > also skips reloading packs and fetching missing objects for > partial clones. That may not be a problem for fetch-pack, but > it means the flag has a misleading name. Do you get the same > performance improvement if you make it only skip that > sha1_loose_object_info() call? > I changed the flag name following Junio's suggestion. If we skip sha1_loose_object_info call, reprepare_packed_git(...) runs for every non-existing refs and git fetch becomes slower. Takuto
[PATCH v3] fetch-pack.c: use oidset to check existence of loose object
In repository having large number of remote refs, because to check existence of each refs in local repository, 'git fetch' ends up doing a lot of lstat(2) calls to see if it exists in loose form, which makes it slow. This patch enumerates loose objects in hashmap beforehand and uses it to check existence instead of using lstat(2) to improve performance of fetch-pack for repositories having large number of remote refs compared to the number of loose objects. With this patch, the number of lstat(2) calls in `git fetch` is reduced from 411412 to 13794 for chromium repository, it has more than 48 remote refs. I took time stat of `git fetch` disabling quickfetch, so that fetch-pack runs, for chromium repository 3 time on linux with SSD. * with this patch 8.105s 8.309s 7.640s avg: 8.018s * master 12.287s 11.175s 12.227s avg: 11.896s On my MacBook Air which has slower lstat(2). * with this patch 14.501s * master 1m16.027s `git fetch` on slow disk will be improved largely. Signed-off-by: Takuto Ikuta--- cache.h | 2 ++ fetch-pack.c | 26 +++--- sha1_file.c | 3 +++ 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/cache.h b/cache.h index d06932ed0..6a72f54d7 100644 --- a/cache.h +++ b/cache.h @@ -1773,6 +1773,8 @@ struct object_info { #define OBJECT_INFO_SKIP_CACHED 4 /* Do not retry packed storage after checking packed and loose storage */ #define OBJECT_INFO_QUICK 8 +/* Do not check loose object */ +#define OBJECT_INFO_IGNORE_LOOSE 16 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags); /* diff --git a/fetch-pack.c b/fetch-pack.c index d97461296..aed4aa213 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -711,6 +711,14 @@ static void mark_alternate_complete(struct object *obj) mark_complete(>oid); } +static int add_loose_objects_to_set(const struct object_id *oid, + const char *path, + void *data) +{ + oidset_insert(data, oid); + return 0; +} + static int everything_local(struct fetch_pack_args *args, struct ref **refs, struct ref **sought, int nr_sought) @@ -719,16 +727,26 @@ static int everything_local(struct fetch_pack_args *args, int retval; int old_save_commit_buffer = save_commit_buffer; timestamp_t cutoff = 0; + struct oidset loose_oid_set = OIDSET_INIT; + + /* Enumerate all loose objects. */ + for_each_loose_object(add_loose_objects_to_set, _oid_set, 0); save_commit_buffer = 0; for (ref = *refs; ref; ref = ref->next) { struct object *o; + unsigned int flags = OBJECT_INFO_QUICK; - if (!has_object_file_with_flags(>old_oid, - OBJECT_INFO_QUICK)) - continue; + if (!oidset_contains(_oid_set, >old_oid)) { + /* I know this does not exist in the loose form, +* so check if it exists in a non-loose form. +*/ + flags |= OBJECT_INFO_IGNORE_LOOSE; + } + if (!has_object_file_with_flags(>old_oid, flags)) + continue; o = parse_object(>old_oid); if (!o) continue; @@ -744,6 +762,8 @@ static int everything_local(struct fetch_pack_args *args, } } + oidset_clear(_oid_set); + if (!args->no_dependents) { if (!args->deepen) { for_each_ref(mark_complete_oid, NULL); diff --git a/sha1_file.c b/sha1_file.c index 1b94f39c4..c0a197947 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1262,6 +1262,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, if (find_pack_entry(real, )) break; + if (flags & OBJECT_INFO_IGNORE_LOOSE) + return -1; + /* Most likely it's a loose object. */ if (!sha1_loose_object_info(real, oi, flags)) return 0; -- 2.16.2
Re: How to use filter-branch with --state-branch?
On Fri, 2018-03-09 at 14:04 +0100, Michele Locati wrote: > Just a couple of questions: > > 1. it seems to me it's not possible to process all the branches in one > go. Am I right? I'm not sure, I've never done such a thing, in fact I didn't know you could. Really all this feature does is record the `.git/rewrite-map` (or whatever the correct name is) at the end of the rewrite and reinstate it again the next time, so it shouldn't really interact with many of the other options. My method for storeing "last version processed" in a branch does conflict I suppose (since that branch would be rewritten) but that's an artefact of the surrounding scaffolding -- you could equally well keep the record in some file on the local system or in a non-branch-ish ref (I guess). > 2. Why do you have this line in filter.sh? > `rm -f .git/refs/original/refs/heads/${UPSTREAM_REWRITTEN}` TBH I'm not really sure. I originally wrote this patch many years ago, it's just recently that I got around to upstreaming, so my memory is more fuzzy than might be expected. I think perhaps I was trying to avoid this error: A previous backup already exists in $orig_namespace Force overwriting the backup with -f" which comes if there is an existing backup (a safety feature in the non-incremental case). Note quite sure why I didn't use `-f` as the message says, but I guess because it forces other things too which I didn't want to do? Perhaps what I should have done is make that check conditional on the use of --state-branch. I wonder if you could use the `original/refs/...` as the "last version processed"? Would be a lot less manual messing around than what I do! Ian.
[PATCH v2 1/1] fetch-pack.c: use oidset to check existence of loose object
In repository having large number of remote refs, because to check existence of each refs in local repository, 'git fetch' ends up doing a lot of lstat(2) calls to see if it exists in loose form, which makes it slow. This patch enumerates loose objects in hashmap beforehand and uses it to check existence instead of using lstat(2) to improve performance of fetch-pack for repositories having large number of remote refs compared to the number of loose objects. With this patch, the number of lstat(2) calls in `git fetch` is reduced from 411412 to 13794 for chromium repository, it has more than 48 remote refs. I took time stat of `git fetch` disabling quickfetch, so that fetch-pack runs, for chromium repository 3 time on linux with SSD. * with this patch 8.105s 8.309s 7.640s avg: 8.018s * master 12.287s 11.175s 12.227s avg: 11.896s On my MacBook Air which has slower lstat(2). * with this patch 14.501s * master 1m16.027s `git fetch` on slow disk will be improved largely. Signed-off-by: Takuto Ikuta--- cache.h | 2 ++ fetch-pack.c | 26 +++--- sha1_file.c | 3 +++ 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/cache.h b/cache.h index d06932ed0..6a72f54d7 100644 --- a/cache.h +++ b/cache.h @@ -1773,6 +1773,8 @@ struct object_info { #define OBJECT_INFO_SKIP_CACHED 4 /* Do not retry packed storage after checking packed and loose storage */ #define OBJECT_INFO_QUICK 8 +/* Do not check loose object */ +#define OBJECT_INFO_IGNORE_LOOSE 16 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags); /* diff --git a/fetch-pack.c b/fetch-pack.c index d97461296..ef8b93424 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -711,6 +711,14 @@ static void mark_alternate_complete(struct object *obj) mark_complete(>oid); } +static int add_loose_objects_to_set(const struct object_id *oid, + const char *path, + void *data) +{ + oidset_insert(data, oid); + return 0; +} + static int everything_local(struct fetch_pack_args *args, struct ref **refs, struct ref **sought, int nr_sought) @@ -719,16 +727,26 @@ static int everything_local(struct fetch_pack_args *args, int retval; int old_save_commit_buffer = save_commit_buffer; timestamp_t cutoff = 0; + struct oidset loose_oid_set = OIDSET_INIT; + + /* Enumerate all loose objects. */ + for_each_loose_object(add_loose_objects_to_set, _oid_set, 0); save_commit_buffer = 0; for (ref = *refs; ref; ref = ref->next) { struct object *o; + unsigned int flag = OBJECT_INFO_QUICK; - if (!has_object_file_with_flags(>old_oid, - OBJECT_INFO_QUICK)) - continue; + if (!oidset_contains(_oid_set, >old_oid)) { + /* I know this does not exist in the loose form, +* so check if it exists in a non-loose form. +*/ + flag |= OBJECT_INFO_IGNORE_LOOSE; + } + if (!has_object_file_with_flags(>old_oid, flag)) + continue; o = parse_object(>old_oid); if (!o) continue; @@ -744,6 +762,8 @@ static int everything_local(struct fetch_pack_args *args, } } + oidset_clear(_oid_set); + if (!args->no_dependents) { if (!args->deepen) { for_each_ref(mark_complete_oid, NULL); diff --git a/sha1_file.c b/sha1_file.c index 1b94f39c4..c0a197947 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1262,6 +1262,9 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, if (find_pack_entry(real, )) break; + if (flags & OBJECT_INFO_IGNORE_LOOSE) + return -1; + /* Most likely it's a loose object. */ if (!sha1_loose_object_info(real, oi, flags)) return 0; -- 2.16.2
[PATCH v2 0/1] fetch-pack.c: use oidset to check existence of loose object
I removed unnecessary type cast, added comment in everthing_local(...), changed flag name and updated description. Takuto Ikuta (1): fetch-pack.c: use oidset to check existence of loose object cache.h | 2 ++ fetch-pack.c | 26 +++--- sha1_file.c | 3 +++ 3 files changed, 28 insertions(+), 3 deletions(-) -- 2.16.2
Re: How to use filter-branch with --state-branch?
2018-03-08 10:40 GMT+01:00 Ian Campbell: > > On Thu, 2018-03-08 at 10:25 +0100, Ævar Arnfjörð Bjarmason wrote: > > > > The first filter-branch call required 7168 steps, so did the second > > > call... > > > I also tried without the --prune option of remote update (I had to add > > > --force to the second filter-branch), but nothing changed. > > You can see an example of the usage in: > > https://git.kernel.org/pub/scm/linux/kernel/git/devicetree/devicetree-rebasing.git/ > > in the `scripts/` sub dir (flow is `cronjob` → `filter.sh` → `git > filter-branch...`. > > I think the big difference is rather than `--all` you need to give it > the `previous..now` range since that is the update you wish to do > (first time around you just give it `now`). > > The devicetree-rebasing scripting arranges that by keeping the previous > in a separate branch. > > Ian. Thank you for your quick reply, Ian. Just a couple of questions: 1. it seems to me it's not possible to process all the branches in one go. Am I right? 2. Why do you have this line in filter.sh? `rm -f .git/refs/original/refs/heads/${UPSTREAM_REWRITTEN}` Thank you again, Michele
Re: [PATCH v1.1 1/2] t9400-git-cvsserver-server: don't rely on the output of 'test_cmp'
On Fri, Mar 9, 2018 at 12:44 AM, Junio C Hamanowrote: > SZEDER Gábor writes: >>> This makes "rm -f failures &&" unnecessary, no? >> >> Indeed, it does. > > OK, no need to resend, as I'll amend it locally before queuing > (unless there is something else that needs updating, that is). Thanks. That 'rm' was unnecessary to begin with, because none of the previous tests wrote and left behind a file 'failures', ever. Though one could argue that the original author was careful and added that 'rm failures' as a protection against future changes to previous tests... Dunno, we are talking about a test that checked the stderr of 'test_cmp', so anything is possible :)
Re: What's cooking in git.git (Mar 2018, #02; Tue, 6)
On Fri, Mar 9, 2018 at 1:15 PM, Martin Ågrenwrote: > On 7 March 2018 at 00:34, Junio C Hamano wrote: > >> * ma/config-page-only-in-list-mode (2018-02-21) 3 commits >> - config: change default of `pager.config` to "on" >> - config: respect `pager.config` in list/get-mode only >> - t7006: add tests for how git config paginates >> >> In a way similar to how "git tag" learned to honor the pager >> setting only in the list mode, "git config" learned to ignore the >> pager setting when it is used for setting values (i.e. when the >> purpose of the operation is not to "show"). >> >> Is this ready for 'next'? > > I am not aware of any open questions or issues. You thought out loud > about how the series was structured, in particular about introducing a > successful test, then redefining it, as opposed to introducing it as a > failing test, then making it succeed. I hope I managed to motivate my > choice better in v2 (which is what you have picked up). > > Duy wondered if it was sane to use a pager when we know that we are > "--get"-ing at most one config item. In v2, I addressed this by turning > on paging for a more careful selection of "--get"-ters. Yeah I got busy with stuff and didn't look at it. I've just checked what's in 'pu'. Looks good to me. -- Duy
Re: recent glob expansion breakage on Windows?
On 03/08/18 23:03, Jonathan Nieder wrote: > +git-for-windows > Hi, > > Laszlo Ersek wrote: > >> Jaben reports that git-send-email is suddenly failing to expand the >> "*.patch" glob for him, at the Windows CMD prompt: >> >> - >> E:\...>git send-email --suppress-cc=author --suppress-cc=self >> --suppress-cc=cc --suppress-cc=sob --dry-run *.patch >> >> No patch files specified! >> - >> >> Whereas, moving the same patch files to another subdir, and then passing >> the subdir to git-send-email, works fine. >> >> I seem to have found some $^O based perl code in the git tree that >> expands the glob manually (in place of the shell) on Windows -- however, >> that code looks ancient ^W very stable, and doesn't seem to explain the >> sudden breakage. >> >> Is it possible that a separate perl update on Jaben's Windows box is >> causing this? Or does the breakage look consistent with a recent git change? >> >> Has anyone else reported something similar recently? > > This reminds me of https://github.com/git-for-windows/git/issues/339. > There, Johannes Schindelin writes (about a different command): > > | This is expected because neither PowerShell nor cmd.exe nor git.exe > | expand wildcards. Those examples you found were written with a shell > | in mind, and the shell expands wildcards (hence Git does not think > | it needs to). > > That may or may not also apply to send-email. Thank you for the reference -- I can't say whether closing issue #339 as WONTFIX was justified or not, but it certainly seems inconsistent with Jaben's earlier experience (to my understanding), i.e. that git did expand the glob. > In what version did it work? Jaben, can you please answer that? (One version in which it is broken is 2.14.1.windows.1.) Can you perhaps ask your teammates about their git/windows versions (assuming the *.patch glob is expanded correctly for them)? Thank you, Jonathan, Laszlo
Re: [PATCH] Support long format for log-based submodule diff
On Wed, Mar 7, 2018 at 1:11 PM, Robert Daileywrote: > I am experimenting with a version of submodule diff (using log style) > that prints the commits brought in from merges, while excluding the > merge commits themselves. This is useful in cases where a merge commit's > summary does not fully explain the changes being merged (for example, > for longer-lived branches). > > I could have gone through the effort to make this more configurable, but > before doing that level of work I wanted to get some discussion going to > understand first if this is a useful change and second how it should be > configured. For example, we could allow: > > $ git diff --submodule=long-log > > Or a supplementary option such as: > > $ git diff --submodule=log --submodule-log-detail=(long|short) > > I'm not sure what makes sense here. I welcome thoughts/discussion and > will provide follow-up patches. The case of merges is usually configured with --[no-]merges, or --min-parents=. I would think we would want to have different settings per repository, i.e. these settings would only apply to the superproject, however we could keep the same names for submodules, such that we could do git log --min-parents=0 --submodules=--no-merges We started an effort to have a repository object handle in most functions some time ago, but the option parsing for the revision walking doesn't take a repository yet, otherwise the generic revision parsing for submodules would be easy to implement. Thoughts on this generic approach? Stefan > > Signed-off-by: Robert Dailey > --- > submodule.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/submodule.c b/submodule.c > index 2967704317..a0a62ad7bd 100644 > --- a/submodule.c > +++ b/submodule.c > @@ -428,7 +428,8 @@ static int prepare_submodule_summary(struct rev_info > *rev, const char *path, > init_revisions(rev, NULL); > setup_revisions(0, NULL, rev, NULL); > rev->left_right = 1; > - rev->first_parent_only = 1; > + rev->max_parents = 1; > + rev->first_parent_only = 0; > left->object.flags |= SYMMETRIC_LEFT; > add_pending_object(rev, >object, path); > add_pending_object(rev, >object, path); > -- > 2.13.1.windows.2 >