Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Tue, Nov 13, 2012 at 11:15 AM, Michael J Gruber wrote:
> Felipe Contreras venit, vidit, dixit 12.11.2012 23:47:
>> On Mon, Nov 12, 2012 at 10:41 PM, Jeff King wrote:
>>> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>>> > 3. Exporters should not use it if they have any broken-down
>>> > representation at all. Even knowing that the first half is a human
>>> > name and the second half is something else would give it a better
>>> > shot at cleaning than fast-import would get.
>>
>> I'm not sure what you mean by this. If they have name and email, then
>> sure, it's easy.
>>>
>>> But not as easy as just printing it. What if you have this:
>>>
>>> name="Peff King"
>>> email=""
>>>
>>> Concatenating them does not produce a valid git author name. Sending the
>>> concatenation through fast-import's cleanup function would lose
>>> information (namely, the location of the boundary between name and
>>> email).
>>
>> Right. Unfortunately I'm not aware of any DSCM that does that.
>>
>>> Similarly, one might have other structured data (e.g., CVS username)
>>> where the structure is a useful hint, but some conversion to name+email
>>> is still necessary.
>>
>> CVS might be the only one that has such structured data. I think in
>> subversion the username has no meaning. A 'felipec' subversion
>> username is as bad as a mercurial 'felipec' username.
>
> In subversion, the username has the clearly defined meaning of being a
> username on the subversion host. If the host is, e.g., a sourceforge
> site then I can easily look up the user profile and convert the username
> into a valid e-mail address (@users.sf.net). That is the
> advantage that the exporter (together with user knowledge) has over the
> importer.
>
> If the initial clone process aborts after every single "unknown" user
> it's no fun, of course.
> On the other hand, if an incremental clone
> (fetch) lets commits with unknown author sneak in it's no fun either
> (because I may want to fetch in crontab and publish that converted beast
> automatically). That is why I proposed neither approach.
>
> Most conveniently, the export side of a remote helper would
>
> - do "obvious" automatic lossless transformations
> - use an author map for other names

This should be done by fast-import. It doesn't make any sense that every
remote helper and fast-exporter out there has its own way of mapping
authors (or none).

> - For names not covered by the above (or having an empty map entry):
> Stop exporting commits but continue parsing commits and amend the author
> map with any unknown usernames (empty entry), and warn the user.
> (crontab script can notify me based on the return code.)

Stop exporting commits but continue parsing commits? I don't know what
that means. fast-import should try its best to clean it up, warn the
user, sure, but also store the missing entry in a file, so that it can
be filled in later (if the user so wishes).

> If the cloning involves a "foreign clone" (like the hg clone behind the
> scene) then the runtime of the second pass should be much smaller. In
> principle, one could even store all blobs and trees on the first run and
> skip that step on the second, but that would rely on immutability on the
> foreign side, so I dunno. (And to check the sha1, we have to get the
> blob anyways.)

No. There's no concept of partial clones... Either you clone, or you
don't. What if the remote helper didn't notice that the author was bad?
fast-import could just leave everything up to that point, warn about
what happened, and exit, but the exporter side would die in the middle
of exporting, and it might end up in a bad state, not saving marks, or
who knows what. It wouldn't work. The cloning should be full, and the
bad authors stored in a scaffold author map.
> As for the format for incomplete entries (foo ), a technical
> guideline should suffice for those that follow guidelines.

fast-import should do that.

Cheers.

--
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
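[Editor's sketch: the "scaffold author map" Felipe describes could work roughly like this. The file format (one "bad name = " line per unknown author, to be filled in by hand) and the helper are hypothetical illustrations, not an existing fast-import feature.]

```python
def scaffold_author_map(bad_names, existing=None):
    """Render a scaffold author map as text.

    Every author fast-import could not clean gets an empty entry that
    the user can fill in later; entries the user has already filled in
    are preserved.  Hypothetical format, not an actual git feature.
    """
    entries = dict(existing or {})
    for name in bad_names:
        # setdefault: never clobber a mapping the user already wrote.
        entries.setdefault(name, "")
    return "".join("%s = %s\n" % (k, v) for k, v in sorted(entries.items()))
```

On an incremental fetch, the remote helper would re-read this file, apply the filled-in entries, and append scaffold lines for any newly seen bad authors.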
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
Felipe Contreras venit, vidit, dixit 12.11.2012 23:47:
> On Mon, Nov 12, 2012 at 10:41 PM, Jeff King wrote:
>> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>> 3. Exporters should not use it if they have any broken-down
>> representation at all. Even knowing that the first half is a human
>> name and the second half is something else would give it a better
>> shot at cleaning than fast-import would get.
>>>
>>> I'm not sure what you mean by this. If they have name and email, then
>>> sure, it's easy.
>>
>> But not as easy as just printing it. What if you have this:
>>
>> name="Peff King"
>> email=""
>>
>> Concatenating them does not produce a valid git author name. Sending the
>> concatenation through fast-import's cleanup function would lose
>> information (namely, the location of the boundary between name and
>> email).
>
> Right. Unfortunately I'm not aware of any DSCM that does that.
>
>> Similarly, one might have other structured data (e.g., CVS username)
>> where the structure is a useful hint, but some conversion to name+email
>> is still necessary.
>
> CVS might be the only one that has such structured data. I think in
> subversion the username has no meaning. A 'felipec' subversion
> username is as bad as a mercurial 'felipec' username.

In subversion, the username has the clearly defined meaning of being a
username on the subversion host. If the host is, e.g., a sourceforge
site then I can easily look up the user profile and convert the username
into a valid e-mail address (@users.sf.net). That is the advantage that
the exporter (together with user knowledge) has over the importer.

If the initial clone process aborts after every single "unknown" user
it's no fun, of course. On the other hand, if an incremental clone
(fetch) lets commits with unknown author sneak in it's no fun either
(because I may want to fetch in crontab and publish that converted beast
automatically). That is why I proposed neither approach.
Most conveniently, the export side of a remote helper would

- do "obvious" automatic lossless transformations
- use an author map for other names
- For names not covered by the above (or having an empty map entry):
  Stop exporting commits but continue parsing commits and amend the
  author map with any unknown usernames (empty entry), and warn the
  user. (crontab script can notify me based on the return code.)

If the cloning involves a "foreign clone" (like the hg clone behind the
scene) then the runtime of the second pass should be much smaller. In
principle, one could even store all blobs and trees on the first run and
skip that step on the second, but that would rely on immutability on the
foreign side, so I dunno. (And to check the sha1, we have to get the
blob anyways.)

As for the format for incomplete entries (foo ), a technical
guideline should suffice for those that follow guidelines.

Michael
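[Editor's sketch: Michael's first two bullets — an "obvious" automatic lossless transformation plus an author map that overrides it — could look like this for Subversion. The function name, the users.sf.net default, and the example map entry are hypothetical illustrations, not code from any actual exporter.]

```python
def svn_author_to_ident(username, host_domain="users.sf.net", author_map=None):
    """Convert a Subversion username into a git-style ident string.

    The username is meaningful on the Subversion host, so appending a
    host-specific domain (e.g. users.sf.net for SourceForge projects)
    yields a plausible e-mail address.  An explicit author-map entry,
    supplied by the user, always wins over the automatic rule.
    """
    author_map = author_map or {}
    if username in author_map:
        return author_map[username]
    # Automatic lossless transformation: user -> "user <user@domain>".
    return "%s <%s@%s>" % (username, username, host_domain)
```

This is exactly the kind of host-specific knowledge the importer side cannot have: only the exporter knows which domain the usernames belong to.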
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Mon, Nov 12, 2012 at 10:41 PM, Jeff King wrote:
> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>
>> > 3. Exporters should not use it if they have any broken-down
>> > representation at all. Even knowing that the first half is a human
>> > name and the second half is something else would give it a better
>> > shot at cleaning than fast-import would get.
>>
>> I'm not sure what you mean by this. If they have name and email, then
>> sure, it's easy.
>
> But not as easy as just printing it. What if you have this:
>
> name="Peff King"
> email=""
>
> Concatenating them does not produce a valid git author name. Sending the
> concatenation through fast-import's cleanup function would lose
> information (namely, the location of the boundary between name and
> email).

Right. Unfortunately I'm not aware of any DSCM that does that.

> Similarly, one might have other structured data (e.g., CVS username)
> where the structure is a useful hint, but some conversion to name+email
> is still necessary.

CVS might be the only one that has such structured data. I think in
subversion the username has no meaning. A 'felipec' subversion
username is as bad as a mercurial 'felipec' username.

Cheers.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
> > 3. Exporters should not use it if they have any broken-down
> > representation at all. Even knowing that the first half is a human
> > name and the second half is something else would give it a better
> > shot at cleaning than fast-import would get.
>
> I'm not sure what you mean by this. If they have name and email, then
> sure, it's easy.

But not as easy as just printing it. What if you have this:

name="Peff King"
email=""

Concatenating them does not produce a valid git author name. Sending the
concatenation through fast-import's cleanup function would lose
information (namely, the location of the boundary between name and
email).

Similarly, one might have other structured data (e.g., CVS username)
where the structure is a useful hint, but some conversion to name+email
is still necessary.

-Peff
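[Editor's sketch: Peff's boundary-loss argument can be made concrete with a toy round-trip. The two helpers are hypothetical, not git's actual cleanup code: once name and email have been concatenated, a later split can only guess at the boundary, and the guess goes wrong exactly when the email is empty.]

```python
def concat_ident(name, email):
    # What a naive exporter might emit: jam the two fields together.
    return ("%s %s" % (name, email)).strip()

def split_ident(raw):
    # A downstream cleanup pass can only guess where the name ends;
    # here, everything after the last space is assumed to be the email.
    name, _, email = raw.rpartition(" ")
    return (name, email) if name else (raw, "")
```

With a real email present the round-trip happens to work, but with an empty email the second half of the name is mistaken for an address; only the exporter, which still holds the two separate fields, can avoid this.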
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Mon, Nov 12, 2012 at 6:45 PM, Junio C Hamano wrote:
> A Large Angry SCM writes:
>
>> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>>> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM
>>> wrote: On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>>
> So, the options are:
>
> a) Leave the name conversion to the export tools, and when they miss
> some weird corner case, like 'Author ', the user has to face the
> consequences, perhaps after an hour of the process.
>
> We know there are sources of data that don't have git-formatted author
> names, so we know every tool out there must do this checking.
>
> In addition to that, let the export tool decide what to do when one of
> these bad names appear, which in many cases probably means do nothing,
> so the user would not even see that such a bad name was there, which
> might not be what they want.
>
> b) Do the name conversion in fast-import itself, perhaps optionally,
> so if a tool missed some weird corner case, the user does not have to
> face the consequences.
>
> The tool writers don't have to worry about this, so we would not have
> tools out there doing a half-assed job of this.
>
> And what happens when such bad names end up being consistent: warning,
> a scaffold mapping of bad names, etc.
>
> One is bad for the users, and the tools writers, only disadvantages,
> the other is good for the users and the tools writers, only
> advantages.
>
> c) Do the name conversion, and whatever other cleanup and manipulations
> you're interested in, in a filter between the exporter and
> git-fast-import.
>>>
>>> Such a filter would probably be quite complicated, and would decrease
>>> performance.
>>
>> Really?
>>
>> The fast import stream protocol is pretty simple. All the filter
>> really needs to do is pass through everything that isn't a 'commit'
>> command. And for the 'commit' command, it only needs to do something
>> with the 'author' and 'committer' lines; passing through everything
>> else.
>>
>> I agree that an additional filter _may_ decrease performance somewhat
>> if you are already CPU constrained. But I suspect that the effect
>> would be negligible compared to all of the SHA-1 calculations.
>
> More importantly, which do users prefer: quickly produce an
> incorrect result, or spend some more time to get it right?

Why not both? If I do 'git clone hg::http://selenic.com/hg' I expect it
to work, no matter what. Then, if I care about getting it right, like
for example if the project is moving to git, then check
.git/hg/origin/bad-authors, and fill them with the right ones.

Of course, the current remote helper framework doesn't have the option
to map authors, but it could be added. That would be better than letting
every remote helper tool have a custom way of mapping authors, and also
custom configuration for them.

> Because the exporting tool has a lot more intimate knowledge about
> how the names are represented in the history of the original SCM,
> canonicalization of the names, if done at that point, would likely
> give us more useful results than a canonicalization done at the
> beginning of the importer, which lacks SCM specific details. So in
> that sense, (a) is preferable to (b).

But it doesn't have more intimate knowledge. It has exactly the same
information as fast-import; nothing. What intimate knowledge is a tool
expected to get from this?

% hg commit -u 'Foo Bar ' -m one
% hg --debug log
changeset:   0:5ef37a2c773f02d0e01f1ecdcc59149832d294e8
tag:         tip
phase:       draft
parent:      -1:
parent:      -1:
manifest:    0:c6d4cd25b9fc2f83b0dd51f4acbea9486fce54d7
user:        Foo Bar
date:        Sun Nov 11 18:33:00 2012 +0100
files+:      file
extra:       branch=default
description:
one

Some tools might, but if they did, then bad authors wouldn't be a
problem.

> On the other hand, we would want consistency across the converted
> results no matter what SCM the history was originally in. E.g.
> a name without email that came from CVS or SVN would consistently want
> to become "name " or "name " or whatever, and
> letting exporting tools be responsible for the canonicalization will
> lead them to create their own garbage. In that sense, (b) can be
> better than (a).

Or 'Unknown ' or '' or '<>', or any of the forms conversion tools have
been doing for ages.

Cheers.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
A Large Angry SCM writes:
> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM
>> wrote:
>>> On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>
>> So, the options are:
>>
>> a) Leave the name conversion to the export tools, and when they miss
>> some weird corner case, like 'Author ', the user has to face the
>> consequences, perhaps after an hour of the process.
>>
>> We know there are sources of data that don't have git-formatted author
>> names, so we know every tool out there must do this checking.
>>
>> In addition to that, let the export tool decide what to do when one of
>> these bad names appear, which in many cases probably means do nothing,
>> so the user would not even see that such a bad name was there, which
>> might not be what they want.
>>
>> b) Do the name conversion in fast-import itself, perhaps optionally,
>> so if a tool missed some weird corner case, the user does not have to
>> face the consequences.
>>
>> The tool writers don't have to worry about this, so we would not have
>> tools out there doing a half-assed job of this.
>>
>> And what happens when such bad names end up being consistent: warning,
>> a scaffold mapping of bad names, etc.
>>
>> One is bad for the users, and the tools writers, only disadvantages,
>> the other is good for the users and the tools writers, only
>> advantages.
>>>
>>> c) Do the name conversion, and whatever other cleanup and manipulations
>>> you're interested in, in a filter between the exporter and
>>> git-fast-import.
>>
>> Such a filter would probably be quite complicated, and would decrease
>> performance.
>
> Really?
>
> The fast import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines; passing through everything
> else.
>
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained.
> But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

More importantly, which do users prefer: quickly produce an
incorrect result, or spend some more time to get it right?

Because the exporting tool has a lot more intimate knowledge about
how the names are represented in the history of the original SCM,
canonicalization of the names, if done at that point, would likely
give us more useful results than a canonicalization done at the
beginning of the importer, which lacks SCM specific details. So in
that sense, (a) is preferable to (b).

On the other hand, we would want consistency across the converted
results no matter what SCM the history was originally in. E.g. a
name without email that came from CVS or SVN would consistently want
to become "name " or "name " or whatever, and letting exporting tools
be responsible for the canonicalization will lead them to create their
own garbage. In that sense, (b) can be better than (a).

I think (c) implements the worst of both choices. It cannot exploit
knowledge specific to the original SCM like (a) would, and while it can
enforce consistency the same way as (b) would, it would be a separate
program, unlike (b). So...
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 7:14 PM, Jeff King wrote:
> On Sun, Nov 11, 2012 at 06:45:32PM +0100, Felipe Contreras wrote:
>
>> > If there is a standard filter, then what is the advantage in doing it as
>> > a pipe? Why not just teach fast-import the same trick (and possibly make
>> > it optional)? That would be simpler, more efficient, and it would make
>> > it easier for remote helpers to turn it on (they use a command-line
>> > switch rather than setting up an extra process).
>>
>> Right, but instead of a command-line switch it probably should be
>> enabled on the stream:
>>
>> feature clean-authors
>>
>> Or something.
>
> Yeah, I was thinking it would need a feature switch to the remote helper
> to turn on the command-line, but I forgot that fast-import can take
> feature lines directly.
>
>> > We can clean up and normalize
>> > things like whitespace (and we probably should if we do not do so
>> > already). But beyond that, we have no context about the name; only the
>> > exporter has that.
>>
>> There is no context.
>
> There may not be a lot, but there is some:
>
>> These are exactly the same questions every exporter must answer. And
>> there's no answer, because the field is not a git author, it's a
>> mercurial user, or a bazaar committer, or who knows what.
>
> The exporter knows that the field is a mercurial user (or whatever).
> Fast-import does not even know that, and cannot apply any rules or
> heuristics about the format of a mercurial user string, what is common
> in the mercurial world, etc. It may not be a lot of context in some
> cases (I do not know anything about mercurial's formats, so I can't say
> what knowledge is available). But at least the exporter has a chance at
> domain-specific interpretation of the string. Fast-import has no chance,
> because it does not know the domain.
>
> I've snipped the rest of your argument, which is basically that
> mercurial does not have any context at all, and knowing that it is a
> mercurial author is useless.
> I am not sure that is true; even knowing
> that it is a free-form field versus something structured (e.g., we know
> CVS authors are usernames on the server) is useful.

It is useful in the sense that we know we cannot do anything sensible
about it. All we can do is try.

> But I would agree there are probably multiple systems that are like
> mercurial in that the author field is usually something like "name
> ", but may be arbitrary text (I assume bzr is the same way, but
> you would know better than me). So it may make sense to have some stock
> algorithm to try to convert arbitrary almost-name-and-email text into
> name and email to reduce duplication between exporters, but:

Yes, bazaar seems to be the same way.

% bzr log
revno: 1
committer: Foo Bar

> 1. It must be turned on explicitly by the exporter, since we do not
> want to munge more structured input from clueful exporters.

Agreed.

> 2. The exporter should only turn it on after replacing its own munging
> (e.g., it shouldn't be adding junk like ; fast-import
> would need to receive as pristine an input as possible).

Agreed.

> 3. Exporters should not use it if they have any broken-down
> representation at all. Even knowing that the first half is a human
> name and the second half is something else would give it a better
> shot at cleaning than fast-import would get.

I'm not sure what you mean by this. If they have name and email, then
sure, it's easy.

And for the record, I've encountered this problem also with monotone.
There are quite a lot of strategies to convert names to git authors.

Cheers.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On 11/11/2012 12:15 PM, Jeff King wrote:

On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:

a) Leave the name conversion to the export tools, and when they miss
some weird corner case, like 'Author [...]

b) Do the name conversion in fast-import itself, perhaps optionally, so
if a tool missed some weird corner case, the user does not have to face
the consequences. [...]

c) Do the name conversion, and whatever other cleanup and manipulations
you're interested in, in a filter between the exporter and
git-fast-import.

Such a filter would probably be quite complicated, and would decrease
performance.

Really?

The fast import stream protocol is pretty simple. All the filter really
needs to do is pass through everything that isn't a 'commit' command.
And for the 'commit' command, it only needs to do something with the
'author' and 'committer' lines; passing through everything else.

I agree that an additional filter _may_ decrease performance somewhat if
you are already CPU constrained. But I suspect that the effect would be
negligible compared to all of the SHA-1 calculations.

It might be measurable, as you are passing every byte of every version
of every file in the repo through an extra pipe. But more importantly, I
don't think it helps.

If there is not a standard filter for fixing up names, we do not need to
care. The user can use "sed" or whatever and pay the performance penalty
(and deal with the possibility of errors from being lazy about parsing
the fast-import stream).

If there is a standard filter, then what is the advantage in doing it as
a pipe? Why not just teach fast-import the same trick (and possibly make
it optional)? That would be simpler, more efficient, and it would make
it easier for remote helpers to turn it on (they use a command-line
switch rather than setting up an extra process).

But what I don't understand is: what would such a standard filter look
like?
Fast-import (or a filter) would already receive the exporter's best
attempt at a git-like ident string. We can clean up and normalize things
like whitespace (and we probably should if we do not do so already). But
beyond that, we have no context about the name; only the exporter has
that. So if we receive:

Foo Bar

or:

Foo Bar

or:

Foo Bar

I don't think that there is or can be a standard filter. Cleaning up
after a broken exporter is likely to always be a repository-unique
situation. The example here is about names and email addresses but it
could easily be about other things (dates, history, content, etc.), some
of which could possibly be fixed using git-filter-branch; some possibly
not.

Fixing the exporter is always the most desirable option, but it may not
be the best option for the particular situation. Locally modifying
git-fast-import is another option; again, possibly not the best option.
Convincing the git maintainers to handle your specific situation, though
a good option for you, is not likely to be scalable. A filter in front
of git-fast-import is always _an_ option and can be tailored to the
particular situation. My preference is to follow the "Unix philosophy":
the tools are focused on what they need to do and can be composed with
other tools/scripts to accomplish the desired result.

d) Another (bad) option is to make git-fast-import very permissive and
warn the user to fix things via git-filter-branch before distributing
the repository or git's standard repository checks find the problems.

This isn't my itch so I think I may have exhausted my $0.02 on this
subject.
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 06:45:32PM +0100, Felipe Contreras wrote:
> > If there is a standard filter, then what is the advantage in doing it as
> > a pipe? Why not just teach fast-import the same trick (and possibly make
> > it optional)? That would be simpler, more efficient, and it would make
> > it easier for remote helpers to turn it on (they use a command-line
> > switch rather than setting up an extra process).
>
> Right, but instead of a command-line switch it probably should be
> enabled on the stream:
>
> feature clean-authors
>
> Or something.

Yeah, I was thinking it would need a feature switch to the remote helper
to turn on the command-line, but I forgot that fast-import can take
feature lines directly.

> > We can clean up and normalize
> > things like whitespace (and we probably should if we do not do so
> > already). But beyond that, we have no context about the name; only the
> > exporter has that.
>
> There is no context.

There may not be a lot, but there is some:

> These are exactly the same questions every exporter must answer. And
> there's no answer, because the field is not a git author, it's a
> mercurial user, or a bazaar committer, or who knows what.

The exporter knows that the field is a mercurial user (or whatever).
Fast-import does not even know that, and cannot apply any rules or
heuristics about the format of a mercurial user string, what is common
in the mercurial world, etc. It may not be a lot of context in some
cases (I do not know anything about mercurial's formats, so I can't say
what knowledge is available). But at least the exporter has a chance at
domain-specific interpretation of the string. Fast-import has no chance,
because it does not know the domain.

I've snipped the rest of your argument, which is basically that
mercurial does not have any context at all, and knowing that it is a
mercurial author is useless.
I am not sure that is true; even knowing that it is a free-form field
versus something structured (e.g., we know CVS authors are usernames on
the server) is useful.

But I would agree there are probably multiple systems that are like
mercurial in that the author field is usually something like "name
", but may be arbitrary text (I assume bzr is the same way, but you
would know better than me). So it may make sense to have some stock
algorithm to try to convert arbitrary almost-name-and-email text into
name and email to reduce duplication between exporters, but:

1. It must be turned on explicitly by the exporter, since we do not
   want to munge more structured input from clueful exporters.

2. The exporter should only turn it on after replacing its own munging
   (e.g., it shouldn't be adding junk like ; fast-import
   would need to receive as pristine an input as possible).

3. Exporters should not use it if they have any broken-down
   representation at all. Even knowing that the first half is a human
   name and the second half is something else would give it a better
   shot at cleaning than fast-import would get.

Alternatively, the feature could enable the exporter to pass a more
structured ident to git.

-Peff
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 6:39 PM, A Large Angry SCM wrote:
> On 11/11/2012 12:16 PM, Felipe Contreras wrote:
>> And how do you propose to find the commit commands without parsing all
>> the other commands? If you randomly look for lines that begin with
>> 'commit /refs' you might end up in the middle of a commit message or
>> the contents of a file.
>
> I didn't say you didn't have to parse the protocol. I said that the
> protocol is pretty simple.

Parsing is never simple.

>>> I agree that an additional filter _may_ decrease performance somewhat
>>> if you are already CPU constrained. But I suspect that the effect
>>> would be negligible compared to all of the SHA-1 calculations.
>>
>> Well. If it's so easy surely you can write one quickly, and I can
>> measure it.
>
> Not my itch; you care, you do it.

It was your idea, I don't care. If it's so simple, why don't you do it?
Because it's not that simple. And anyway it will have a performance
penalty.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 6:15 PM, Jeff King wrote:
> On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:
>
> If there is a standard filter, then what is the advantage in doing it as
> a pipe? Why not just teach fast-import the same trick (and possibly make
> it optional)? That would be simpler, more efficient, and it would make
> it easier for remote helpers to turn it on (they use a command-line
> switch rather than setting up an extra process).

Right, but instead of a command-line switch it probably should be
enabled on the stream:

feature clean-authors

Or something.

> But what I don't understand is: what would such a standard filter look
> like? Fast-import (or a filter) would already receive the exporter's
> best attempt at a git-like ident string.

Currently, yeah, because there's no other option. It's either try to
clean it up, or fail. But if 'git fast-import' offered a superior
alternative, I certainly would remove my custom code and enable that
feature.

> We can clean up and normalize
> things like whitespace (and we probably should if we do not do so
> already). But beyond that, we have no context about the name; only the
> exporter has that.

There is no context.

> So if we receive:
>
> Foo Bar
>
> or:
>
> Foo Bar
>
> or:
>
> Foo Bar
>
> what do we do with it? Is the first part a malformed name/email pair,
> and the second part is crap added by a lazy exporter? Or does the
> exporter want to keep the angle brackets as part of the name field? Is
> there a malformed email in the last one, or no email at all?

These are exactly the same questions every exporter must answer. And
there's no answer, because the field is not a git author, it's a
mercurial user, or a bazaar committer, or who knows what.

From whatever source, these all might be valid authors:

john
john (grease)
t...@test.com
test test
test >t...@est.com>
test test com>
<>
>
<
The first chapter of the LOTR

There is no context.
> The exporter is the only program that actually knows where the data came
> from,

It doesn't matter where it came from, it's not a name/email pair.

> how it should be broken down,

It cannot be broken down, it's free-form text. Any text.

> and what is appropriate for pulling
> data out of its particular source system.

This free-form text is the lowest granularity. There is nothing else.

> For that reason, the exporter
> has to be the place where we come up with a syntactically correct and
> unambiguous ident.

*If* the exporter is able to do this, sure, but many don't have any more information. See:

  % hg commit -u 'Foo Bar ' -m one
  % hg --debug log
  changeset:   0:5ef37a2c773f02d0e01f1ecdcc59149832d294e8
  tag:         tip
  phase:       draft
  parent:      -1:
  parent:      -1:
  manifest:    0:c6d4cd25b9fc2f83b0dd51f4acbea9486fce54d7
  user:        Foo Bar
  date:        Sun Nov 11 18:33:00 2012 +0100
  files+:      file
  extra:       branch=default
  description:
  one

What is a hg exporter tool supposed to do with that? What such a tool can do, 'git fast-import' can do.

> I am not opposed to adding a mailmap-like feature to fast-import to map
> identities, but it has to start with sane, unambiguous output from the
> exporter.

And if that's not possible?

--
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
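To make the disagreement concrete: the best an exporter can do with Mercurial's free-form user field is a heuristic split. Below is a minimal sketch of such a heuristic; the function name `split_hg_user` is invented for illustration, and real tools (hg-git, remote-hg) each have their own variants.

```python
import re

def split_hg_user(user):
    """Best-effort split of Mercurial's free-form 'user' field into a
    (name, email) pair. Hypothetical sketch; real exporters differ in
    which corner cases they handle."""
    m = re.match(r'^(.*?)\s*<([^<>]*)>\s*$', user)
    if m:
        # "Name <addr>" or bare "<addr>": reuse the address as the
        # name when no name part is present.
        return (m.group(1) or m.group(2), m.group(2))
    if '@' in user and ' ' not in user:
        # A bare address: use its local part as a stand-in name.
        return (user.split('@')[0], user)
    # Anything else is opaque free-form text; there is no email to find.
    return (user, '')
```

Even this sketch has to give up on inputs like 'john (grease)' or a chapter of the LOTR, which is exactly the point being argued: wherever the input is not already git-shaped, the split is guesswork no matter which side of the pipe does it.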
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On 11/11/2012 12:16 PM, Felipe Contreras wrote:
> On Sun, Nov 11, 2012 at 6:00 PM, A Large Angry SCM wrote:
>> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>>> Such a filter would probably be quite complicated, and would decrease
>>> performance.
>>
>> Really?
>>
>> The fast import stream protocol is pretty simple. All the filter really
>> needs to do is pass through everything that isn't a 'commit' command. And
>> for the 'commit' command, it only needs to do something with the 'author'
>> and 'committer' lines; passing through everything else.
>
> And how do you propose to find the commit commands without parsing all
> the other commands? If you randomly look for lines that begin with
> 'commit /refs' you might end up in the middle of a commit message or the
> contents of a file.

I didn't say you didn't have to parse the protocol. I said that the protocol is pretty simple.

>> I agree that an additional filter _may_ decrease performance somewhat if
>> you are already CPU constrained. But I suspect that the effect would be
>> negligible compared to all of the SHA-1 calculations.
>
> Well. If it's so easy surely you can write one quickly, and I can
> measure it.

Not my itch; you care, you do it.
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 6:00 PM, A Large Angry SCM wrote:
> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>> Such a filter would probably be quite complicated, and would decrease
>> performance.
>
> Really?
>
> The fast import stream protocol is pretty simple. All the filter really
> needs to do is pass through everything that isn't a 'commit' command. And
> for the 'commit' command, it only needs to do something with the 'author'
> and 'committer' lines; passing through everything else.

And how do you propose to find the commit commands without parsing all the other commands? If you randomly look for lines that begin with 'commit /refs' you might end up in the middle of a commit message or the contents of a file.

> I agree that an additional filter _may_ decrease performance somewhat if you
> are already CPU constrained. But I suspect that the effect would be
> negligible compared to all of the SHA-1 calculations.

Well. If it's so easy surely you can write one quickly, and I can measure it.

Cheers.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:
> >>> a) Leave the name conversion to the export tools, and when they miss
> >>> some weird corner case, like 'Author
> >>> consequences, perhaps after an hour of the process.
> [...]
> >>> b) Do the name conversion in fast-import itself, perhaps optionally,
> >>> so if a tool missed some weird corner case, the user does not have to
> >>> face the consequences.
> [...]
> >> c) Do the name conversion, and whatever other cleanup and manipulations
> >> you're interested in, in a filter between the exporter and git-fast-import.
> >
> > Such a filter would probably be quite complicated, and would decrease
> > performance.
>
> Really?
>
> The fast import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines; passing through everything
> else.
>
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained. But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

It might be measurable, as you are passing every byte of every version of every file in the repo through an extra pipe.

But more importantly, I don't think it helps. If there is not a standard filter for fixing up names, we do not need to care. The user can use "sed" or whatever and pay the performance penalty (and deal with the possibility of errors from being lazy about parsing the fast-import stream).

If there is a standard filter, then what is the advantage in doing it as a pipe? Why not just teach fast-import the same trick (and possibly make it optional)? That would be simpler, more efficient, and it would make it easier for remote helpers to turn it on (they use a command-line switch rather than setting up an extra process).
But what I don't understand is: what would such a standard filter look like? Fast-import (or a filter) would already receive the exporter's best attempt at a git-like ident string. We can clean up and normalize things like whitespace (and we probably should if we do not do so already). But beyond that, we have no context about the name; only the exporter has that.

So if we receive:

  Foo Bar

or:

  Foo Bar

or:

  Foo Bar

what do we do with it? Is the first part a malformed name/email pair, and the second part is crap added by a lazy exporter? Or does the exporter want to keep the angle brackets as part of the name field? Is there a malformed email in the last one, or no email at all?

The exporter is the only program that actually knows where the data came from, how it should be broken down, and what is appropriate for pulling data out of its particular source system. For that reason, the exporter has to be the place where we come up with a syntactically correct and unambiguous ident.

I am not opposed to adding a mailmap-like feature to fast-import to map identities, but it has to start with sane, unambiguous output from the exporter.

-Peff
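The whitespace cleanup and normalization being discussed could be sketched roughly like this. It is modeled loosely on git's own ident "crud"-stripping, but it is not a faithful reimplementation, and the decision to drop non-address emails is an assumption added for illustration.

```python
def sanitize_ident(name, email):
    """Rough sketch of ident normalization, loosely modeled on git's
    crud-stripping; not git's actual code."""
    crud = ' .,:;<>"\'\n'
    # Trim leading/trailing junk characters around both fields.
    name = name.strip(crud)
    # Angle brackets and newlines can never appear inside the name field.
    name = name.replace('<', '').replace('>', '').replace('\n', ' ')
    email = email.strip(crud)
    if '@' not in email:
        email = ''  # assumption: drop non-address emails entirely
    return '%s <%s>' % (name, email)
```

Note what this sketch cannot do, which is the whole argument above: given a single undifferentiated string it has no way to know where the name ends and the email begins.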
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On 11/11/2012 07:41 AM, Felipe Contreras wrote:
> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM wrote:
>> On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>> So, the options are:
>>>
>>> a) Leave the name conversion to the export tools, and when they miss
>>> some weird corner case, like 'Author
>>
>> c) Do the name conversion, and whatever other cleanup and manipulations
>> you're interested in, in a filter between the exporter and git-fast-import.
>
> Such a filter would probably be quite complicated, and would decrease
> performance.

Really?

The fast import stream protocol is pretty simple. All the filter really needs to do is pass through everything that isn't a 'commit' command. And for the 'commit' command, it only needs to do something with the 'author' and 'committer' lines; passing through everything else.

I agree that an additional filter _may_ decrease performance somewhat if you are already CPU constrained. But I suspect that the effect would be negligible compared to all of the SHA-1 calculations.
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM wrote:
> On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>> So, the options are:
>>
>> a) Leave the name conversion to the export tools, and when they miss
>> some weird corner case, like 'Author
>> consequences, perhaps after an hour of the process.
>>
>> We know there are sources of data that don't have git-formatted author
>> names, so we know every tool out there must do this checking.
>>
>> In addition to that, let the export tool decide what to do when one of
>> these bad names appear, which in many cases probably means do nothing,
>> so the user would not even see that such a bad name was there, which
>> might not be what they want.
>>
>> b) Do the name conversion in fast-import itself, perhaps optionally,
>> so if a tool missed some weird corner case, the user does not have to
>> face the consequences.
>>
>> The tool writers don't have to worry about this, so we would not have
>> tools out there doing a half-assed job of this.
>>
>> And what happens when such bad names end up being consistent: warning,
>> a scaffold mapping of bad names, etc.
>>
>> One is bad for the users, and the tools writers, only disadvantages,
>> the other is good for the users and the tools writers, only
>> advantages.
>
> c) Do the name conversion, and whatever other cleanup and manipulations
> you're interested in, in a filter between the exporter and git-fast-import.

Such a filter would probably be quite complicated, and would decrease performance.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On 11/10/2012 01:43 PM, Felipe Contreras wrote:
> On Sat, Nov 10, 2012 at 6:28 PM, Michael J Gruber wrote:
>> Felipe Contreras venit, vidit, dixit 09.11.2012 15:34:
>>> On Fri, Nov 9, 2012 at 10:28 AM, Michael J Gruber wrote:
>>>
>>>> Hg seems to store just anything in the author field ("committer"). The
>>>> various interfaces that are floating around do some behind-the-back
>>>> conversion to git format. The more conversions they do, the better they
>>>> seem to work (no erroring out) but I'm wondering whether it's really a
>>>> good thing, or whether we should encourage a more diligent approach
>>>> which requires a user to map non-conforming author names wilfully.
>>>
>>> So you propose that when somebody does 'git clone hg::hg hg-git' the
>>> thing should fail. I hope you don't think it's too unbecoming for me
>>> to say that I disagree.
>>
>> There is no need to disagree with a proposal I haven't made. I would
>> disagree with the proposal that I haven't made, too.
>
> All right, we shouldn't encourage a more diligent approach which
> requires a user to map author names then.
>
>>> IMO it should be git fast-import the one that converts these bad
>>> authors, not every single tool out there. Maybe throw a warning, but
>>> that's all. Or maybe generate a list of bad authors ready to be filled
>>> out. That way when a project is doing a real conversion, say, when
>>> moving to git, they can run the conversion once and see which authors
>>> are bad and not multiple times, each try taking longer than the next.
>>
>> As Jeff pointed out, git-fast-import expects output conforming to a
>> certain standard, and that's not going to change. import is agnostic to
>> where its import stream is coming from. Only the producer of that stream
>> can have additional information about the provenience of the stream's
>> data which may aid (possibly together with user input or choices) in
>> transforming that into something conforming.
>
> We already know where the import of those streams come from: mercurial,
> bazaar, etc.
> There's absolutely nothing the tools exporting data from those
> repositories can do, except try to convert all kind of weird names--and
> many tools do it poorly.
>
> So, the options are:
>
> a) Leave the name conversion to the export tools, and when they miss
> some weird corner case, like 'Author

c) Do the name conversion, and whatever other cleanup and manipulations you're interested in, in a filter between the exporter and git-fast-import.
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Sat, Nov 10, 2012 at 6:28 PM, Michael J Gruber wrote: > Felipe Contreras venit, vidit, dixit 09.11.2012 15:34: >> On Fri, Nov 9, 2012 at 10:28 AM, Michael J Gruber >> wrote: >> >>> Hg seems to store just anything in the author field ("committer"). The >>> various interfaces that are floating around do some behind-the-back >>> conversion to git format. The more conversions they do, the better they >>> seem to work (no erroring out) but I'm wondering whether it's really a >>> good thing, or whether we should encourage a more diligent approach >>> which requires a user to map non-conforming author names wilfully. >> >> So you propose that when somebody does 'git clone hg::hg hg-git' the >> thing should fail. I hope you don't think it's too unbecoming for me >> to say that I disagree. > > There is no need to disagree with a proposal I haven't made. I would > disagree with the proposal that I haven't made, too. All right, we shouldn't encourage a more diligent approach which requires a user to map author names then. >> IMO it should be git fast-import the one that converts these bad >> authors, not every single tool out there. Maybe throw a warning, but >> that's all. Or maybe generate a list of bad authors ready to be filled >> out. That way when a project is doing a real conversion, say, when >> moving to git, they can run the conversion once and see which authors >> are bad and not multiple times, each try taking longer than the next. > > As Jeff pointed out, git-fast-import expects output conforming to a > certain standard, and that's not going to change. import is agnostic to > where its import stream is coming from. Only the producer of that stream > can have additional information about the provenience of the stream's > data which may aid (possibly together with user input or choices) in > transforming that into something conforming. We already know where the import of those streams come from: mercurial, bazaar, etc. 
There's absolutely nothing the tools exporting data from those repositories can do, except try to convert all kind of weird names--and many tools do it poorly.

So, the options are:

a) Leave the name conversion to the export tools, and when they miss some weird corner case, like 'Author
consequences, perhaps after an hour of the process.

We know there are sources of data that don't have git-formatted author names, so we know every tool out there must do this checking.

In addition to that, let the export tool decide what to do when one of these bad names appear, which in many cases probably means do nothing, so the user would not even see that such a bad name was there, which might not be what they want.

b) Do the name conversion in fast-import itself, perhaps optionally, so if a tool missed some weird corner case, the user does not have to face the consequences.

The tool writers don't have to worry about this, so we would not have tools out there doing a half-assed job of this.

And what happens when such bad names end up being consistent: warning, a scaffold mapping of bad names, etc.

One is bad for the users, and the tools writers, only disadvantages, the other is good for the users and the tools writers, only advantages.

--
Felipe Contreras
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
Felipe Contreras venit, vidit, dixit 09.11.2012 15:34:
> On Fri, Nov 9, 2012 at 10:28 AM, Michael J Gruber wrote:
>
>> Hg seems to store just anything in the author field ("committer"). The
>> various interfaces that are floating around do some behind-the-back
>> conversion to git format. The more conversions they do, the better they
>> seem to work (no erroring out) but I'm wondering whether it's really a
>> good thing, or whether we should encourage a more diligent approach
>> which requires a user to map non-conforming author names wilfully.
>
> So you propose that when somebody does 'git clone hg::hg hg-git' the
> thing should fail. I hope you don't think it's too unbecoming for me
> to say that I disagree.

There is no need to disagree with a proposal I haven't made. I would disagree with the proposal that I haven't made, too.

> IMO it should be git fast-import the one that converts these bad
> authors, not every single tool out there. Maybe throw a warning, but
> that's all. Or maybe generate a list of bad authors ready to be filled
> out. That way when a project is doing a real conversion, say, when
> moving to git, they can run the conversion once and see which authors
> are bad and not multiple times, each try taking longer than the next.

As Jeff pointed out, git-fast-import expects output conforming to a certain standard, and that's not going to change. import is agnostic to where its import stream is coming from. Only the producer of that stream can have additional information about the provenience of the stream's data which may aid (possibly together with user input or choices) in transforming that into something conforming.

Michael
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
Jeff King venit, vidit, dixit 08.11.2012 21:09: > On Fri, Nov 02, 2012 at 03:43:24PM +0100, Michael J Gruber wrote: > >> It seems that our fast-import is super picky with regards to author >> names. I've encountered author names like >> >> Foo Bar >> Foo Bar > foo@dev.null >> >> in the self-hosting repo of some other dvcs, and the question is how to >> translate them faithfully into a git author name. > > It is not just fast-import. Git's author field looks like an rfc822 > address, but it's much simpler. It fundamentally does not allow angle > brackets in the "name" field, regardless of any quoting. As you noted in > your followup, we strip them out if you provide them via > GIT_AUTHOR_NAME. > > I doubt this will change anytime soon due to the compatibility fallout. > So it is up to generators of fast-import streams to decide how to encode > what they get from another system (you could come up with an encoding > scheme that represents angle brackets). I don't expect our requirements to change. For one thing, I was surprised that git-commit is more tolerant than git-fast-import, but it makes a lot of sense to avoid any behind-the-back conversions in the importer. >> In general, we try to do >> >> fullotherdvcsname >> >> if the other system's entry does not parse as a git author name, but >> fast-import does not accept either of >> >> Foo Bar >> "Foo Bar" >> >> because of the way it parses for <>. While the above could be easily >> turned into >> >> Foo Bar >> >> it would not be a faithful representation of the original commit in the >> other dvcs. > > I'd think that if a remote system has names with angle brackets and > email-looking things inside them, we would do better to stick them in > the email field rather than putting in a useless . The latter > should only be used for systems that lack the information. 
> But that is a quality-of-implementation issue for the import scripts
> (and they may even want to have options, just like git-cvsimport allows
> mapping cvs usernames into full identities).

That was more my real concern. In our cvs and svn interfaces, we even encourage the use of author maps. For example, if you use an author map, git-svn errors out if it encounters an svn user name which is not in the map. On the other hand, we can map all (most?) svn user names faithfully without using a map (e.g. to "username ").

Hg seems to store just anything in the author field ("committer"). The various interfaces that are floating around do some behind-the-back conversion to git format. The more conversions they do, the better they seem to work (no erroring out) but I'm wondering whether it's really a good thing, or whether we should encourage a more diligent approach which requires a user to map non-conforming author names wilfully.

Michael
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
On Fri, Nov 02, 2012 at 03:43:24PM +0100, Michael J Gruber wrote:
> It seems that our fast-import is super picky with regards to author
> names. I've encountered author names like
>
>   Foo Bar
>   Foo Bar foo@dev.null
>
> in the self-hosting repo of some other dvcs, and the question is how to
> translate them faithfully into a git author name.

It is not just fast-import. Git's author field looks like an rfc822 address, but it's much simpler. It fundamentally does not allow angle brackets in the "name" field, regardless of any quoting. As you noted in your followup, we strip them out if you provide them via GIT_AUTHOR_NAME.

I doubt this will change anytime soon due to the compatibility fallout. So it is up to generators of fast-import streams to decide how to encode what they get from another system (you could come up with an encoding scheme that represents angle brackets).

> In general, we try to do
>
>   fullotherdvcsname
>
> if the other system's entry does not parse as a git author name, but
> fast-import does not accept either of
>
>   Foo Bar
>   "Foo Bar"
>
> because of the way it parses for <>. While the above could be easily
> turned into
>
>   Foo Bar
>
> it would not be a faithful representation of the original commit in the
> other dvcs.

I'd think that if a remote system has names with angle brackets and email-looking things inside them, we would do better to stick them in the email field rather than putting in a useless . The latter should only be used for systems that lack the information.

But that is a quality-of-implementation issue for the import scripts (and they may even want to have options, just like git-cvsimport allows mapping cvs usernames into full identities).

-Peff
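The shape Peff describes — no angle brackets in the name, exactly one bracketed email — can be approximated with a regex like the one below. This is an illustration of the constraint, not git's actual parser, which among other things accepts several date formats this sketch ignores.

```python
import re

# Approximation of the ident fast-import expects after 'author' or
# 'committer'. Illustrative only; git's real parser differs in details.
IDENT_RE = re.compile(
    r'^(?:(?P<name>[^<>\n]*) )?'   # name: angle brackets never allowed
    r'<(?P<email>[^<>\n]*)> '      # exactly one bracketed email
    r'(?P<when>\d+) '              # seconds since epoch
    r'(?P<tz>[+-]\d{4})$')         # timezone offset

def is_valid_ident(s):
    return IDENT_RE.match(s) is not None
```

It makes the compatibility point concrete: an extra '<' or '>' anywhere in the name part can never be quoted into validity, so a foreign ident containing them must be rewritten, not escaped.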
Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)
Some additional input:

  [mjg@localhost git]$ git commit --author='"is this" ' --allow-empty -m test
  [detached HEAD 0734308] test
   Author: is thi...@or.not
  [mjg@localhost git]$ git show
  commit 0734308b7bf372227bf9f5b9fd6b4b403df33b9e
  Author: is thi...@or.not
  Date:   Fri Nov 2 15:45:23 2012 +0100

      test
RFD: fast-import is picky with author names (and maybe it should - but how much so?)
It seems that our fast-import is super picky with regards to author names. I've encountered author names like

  Foo Bar
  Foo Bar

in the self-hosting repo of some other dvcs, and the question is how to translate them faithfully into a git author name. In general, we try to do

  fullotherdvcsname

if the other system's entry does not parse as a git author name, but fast-import does not accept either of

  Foo Bar
  "Foo Bar"

because of the way it parses for <>. While the above could be easily turned into

  Foo Bar

it would not be a faithful representation of the original commit in the other dvcs.

So the question is:

- How should we represent botched author entries faithfully?

As a corollary, fast-import may need to change or not.

Michael

P.S.: Yes, dvcs=hg, and the "earlier" remote-hg helper chokes on these. garbage in crash out :(