Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-13 Thread Felipe Contreras
On Tue, Nov 13, 2012 at 11:15 AM, Michael J Gruber wrote:
> Felipe Contreras venit, vidit, dixit 12.11.2012 23:47:
>> On Mon, Nov 12, 2012 at 10:41 PM, Jeff King  wrote:
>>> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>>>
>>>>>   3. Exporters should not use it if they have any broken-down
>>>>>  representation at all. Even knowing that the first half is a human
>>>>>  name and the second half is something else would give it a better
>>>>>  shot at cleaning than fast-import would get.
>>>>
>>>> I'm not sure what you mean by this. If they have name and email, then
>>>> sure, it's easy.
>>>
>>> But not as easy as just printing it. What if you have this:
>>>
>>>   name="Peff  King"
>>>   email=""
>>>
>>> Concatenating them does not produce a valid git author name. Sending the
>>> concatenation through fast-import's cleanup function would lose
>>> information (namely, the location of the boundary between name and
>>> email).
>>
>> Right. Unfortunately I'm not aware of any DSCM that does that.
>>
>>> Similarly, one might have other structured data (e.g., CVS username)
>>> where the structure is a useful hint, but some conversion to name+email
>>> is still necessary.
>>
>> CVS might be the only one that has such structured data. I think in
>> subversion the username has no meaning. A 'felipec' subversion
>> username is as bad as a mercurial 'felipec' username.
>
> In subversion, the username has the clearly defined meaning of being a
> username on the subversion host. If the host is, e.g., a sourceforge
> site then I can easily look up the user profile and convert the username
> into a valid e-mail address (@users.sf.net). That is the
> advantage that the exporter (together with user knowledge) has over the
> importer.
>
> If the initial clone process aborts after every single "unknown" user
> it's no fun, of course. On the other hand, if an incremental clone
> (fetch) lets commits with unknown authors sneak in, it's no fun either
> (because I may want to fetch in crontab and publish that converted beast
> automatically). That is why I proposed neither approach.
>
> Most conveniently, the export side of a remote helper would
>
> - do "obvious" automatic lossless transformations
> - use an author map for other names

This should be done by fast-import. It doesn't make any sense for
every remote helper and fast-exporter out there to have its own way of
mapping authors (or none at all).

> - For names not covered by the above (or having an empty map entry):
> Stop exporting commits but continue parsing commits and amend the author
> map with any unknown usernames (empty entry), and warn the user.
> (crontab script can notify me based on the return code.)

Stop exporting commits but continue parsing commits? I don't know what
that means.

fast-import should try its best to clean it up and warn the user,
sure, but it should also store the missing entry in a file, so that it
can be filled in later (if the user so wishes).

> If the cloning involves a "foreign clone" (like the hg clone behind the
> scene) then the runtime of the second pass should be much smaller. In
> principle, one could even store all blobs and trees on the first run and
> skip that step on the second, but that would rely on immutability on the
> foreign side, so I dunno. (And to check the sha1, we have to get the
> blob anyways.)

No. There's no concept of partial clones... Either you clone, or you don't.

And what if the remote helper didn't notice that the author was bad?
fast-import could just leave everything up to that point, warn
about what happened, and exit, but the exporter side would die in the
middle of exporting and might end up in a bad state, not saving
marks, or who knows what.

It wouldn't work.

The cloning should be full, and the bad authors stored in a scaffold author map.
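Such a scaffold map might be nothing more than a key-value file that the importer appends unknown authors to with an empty right-hand side, for the user to fill in later. A minimal sketch in Python, assuming a hypothetical `bad = good` file format (none of this is an existing git feature):

```python
def load_author_map(path):
    """Parse lines like:  felipec = Felipe Contreras <felipe@example.com>
    (hypothetical format; names containing '=' are not handled)."""
    mapping = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                bad, _, good = line.partition('=')
                mapping[bad.strip()] = good.strip()
    except FileNotFoundError:
        pass
    return mapping

def record_unknown(path, mapping, author):
    """Append a scaffold entry (empty mapping) for a new unknown author."""
    if author not in mapping:
        mapping[author] = ''
        with open(path, 'a') as f:
            f.write('%s =\n' % author)
```

On a later run, entries the user has filled in are applied; entries still empty can trigger a warning (or a non-zero exit for the crontab case).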

> As for the format for incomplete entries (foo ), a technical
> guideline should suffice for those that follow guidelines.

fast-import should do that.

Cheers.

-- 
Felipe Contreras
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-13 Thread Michael J Gruber
Felipe Contreras venit, vidit, dixit 12.11.2012 23:47:
> On Mon, Nov 12, 2012 at 10:41 PM, Jeff King  wrote:
>> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>>
>>>>   3. Exporters should not use it if they have any broken-down
>>>>  representation at all. Even knowing that the first half is a human
>>>>  name and the second half is something else would give it a better
>>>>  shot at cleaning than fast-import would get.
>>>
>>> I'm not sure what you mean by this. If they have name and email, then
>>> sure, it's easy.
>>
>> But not as easy as just printing it. What if you have this:
>>
>>   name="Peff  King"
>>   email=""
>>
>> Concatenating them does not produce a valid git author name. Sending the
>> concatenation through fast-import's cleanup function would lose
>> information (namely, the location of the boundary between name and
>> email).
> 
> Right. Unfortunately I'm not aware of any DSCM that does that.
> 
>> Similarly, one might have other structured data (e.g., CVS username)
>> where the structure is a useful hint, but some conversion to name+email
>> is still necessary.
> 
> CVS might be the only one that has such structured data. I think in
> subversion the username has no meaning. A 'felipec' subversion
> username is as bad as a mercurial 'felipec' username.

In subversion, the username has the clearly defined meaning of being a
username on the subversion host. If the host is, e.g., a sourceforge
site then I can easily look up the user profile and convert the username
into a valid e-mail address (@users.sf.net). That is the
advantage that the exporter (together with user knowledge) has over the
importer.

If the initial clone process aborts after every single "unknown" user
it's no fun, of course. On the other hand, if an incremental clone
(fetch) lets commits with unknown authors sneak in, it's no fun either
(because I may want to fetch in crontab and publish that converted beast
automatically). That is why I proposed neither approach.

Most conveniently, the export side of a remote helper would

- do "obvious" automatic lossless transformations
- use an author map for other names
- For names not covered by the above (or having an empty map entry):
Stop exporting commits but continue parsing commits and amend the author
map with any unknown usernames (empty entry), and warn the user.
(crontab script can notify me based on the return code.)

If the cloning involves a "foreign clone" (like the hg clone behind the
scene) then the runtime of the second pass should be much smaller. In
principle, one could even store all blobs and trees on the first run and
skip that step on the second, but that would rely on immutability on the
foreign side, so I dunno. (And to check the sha1, we have to get the
blob anyways.)

As for the format for incomplete entries (foo ), a technical
guideline should suffice for those that follow guidelines.

Michael


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-12 Thread Felipe Contreras
On Mon, Nov 12, 2012 at 10:41 PM, Jeff King  wrote:
> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>
>> >   3. Exporters should not use it if they have any broken-down
>> >  representation at all. Even knowing that the first half is a human
>> >  name and the second half is something else would give it a better
>> >  shot at cleaning than fast-import would get.
>>
>> I'm not sure what you mean by this. If they have name and email, then
>> sure, it's easy.
>
> But not as easy as just printing it. What if you have this:
>
>   name="Peff  King"
>   email=""
>
> Concatenating them does not produce a valid git author name. Sending the
> concatenation through fast-import's cleanup function would lose
> information (namely, the location of the boundary between name and
> email).

Right. Unfortunately I'm not aware of any DSCM that does that.

> Similarly, one might have other structured data (e.g., CVS username)
> where the structure is a useful hint, but some conversion to name+email
> is still necessary.

CVS might be the only one that has such structured data. I think in
subversion the username has no meaning. A 'felipec' subversion
username is as bad as a mercurial 'felipec' username.

Cheers.

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-12 Thread Jeff King
On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:

> >   3. Exporters should not use it if they have any broken-down
> >  representation at all. Even knowing that the first half is a human
> >  name and the second half is something else would give it a better
> >  shot at cleaning than fast-import would get.
> 
> I'm not sure what you mean by this. If they have name and email, then
> sure, it's easy.

But not as easy as just printing it. What if you have this:

  name="Peff  King"
  email=""

Concatenating them does not produce a valid git author name. Sending the
concatenation through fast-import's cleanup function would lose
information (namely, the location of the boundary between name and
email).
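The loss is easy to demonstrate with a hypothetical broken-down ident (the values below are invented for illustration): an exporter that knows the name/email boundary can sanitize each half, while a cleanup pass that only sees the concatenation has to guess where the name ends:

```python
# Invented example: the human name itself contains angle brackets.
name = 'Peff <evil> King'
email = 'peff@example.com'

# The exporter knows the boundary and can sanitize each half separately:
clean = '%s <%s>' % (name.replace('<', '').replace('>', ''), email)
# -> 'Peff evil King <peff@example.com>'

# A generic cleanup that only sees the concatenation must guess; cutting
# at the first '<' (one plausible heuristic) truncates the name:
concatenated = '%s <%s>' % (name, email)
guessed_name = concatenated[:concatenated.index('<')].strip()
# -> 'Peff'
```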

Similarly, one might have other structured data (e.g., CVS username)
where the structure is a useful hint, but some conversion to name+email
is still necessary.

-Peff


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-12 Thread Felipe Contreras
On Mon, Nov 12, 2012 at 6:45 PM, Junio C Hamano  wrote:
> A Large Angry SCM  writes:
>
>> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>>> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM wrote:
 On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>>
> So, the options are:
>
> a) Leave the name conversion to the export tools, and when they miss
> some weird corner case, like 'Author <...>', the user has to face the
> consequences, perhaps after an hour of the process.
>
> We know there are sources of data that don't have git-formatted author
> names, so we know every tool out there must do this checking.
>
> In addition to that, let the export tool decide what to do when one of
> these bad names appears, which in many cases probably means do nothing,
> so the user would not even see that such a bad name was there, which
> might not be what they want.
>
> b) Do the name conversion in fast-import itself, perhaps optionally,
> so if a tool missed some weird corner case, the user does not have to
> face the consequences.
>
> The tool writers don't have to worry about this, so we would not have
> tools out there doing a half-assed job of this.
>
> And what happens when such bad names appear ends up being consistent: a
> warning, a scaffold mapping of bad names, etc.
>
>
> One is bad for both the users and the tool writers (only
> disadvantages); the other is good for both (only advantages).
>

 c) Do the name conversion, and whatever other cleanup and manipulations
 you're interested in, in a filter between the exporter and 
 git-fast-import.
>>>
>>> Such a filter would probably be quite complicated, and would decrease
>>> performance.
>>>
>>
>> Really?
>>
>> The fast import stream protocol is pretty simple. All the filter
>> really needs to do is pass through everything that isn't a 'commit'
>> command. And for the 'commit' command, it only needs to do something
>> with the 'author' and 'committer' lines; passing through everything
>> else.
>>
>> I agree that an additional filter _may_ decrease performance somewhat
>> if you are already CPU constrained. But I suspect that the effect
>> would be negligible compared to all of the SHA-1 calculations.
>
> More importantly, which do users prefer: quickly produce an
> incorrect result, or spend some more time to get it right?

Why not both?

If I do 'git clone hg::http://selenic.com/hg' I expect it to work, no
matter what. Then, if I care about getting it right, for example
because the project is moving to git, I check
.git/hg/origin/bad-authors and fill in the right ones.

Of course, the current remote helper framework doesn't have the option
to map authors, but it could be added. That would be better than
letting every remote helper tool have a custom way of mapping authors,
each with its own configuration.

> Because the exporting tool has a lot more intimate knowledge about
> how the names are represented in the history of the original SCM,
> canonicalization of the names, if done at that point, would likely
> give us more useful results than a canonicalization done at the
> beginning of the importer, which lacks SCM-specific details.  So in
> that sense, (a) is preferable to (b).

But it doesn't have more intimate knowledge. It has exactly the same
information as fast-import; nothing.

What intimate knowledge is a tool expected to get from this?

% hg commit -u 'Foo Bar ' -m one
% hg --debug log
changeset:   0:5ef37a2c773f02d0e01f1ecdcc59149832d294e8
tag: tip
phase:   draft
parent:  -1:
parent:  -1:
manifest:0:c6d4cd25b9fc2f83b0dd51f4acbea9486fce54d7
user:Foo Bar 
date:Sun Nov 11 18:33:00 2012 +0100
files+:  file
extra:   branch=default
description:
one

Some tools might, but if they did, then bad authors wouldn't be a problem.

> On the other hand, we would want consistency across the converted
> results no matter what SCM the history was originally in.  E.g. a
> name without email that came from CVS or SVN would consistently want
> to become "name " or "name " or whatever, and
> letting exporting tools be responsible for the canonicalization will
> lead them to create their own garbage.  In that sense, (b) can be
> better than (a).

Or 'Unknown ' or '' or '<>', or any of the forms
conversion tools have been doing for ages.

Cheers.

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-12 Thread Junio C Hamano
A Large Angry SCM  writes:

> On 11/11/2012 07:41 AM, Felipe Contreras wrote:
>> On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM wrote:
>>> On 11/10/2012 01:43 PM, Felipe Contreras wrote:
>>
 So, the options are:

 a) Leave the name conversion to the export tools, and when they miss
 some weird corner case, like 'Author <...>', the user has to face the
 consequences, perhaps after an hour of the process.

 We know there are sources of data that don't have git-formatted author
 names, so we know every tool out there must do this checking.

 In addition to that, let the export tool decide what to do when one of
 these bad names appears, which in many cases probably means do nothing,
 so the user would not even see that such a bad name was there, which
 might not be what they want.

 b) Do the name conversion in fast-import itself, perhaps optionally,
 so if a tool missed some weird corner case, the user does not have to
 face the consequences.

 The tool writers don't have to worry about this, so we would not have
 tools out there doing a half-assed job of this.

 And what happens when such bad names appear ends up being consistent: a
 warning, a scaffold mapping of bad names, etc.


 One is bad for both the users and the tool writers (only
 disadvantages); the other is good for both (only advantages).

>>>
>>> c) Do the name conversion, and whatever other cleanup and manipulations
>>> you're interested in, in a filter between the exporter and git-fast-import.
>>
>> Such a filter would probably be quite complicated, and would decrease
>> performance.
>>
>
> Really?
>
> The fast import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines; passing through everything
> else.
>
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained. But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

More importantly, which do users prefer: quickly produce an
incorrect result, or spend some more time to get it right?

Because the exporting tool has a lot more intimate knowledge about
how the names are represented in the history of the original SCM,
canonicalization of the names, if done at that point, would likely
give us more useful results than a canonicalization done at the
beginning of the importer, which lacks SCM-specific details.  So in
that sense, (a) is preferable to (b).

On the other hand, we would want consistency across the converted
results no matter what SCM the history was originally in.  E.g. a
name without email that came from CVS or SVN would consistently want
to become "name " or "name " or whatever, and
letting exporting tools be responsible for the canonicalization will
lead them to create their own garbage.  In that sense, (b) can be
better than (a).

I think (c) implements the worst of both choices. It cannot exploit
knowledge specific to the original SCM like (a) would, and while it
can enforce consistency the same way (b) would, it would be a
separate program, unlike (b).

So...



Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Felipe Contreras
On Sun, Nov 11, 2012 at 7:14 PM, Jeff King  wrote:
> On Sun, Nov 11, 2012 at 06:45:32PM +0100, Felipe Contreras wrote:
>
>> > If there is a standard filter, then what is the advantage in doing it as
>> > a pipe? Why not just teach fast-import the same trick (and possibly make
>> > it optional)? That would be simpler, more efficient, and it would make
>> > it easier for remote helpers to turn it on (they use a command-line
>> > switch rather than setting up an extra process).
>>
>> Right, but instead of a command-line switch it probably should be
>> enabled on the stream:
>>
>>   feature clean-authors
>>
>> Or something.
>
> Yeah, I was thinking it would need a feature switch to the remote helper
> to turn on the command-line, but I forgot that fast-import can take
> feature lines directly.
>
>> > We can clean up and normalize
>> > things like whitespace (and we probably should if we do not do so
>> > already). But beyond that, we have no context about the name; only the
>> > exporter has that.
>>
>> There is no context.
>
> There may not be a lot, but there is some:
>
>> These are exactly the same questions every exporter must answer. And
>> there's no answer, because the field is not a git author, it's a
>> mercurial user, or a bazaar committer, or who knows what.
>
> The exporter knows that the field is a mercurial user (or whatever).
> Fast-import does not even know that, and cannot apply any rules or
> heuristics about the format of a mercurial user string, what is common
> in the mercurial world, etc. It may not be a lot of context in some
> cases (I do not know anything about mercurial's formats, so I can't say
> what knowledge is available). But at least the exporter has a chance at
> domain-specific interpretation of the string. Fast-import has no chance,
> because it does not know the domain.
>
> I've snipped the rest of your argument, which is basically that
> mercurial does not have any context at all, and knowing that it is a
> mercurial author is useless.  I am not sure that is true; even knowing
> that it is a free-form field versus something structured (e.g., we know
> CVS authors are usernames on the server) is useful.

It is useful in the sense that we know we cannot do anything sensible
about it. All we can do is try.

> But I would agree there are probably multiple systems that are like
> mercurial in that the author field is usually something like "name
> ", but may be arbitrary text (I assume bzr is the same way, but
> you would know better than me).  So it may make sense to have some stock
> algorithm to try to convert arbitrary almost-name-and-email text into
> name and email to reduce duplication between exporters, but:

Yes, bazaar seems to be the same way.

% bzr log

revno: 1
committer: Foo Bar

>   1. It must be turned on explicitly by the exporter, since we do not
>  want to munge more structured input from clueful exporters.

Agreed.

>   2. The exporter should only turn it on after replacing its own munging
>  (e.g., it shouldn't be adding junk like ; fast-import
>  would need to receive as pristine an input as possible).

Agreed.

>   3. Exporters should not use it if they have any broken-down
>  representation at all. Even knowing that the first half is a human
>  name and the second half is something else would give it a better
>  shot at cleaning than fast-import would get.

I'm not sure what you mean by this. If they have name and email, then
sure, it's easy.

And for the record, I have encountered this problem with monotone as
well. There are quite a lot of strategies for converting names to git
authors.

Cheers.

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread A Large Angry SCM

On 11/11/2012 12:15 PM, Jeff King wrote:

On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:


a) Leave the name conversion to the export tools, and when they miss
some weird corner case, like 'Author <...>'
[...]

b) Do the name conversion in fast-import itself, perhaps optionally,
so if a tool missed some weird corner case, the user does not have to
face the consequences.

[...]

c) Do the name conversion, and whatever other cleanup and manipulations
you're interested in, in a filter between the exporter and git-fast-import.


Such a filter would probably be quite complicated, and would decrease
performance.



Really?

The fast import stream protocol is pretty simple. All the filter
really needs to do is pass through everything that isn't a 'commit'
command. And for the 'commit' command, it only needs to do something
with the 'author' and 'committer' lines; passing through everything
else.

I agree that an additional filter _may_ decrease performance somewhat
if you are already CPU constrained. But I suspect that the effect
would be negligible compared to all of the SHA-1 calculations.


It might be measurable, as you are passing every byte of every version
of every file in the repo through an extra pipe. But more importantly, I
don't think it helps.

If there is not a standard filter for fixing up names, we do not need to
care. The user can use "sed" or whatever and pay the performance penalty
(and deal with the possibility of errors from being lazy about parsing
the fast-import stream).

If there is a standard filter, then what is the advantage in doing it as
a pipe? Why not just teach fast-import the same trick (and possibly make
it optional)? That would be simpler, more efficient, and it would make
it easier for remote helpers to turn it on (they use a command-line
switch rather than setting up an extra process).

But what I don't understand is: what would such a standard filter look
like? Fast-import (or a filter) would already receive the exporter's
best attempt at a git-like ident string. We can clean up and normalize
things like whitespace (and we probably should if we do not do so
already). But beyond that, we have no context about the name; only the
exporter has that.

So if we receive:

   Foo Bar  

or:

   Foo Bar

or:

   Foo Bar

I don't think that there is or can be a standard filter. Cleaning up 
after a broken exporter is likely to always be a repository-unique 
situation. The example here is about names and email addresses, but it 
could easily be about other things (dates, history, content, etc.), some 
of which could possibly be fixed using git-filter-branch; some possibly 
not.


Fixing the exporter is always the most desirable option, but it may not 
be the best option for the particular situation. Locally modifying 
git-fast-import is another option; again, possibly not the best option. 
Convincing the git maintainers to handle your specific situation, though 
a good option for you, is not likely to be scalable. A filter in front 
of git-fast-import is always _an_ option and can be tailored to the 
particular situation.


My preference is to follow the "Unix philosophy": the tools are focused 
on what they need to do and can be composed with other tools/scripts to 
accomplish the desired result.


d) Another (bad) option is to make git-fast-import very permissive and 
warn the user to fix things via git-filter-branch before distributing 
the repository, or before git's standard repository checks find the 
problems.


This isn't my itch so I think I may have exhausted my $0.02 on this subject.


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Jeff King
On Sun, Nov 11, 2012 at 06:45:32PM +0100, Felipe Contreras wrote:

> > If there is a standard filter, then what is the advantage in doing it as
> > a pipe? Why not just teach fast-import the same trick (and possibly make
> > it optional)? That would be simpler, more efficient, and it would make
> > it easier for remote helpers to turn it on (they use a command-line
> > switch rather than setting up an extra process).
> 
> Right, but instead of a command-line switch it probably should be
> enabled on the stream:
> 
>   feature clean-authors
> 
> Or something.

Yeah, I was thinking it would need a feature switch to the remote helper
to turn on the command-line, but I forgot that fast-import can take
feature lines directly.

> > We can clean up and normalize
> > things like whitespace (and we probably should if we do not do so
> > already). But beyond that, we have no context about the name; only the
> > exporter has that.
> 
> There is no context.

There may not be a lot, but there is some:

> These are exactly the same questions every exporter must answer. And
> there's no answer, because the field is not a git author, it's a
> mercurial user, or a bazaar committer, or who knows what.

The exporter knows that the field is a mercurial user (or whatever).
Fast-import does not even know that, and cannot apply any rules or
heuristics about the format of a mercurial user string, what is common
in the mercurial world, etc. It may not be a lot of context in some
cases (I do not know anything about mercurial's formats, so I can't say
what knowledge is available). But at least the exporter has a chance at
domain-specific interpretation of the string. Fast-import has no chance,
because it does not know the domain.

I've snipped the rest of your argument, which is basically that
mercurial does not have any context at all, and knowing that it is a
mercurial author is useless.  I am not sure that is true; even knowing
that it is a free-form field versus something structured (e.g., we know
CVS authors are usernames on the server) is useful.

But I would agree there are probably multiple systems that are like
mercurial in that the author field is usually something like "name
", but may be arbitrary text (I assume bzr is the same way, but
you would know better than me).  So it may make sense to have some stock
algorithm to try to convert arbitrary almost-name-and-email text into
name and email to reduce duplication between exporters, but:

  1. It must be turned on explicitly by the exporter, since we do not
 want to munge more structured input from clueful exporters.

  2. The exporter should only turn it on after replacing its own munging
 (e.g., it shouldn't be adding junk like ; fast-import
 would need to receive as pristine an input as possible).

  3. Exporters should not use it if they have any broken-down
 representation at all. Even knowing that the first half is a human
 name and the second half is something else would give it a better
 shot at cleaning than fast-import would get.

 Alternatively, the feature could enable the exporter to pass a more
 structured ident to git.
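For what such a stock algorithm might look like, here is a sketch in Python. The regex and the fallback forms ('unknown', 'Unknown') are assumptions for illustration, not fast-import's actual behavior:

```python
import re

def clean_ident(user):
    """Best-effort split of a free-form user string into name and email.

    A sketch of the kind of stock heuristic discussed above, not an
    existing git feature; the fallback forms are invented.
    """
    user = user.strip()
    # Common case: "Name <email>"
    m = re.match(r'^(.*?)\s*<([^<>]*)>\s*$', user)
    if m:
        name, email = m.group(1), m.group(2)
    elif '@' in user and ' ' not in user:
        # Bare email address: derive a name from the local part.
        name, email = user.split('@', 1)[0], user
    else:
        # Free-form text with no usable email.
        name, email = user, 'unknown'
    # Strip characters git forbids in an ident.
    name = re.sub(r'[<>\n]', '', name).strip() or 'Unknown'
    return '%s <%s>' % (name, email)
```

Anything that falls through all the branches (the "first chapter of the LOTR" case) still comes out as a syntactically valid ident, just not a meaningful one.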

-Peff


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Felipe Contreras
On Sun, Nov 11, 2012 at 6:39 PM, A Large Angry SCM  wrote:
> On 11/11/2012 12:16 PM, Felipe Contreras wrote:

>> And how do you propose to find the commit commands without parsing all
>> the other commands? If you randomly look for lines that begin with
>> 'commit /refs' you might end up in the middle of a commit message or
>> the contents of a file.
>
> I didn't say you didn't have to parse the protocol. I said that the protocol
> is pretty simple.

Parsing is never simple.

>>> I agree that an additional filter _may_ decrease performance somewhat if
>>> you
>>> are already CPU constrained. But I suspect that the effect would be
>>> negligible compared to all of the SHA-1 calculations.
>>
>> Well. If it's so easy surely you can write one quickly, and I can measure
>> it.
>
> Not my itch; You care, you do it.

It was your idea, I don't care.

If it's so simple, why don't you do it? Because it's not that simple.
And anyway it will have a performance penalty.
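For scale, here is roughly what the minimal version of such a filter looks like in Python. It has to track 'data' byte counts so that file contents are never mistaken for commands, which is exactly why naive line matching fails; delimited data ('data <<EOF') and other details are left out, so this is a sketch, not a complete implementation:

```python
import re

def filter_stream(infile, outfile, fix):
    """Copy a fast-import stream, rewriting 'author'/'committer' lines
    with fix(). 'data <count>' payloads are copied verbatim so that file
    contents and commit messages are never mistaken for stream commands;
    this is why the filter must actually parse the protocol rather than
    grep for lines."""
    while True:
        line = infile.readline()
        if not line:
            break
        if line.startswith(b'data '):
            count = int(line.split()[1])
            outfile.write(line)
            outfile.write(infile.read(count))  # raw payload, untouched
            continue
        if line.startswith(b'author ') or line.startswith(b'committer '):
            line = fix(line)
        outfile.write(line)

# Example fixer: collapse runs of whitespace in ident lines.
collapse = lambda l: re.sub(b'[ \t]+', b' ', l)
```

It would be wired up between the two processes as `filter_stream(sys.stdin.buffer, sys.stdout.buffer, collapse)`.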

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Felipe Contreras
On Sun, Nov 11, 2012 at 6:15 PM, Jeff King  wrote:
> On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:

> If there is a standard filter, then what is the advantage in doing it as
> a pipe? Why not just teach fast-import the same trick (and possibly make
> it optional)? That would be simpler, more efficient, and it would make
> it easier for remote helpers to turn it on (they use a command-line
> switch rather than setting up an extra process).

Right, but instead of a command-line switch it probably should be
enabled on the stream:

  feature clean-authors

Or something.
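In stream terms, the proposal would put the switch next to the existing `feature` commands at the head of the stream. Note that `feature done` is a real fast-import command, but `clean-authors` is only the idea being floated in this thread, and the loose `felipec` ident below is exactly the kind of input it would clean up:

```
feature done
feature clean-authors
commit refs/heads/master
committer felipec 1352700000 +0100
data 3
one
done
```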

> But what I don't understand is: what would such a standard filter look
> like? Fast-import (or a filter) would already receive the exporter's
> best attempt at a git-like ident string.

Currently, yeah, because there's no other option. It's either try to
clean it up, or fail.

But if 'git fast-import' offered a superior alternative, I certainly would
remove my custom code and enable that feature.

> We can clean up and normalize
> things like whitespace (and we probably should if we do not do so
> already). But beyond that, we have no context about the name; only the
> exporter has that.

There is no context.

> So if we receive:
>
>   Foo Bar 
>
> or:
>
>   Foo Bar
>
> or:
>
>   Foo Bar
> what do we do with it? Is the first part a malformed name/email pair,
> and the second part is crap added by a lazy exporter? Or does the
> exporter want to keep the angle brackets as part of the name field? Is
> there a malformed email in the last one, or no email at all?

These are exactly the same questions every exporter must answer. And
there's no answer, because the field is not a git author, it's a
mercurial user, or a bazaar committer, or who knows what.

From whatever source, these all might be valid authors:
john
john  (grease)

t...@test.com
test
test 
test >t...@est.com>
test  test  com>
<>
>
<
The first chapter of the LOTR

There is no context.

> The exporter is the only program that actually knows where the data came
> from,

It doesn't matter where it came from, it's not a name/email pair.

> how it should be broken down,

It cannot be broken down, it's free-form text. Any text.

> and what is appropriate for pulling
> data out of its particular source system.

This free-form text is the lowest granularity. There is nothing else.

> For that reason, the exporter
> has to be the place where we come up with a syntactically correct and
> unambiguous ident.

*If* the exporter is able to do this, sure, but many don't have any
more information.

See:

% hg commit -u 'Foo Bar ' -m one
% hg --debug log
changeset:   0:5ef37a2c773f02d0e01f1ecdcc59149832d294e8
tag: tip
phase:   draft
parent:  -1:
parent:  -1:
manifest:0:c6d4cd25b9fc2f83b0dd51f4acbea9486fce54d7
user:Foo Bar 
date:Sun Nov 11 18:33:00 2012 +0100
files+:  file
extra:   branch=default
description:
one

What is a hg exporter tool supposed to do with that?

What such a tool can do, 'git fast-import' can do.
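As an illustration of the guessing involved, a minimal sketch (my own heuristic, not the actual rules of any exporter) of splitting hg's free-form user field into a name/email pair:

```python
import re

def hg_user_to_git_ident(user):
    # Best-effort heuristic, for illustration only. This is exactly the
    # guessing game the thread argues about -- there is no "right" split.
    m = re.match(r'^(.*?)\s*<([^<>]*)>\s*$', user)
    if m:
        name, email = m.group(1), m.group(2)
    elif '@' in user and ' ' not in user:
        name, email = '', user      # bare address: treat it as the email
    else:
        name, email = user, ''      # anything else: treat it as the name
    # git's ident format forbids angle brackets inside the name field
    name = name.replace('<', '').replace('>', '').strip()
    return name, email

print(hg_user_to_git_ident('Foo Bar <foo@bar.com>'))   # well-formed
print(hg_user_to_git_ident('john'))                    # name only
```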

> I am not opposed to adding a mailmap-like feature to fast-import to map
> identities, but it has to start with sane, unambiguous output from the
> exporter.

And if that's not possible?

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread A Large Angry SCM

On 11/11/2012 12:16 PM, Felipe Contreras wrote:

On Sun, Nov 11, 2012 at 6:00 PM, A Large Angry SCM  wrote:

On 11/11/2012 07:41 AM, Felipe Contreras wrote:



Such a filter would probably be quite complicated, and would decrease
performance.


Really?

The fast import stream protocol is pretty simple. All the filter really
needs to do is pass through everything that isn't a 'commit' command. And
for the 'commit' command, it only needs to do something with the 'author'
and 'committer' lines; passing through everything else.


And how do you propose to find the commit commands without parsing all
the other commands? If you randomly look for lines that begin with
'commit /refs' you might end up in the middle of a commit message or
the contents of a file.


I didn't say you didn't have to parse the protocol. I said that the 
protocol is pretty simple.





I agree that an additional filter _may_ decrease performance somewhat if you
are already CPU constrained. But I suspect that the effect would be
negligible compared to all of the SHA-1 calculations.


Well. If it's so easy surely you can write one quickly, and I can measure it.


Not my itch; you care, you do it.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Felipe Contreras
On Sun, Nov 11, 2012 at 6:00 PM, A Large Angry SCM  wrote:
> On 11/11/2012 07:41 AM, Felipe Contreras wrote:

>> Such a filter would probably be quite complicated, and would decrease
>> performance.
>
> Really?
>
> The fast import stream protocol is pretty simple. All the filter really
> needs to do is pass through everything that isn't a 'commit' command. And
> for the 'commit' command, it only needs to do something with the 'author'
> and 'committer' lines; passing through everything else.

And how do you propose to find the commit commands without parsing all
the other commands? If you randomly look for lines that begin with
'commit /refs' you might end up in the middle of a commit message or
the contents of a file.

> I agree that an additional filter _may_ decrease performance somewhat if you
> are already CPU constrained. But I suspect that the effect would be
> negligible compared to all of the SHA-1 calculations.

Well. If it's so easy surely you can write one quickly, and I can measure it.

Cheers.

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Jeff King
On Sun, Nov 11, 2012 at 12:00:44PM -0500, A Large Angry SCM wrote:

> >>>a) Leave the name conversion to the export tools, and when they miss
> >>>some weird corner case, like 'Author <...>', the user has to face the
> >>>consequences, perhaps after an hour of the process.
> [...]
> >>>b) Do the name conversion in fast-import itself, perhaps optionally,
> >>>so if a tool missed some weird corner case, the user does not have to
> >>>face the consequences.
> [...]
> >>c) Do the name conversion, and whatever other cleanup and manipulations
> >>you're interested in, in a filter between the exporter and git-fast-import.
> >
> >Such a filter would probably be quite complicated, and would decrease
> >performance.
> >
> 
> Really?
> 
> The fast import stream protocol is pretty simple. All the filter
> really needs to do is pass through everything that isn't a 'commit'
> command. And for the 'commit' command, it only needs to do something
> with the 'author' and 'committer' lines; passing through everything
> else.
> 
> I agree that an additional filter _may_ decrease performance somewhat
> if you are already CPU constrained. But I suspect that the effect
> would be negligible compared to all of the SHA-1 calculations.

It might be measurable, as you are passing every byte of every version
of every file in the repo through an extra pipe. But more importantly, I
don't think it helps.

If there is not a standard filter for fixing up names, we do not need to
care. The user can use "sed" or whatever and pay the performance penalty
(and deal with the possibility of errors from being lazy about parsing
the fast-import stream).

If there is a standard filter, then what is the advantage in doing it as
a pipe? Why not just teach fast-import the same trick (and possibly make
it optional)? That would be simpler, more efficient, and it would make
it easier for remote helpers to turn it on (they use a command-line
switch rather than setting up an extra process).

But what I don't understand is: what would such a standard filter look
like? Fast-import (or a filter) would already receive the exporter's
best attempt at a git-like ident string. We can clean up and normalize
things like whitespace (and we probably should if we do not do so
already). But beyond that, we have no context about the name; only the
exporter has that.

So if we receive:

  Foo Bar <...>

or:

  Foo Bar

or:

  Foo Bar <...>

what do we do with it? Is the first part a malformed name/email pair,
and the second part is crap added by a lazy exporter? Or does the
exporter want to keep the angle brackets as part of the name field? Is
there a malformed email in the last one, or no email at all?

The exporter is the only program that actually knows where the data came
from, how it should be broken down, and what is appropriate for pulling
data out of its particular source system. For that reason, the exporter
has to be the place where we come up with a syntactically correct and
unambiguous ident.

I am not opposed to adding a mailmap-like feature to fast-import to map
identities, but it has to start with sane, unambiguous output from the
exporter.

-Peff


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread A Large Angry SCM

On 11/11/2012 07:41 AM, Felipe Contreras wrote:

On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM  wrote:

On 11/10/2012 01:43 PM, Felipe Contreras wrote:



So, the options are:

a) Leave the name conversion to the export tools, and when they miss
some weird corner case, like 'Author <...>', the user has to face the
consequences, perhaps after an hour of the process.

c) Do the name conversion, and whatever other cleanup and manipulations
you're interested in, in a filter between the exporter and git-fast-import.


Such a filter would probably be quite complicated, and would decrease
performance.



Really?

The fast import stream protocol is pretty simple. All the filter really 
needs to do is pass through everything that isn't a 'commit' command. 
And for the 'commit' command, it only needs to do something with the 
'author' and 'committer' lines; passing through everything else.


I agree that an additional filter _may_ decrease performance somewhat if 
you are already CPU constrained. But I suspect that the effect would be 
negligible compared to all of the SHA-1 calculations.



Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-11 Thread Felipe Contreras
On Sat, Nov 10, 2012 at 8:25 PM, A Large Angry SCM  wrote:
> On 11/10/2012 01:43 PM, Felipe Contreras wrote:

>> So, the options are:
>>
>> a) Leave the name conversion to the export tools, and when they miss
>> some weird corner case, like 'Author <...>', the user has to face the
>> consequences, perhaps after an hour of the process.
>>
>> We know there are sources of data that don't have git-formatted author
>> names, so we know every tool out there must do this checking.
>>
>> In addition to that, let the export tool decide what to do when one of
>> these bad names appear, which in many cases probably means do nothing,
>> so the user would not even see that such a bad name was there, which
>> might not be what they want.
>>
>> b) Do the name conversion in fast-import itself, perhaps optionally,
>> so if a tool missed some weird corner case, the user does not have to
>> face the consequences.
>>
>> The tool writers don't have to worry about this, so we would not have
>> tools out there doing a half-assed job of this.
>>
>> And what happens when such bad names end up being consistent: warning,
>> a scaffold mapping of bad names, etc.
>>
>>
>> One is bad for the users, and the tools writers, only disadvantages,
>> the other is good for the users and the tools writers, only
>> advantages.
>>
>
> c) Do the name conversion, and whatever other cleanup and manipulations
> you're interested in, in a filter between the exporter and git-fast-import.

Such a filter would probably be quite complicated, and would decrease
performance.

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-10 Thread A Large Angry SCM

On 11/10/2012 01:43 PM, Felipe Contreras wrote:

On Sat, Nov 10, 2012 at 6:28 PM, Michael J Gruber
  wrote:

Felipe Contreras venit, vidit, dixit 09.11.2012 15:34:

On Fri, Nov 9, 2012 at 10:28 AM, Michael J Gruber
  wrote:


Hg seems to store just anything in the author field ("committer"). The
various interfaces that are floating around do some behind-the-back
conversion to git format. The more conversions they do, the better they
seem to work (no erroring out) but I'm wondering whether it's really a
good thing, or whether we should encourage a more diligent approach
which requires a user to map non-conforming author names wilfully.


So you propose that when somebody does 'git clone hg::hg hg-git' the
thing should fail. I hope you don't think it's too unbecoming for me
to say that I disagree.


There is no need to disagree with a proposal I haven't made. I would
disagree with the proposal that I haven't made, too.


All right, we shouldn't encourage a more diligent approach which
requires a user to map author names then.


IMO it should be git fast-import the one that converts these bad
authors, not every single tool out there. Maybe throw a warning, but
that's all. Or maybe generate a list of bad authors ready to be filled
out. That way when a project is doing a real conversion, say, when
moving to git, they can run the conversion once and see which authors
are bad and not multiple times, each try taking longer than the next.


As Jeff pointed out, git-fast-import expects output conforming to a
certain standard, and that's not going to change. import is agnostic to
where its import stream is coming from. Only the producer of that stream
can have additional information about the provenience of the stream's
data which may aid (possibly together with user input or choices) in
transforming that into something conforming.


We already know where those import streams come from:
mercurial, bazaar, etc. There's absolutely nothing the tools exporting
data from those repositories can do, except try to convert all kind of
weird names--and many tools do it poorly.

So, the options are:

a) Leave the name conversion to the export tools, and when they miss
some weird corner case, like 'Author <...>', the user has to face the
consequences, perhaps after an hour of the process.

c) Do the name conversion, and whatever other cleanup and manipulations 
you're interested in, in a filter between the exporter and git-fast-import.



Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-10 Thread Felipe Contreras
On Sat, Nov 10, 2012 at 6:28 PM, Michael J Gruber
 wrote:
> Felipe Contreras venit, vidit, dixit 09.11.2012 15:34:
>> On Fri, Nov 9, 2012 at 10:28 AM, Michael J Gruber
>>  wrote:
>>
>>> Hg seems to store just anything in the author field ("committer"). The
>>> various interfaces that are floating around do some behind-the-back
>>> conversion to git format. The more conversions they do, the better they
>>> seem to work (no erroring out) but I'm wondering whether it's really a
>>> good thing, or whether we should encourage a more diligent approach
>>> which requires a user to map non-conforming author names wilfully.
>>
>> So you propose that when somebody does 'git clone hg::hg hg-git' the
>> thing should fail. I hope you don't think it's too unbecoming for me
>> to say that I disagree.
>
> There is no need to disagree with a proposal I haven't made. I would
> disagree with the proposal that I haven't made, too.

All right, we shouldn't encourage a more diligent approach which
requires a user to map author names then.

>> IMO it should be git fast-import the one that converts these bad
>> authors, not every single tool out there. Maybe throw a warning, but
>> that's all. Or maybe generate a list of bad authors ready to be filled
>> out. That way when a project is doing a real conversion, say, when
>> moving to git, they can run the conversion once and see which authors
>> are bad and not multiple times, each try taking longer than the next.
>
> As Jeff pointed out, git-fast-import expects output conforming to a
> certain standard, and that's not going to change. import is agnostic to
> where its import stream is coming from. Only the producer of that stream
> can have additional information about the provenience of the stream's
> data which may aid (possibly together with user input or choices) in
> transforming that into something conforming.

We already know where those import streams come from:
mercurial, bazaar, etc. There's absolutely nothing the tools exporting
data from those repositories can do, except try to convert all kind of
weird names--and many tools do it poorly.

So, the options are:

a) Leave the name conversion to the export tools, and when they miss
some weird corner case, like 'Author <...>', the user has to face the
consequences, perhaps after an hour of the process.

We know there are sources of data that don't have git-formatted author
names, so we know every tool out there must do this checking.

In addition to that, let the export tool decide what to do when one of
these bad names appear, which in many cases probably means do nothing,
so the user would not even see that such a bad name was there, which
might not be what they want.

b) Do the name conversion in fast-import itself, perhaps optionally,
so if a tool missed some weird corner case, the user does not have to
face the consequences.

The tool writers don't have to worry about this, so we would not have
tools out there doing a half-assed job of this.

And what happens when such bad names end up being consistent: warning,
a scaffold mapping of bad names, etc.

One is bad for the users, and the tools writers, only disadvantages,
the other is good for the users and the tools writers, only
advantages.

-- 
Felipe Contreras


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-10 Thread Michael J Gruber
Felipe Contreras venit, vidit, dixit 09.11.2012 15:34:
> On Fri, Nov 9, 2012 at 10:28 AM, Michael J Gruber
>  wrote:
> 
>> Hg seems to store just anything in the author field ("committer"). The
>> various interfaces that are floating around do some behind-the-back
>> conversion to git format. The more conversions they do, the better they
>> seem to work (no erroring out) but I'm wondering whether it's really a
>> good thing, or whether we should encourage a more diligent approach
>> which requires a user to map non-conforming author names wilfully.
> 
> So you propose that when somebody does 'git clone hg::hg hg-git' the
> thing should fail. I hope you don't think it's too unbecoming for me
> to say that I disagree.

There is no need to disagree with a proposal I haven't made. I would
disagree with the proposal that I haven't made, too.

> IMO it should be git fast-import the one that converts these bad
> authors, not every single tool out there. Maybe throw a warning, but
> that's all. Or maybe generate a list of bad authors ready to be filled
> out. That way when a project is doing a real conversion, say, when
> moving to git, they can run the conversion once and see which authors
> are bad and not multiple times, each try taking longer than the next.

As Jeff pointed out, git-fast-import expects output conforming to a
certain standard, and that's not going to change. import is agnostic to
where its import stream is coming from. Only the producer of that stream
can have additional information about the provenience of the stream's
data which may aid (possibly together with user input or choices) in
transforming that into something conforming.

Michael


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-09 Thread Michael J Gruber
Jeff King venit, vidit, dixit 08.11.2012 21:09:
> On Fri, Nov 02, 2012 at 03:43:24PM +0100, Michael J Gruber wrote:
> 
>> It seems that our fast-import is super picky with regards to author
>> names. I've encountered author names like
>>
>> Foo Bar
>> Foo Bar > foo@dev.null
>>
>> in the self-hosting repo of some other dvcs, and the question is how to
>> translate them faithfully into a git author name.
> 
> It is not just fast-import. Git's author field looks like an rfc822
> address, but it's much simpler. It fundamentally does not allow angle
> brackets in the "name" field, regardless of any quoting. As you noted in
> your followup, we strip them out if you provide them via
> GIT_AUTHOR_NAME.
> 
> I doubt this will change anytime soon due to the compatibility fallout.
> So it is up to generators of fast-import streams to decide how to encode
> what they get from another system (you could come up with an encoding
> scheme that represents angle brackets).

I don't expect our requirements to change. For one thing, I was
surprised that git-commit is more tolerant than git-fast-import, but it
makes a lot of sense to avoid any behind-the-back conversions in the
importer.

>> In general, we try to do
>>
>> fullotherdvcsname 
>>
>> if the other system's entry does not parse as a git author name, but
>> fast-import does not accept either of
>>
>> Foo Bar 
>> "Foo Bar" 
>>
>> because of the way it parses for <>. While the above could be easily
>> turned into
>>
>> Foo Bar 
>>
>> it would not be a faithful representation of the original commit in the
>> other dvcs.
> 
> I'd think that if a remote system has names with angle brackets and
> email-looking things inside them, we would do better to stick them in
> the email field rather than putting in a useless . The latter
> should only be used for systems that lack the information.
> 
> But that is a quality-of-implementation issue for the import scripts
> (and they may even want to have options, just like git-cvsimport allows
> mapping cvs usernames into full identities).

That was more my real concern. In our cvs and svn interfaces, we even
encourage the use of author maps. For example, if you use an author map,
git-svn errors out if it encounters an svn user name which is not in the
map. On the other hand, we can map all (most?) svn user names faithfully
without using a map (e.g. to "username ").
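For reference, git-svn's `--authors-file` uses one mapping per line, an svn username on the left and a full git ident on the right; git-svn aborts when it meets a username missing from the map. The names and addresses below are placeholders:

```
jrandom = J. Random Hacker <jrandom@users.sf.net>
mjg = Michael J Gruber <mjg@example.com>
```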

Hg seems to store just anything in the author field ("committer"). The
various interfaces that are floating around do some behind-the-back
conversion to git format. The more conversions they do, the better they
seem to work (no erroring out) but I'm wondering whether it's really a
good thing, or whether we should encourage a more diligent approach
which requires a user to map non-conforming author names wilfully.

Michael


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-08 Thread Jeff King
On Fri, Nov 02, 2012 at 03:43:24PM +0100, Michael J Gruber wrote:

> It seems that our fast-import is super picky with regards to author
> names. I've encountered author names like
> 
> Foo Bar
> Foo Bar  foo@dev.null
> 
> in the self-hosting repo of some other dvcs, and the question is how to
> translate them faithfully into a git author name.

It is not just fast-import. Git's author field looks like an rfc822
address, but it's much simpler. It fundamentally does not allow angle
brackets in the "name" field, regardless of any quoting. As you noted in
your followup, we strip them out if you provide them via
GIT_AUTHOR_NAME.

I doubt this will change anytime soon due to the compatibility fallout.
So it is up to generators of fast-import streams to decide how to encode
what they get from another system (you could come up with an encoding
scheme that represents angle brackets).

> In general, we try to do
> 
> fullotherdvcsname 
> 
> if the other system's entry does not parse as a git author name, but
> fast-import does not accept either of
> 
> Foo Bar 
> "Foo Bar" 
> 
> because of the way it parses for <>. While the above could be easily
> turned into
> 
> Foo Bar 
> 
> it would not be a faithful representation of the original commit in the
> other dvcs.

I'd think that if a remote system has names with angle brackets and
email-looking things inside them, we would do better to stick them in
the email field rather than putting in a useless . The latter
should only be used for systems that lack the information.

But that is a quality-of-implementation issue for the import scripts
(and they may even want to have options, just like git-cvsimport allows
mapping cvs usernames into full identities).

-Peff


Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-02 Thread Michael J Gruber
Some additional input:

[mjg@localhost git]$ git commit --author='"is this"
' --allow-empty -m test
[detached HEAD 0734308] test
 Author: is thi...@or.not 
[mjg@localhost git]$ git show
commit 0734308b7bf372227bf9f5b9fd6b4b403df33b9e
Author: is thi...@or.not 
Date:   Fri Nov 2 15:45:23 2012 +0100

test



RFD: fast-import is picky with author names (and maybe it should - but how much so?)

2012-11-02 Thread Michael J Gruber
It seems that our fast-import is super picky with regards to author
names. I've encountered author names like

Foo Bar
Foo Bar <...> foo@dev.null

in the self-hosting repo of some other dvcs, and the question is how to
translate them faithfully into a git author name.

In general, we try to do

fullotherdvcsname <...>

if the other system's entry does not parse as a git author name, but
fast-import does not accept either of

Foo Bar 
"Foo Bar" 

because of the way it parses for <>. While the above could be easily
turned into

Foo Bar 

it would not be a faithful representation of the original commit in the
other dvcs.

So the question is:

- How should we represent botched author entries faithfully?

As a corollary, fast-import may need to change or not.

Michael

P.S.: Yes, dvcs=hg, and the "earlier" remote-hg helper chokes on these.
garbage in crash out :(