Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

Michael J Gruber Tue, 13 Nov 2012 02:15:32 -0800

Felipe Contreras venit, vidit, dixit 12.11.2012 23:47:
> On Mon, Nov 12, 2012 at 10:41 PM, Jeff King <[email protected]> wrote:
>> On Sun, Nov 11, 2012 at 07:48:14PM +0100, Felipe Contreras wrote:
>>
>>>>   3. Exporters should not use it if they have any broken-down
>>>>      representation at all. Even knowing that the first half is a human
>>>>      name and the second half is something else would give it a better
>>>>      shot at cleaning than fast-import would get.
>>>
>>> I'm not sure what you mean by this. If they have name and email, then
>>> sure, it's easy.
>>
>> But not as easy as just printing it. What if you have this:
>>
>>   name="Peff <angle brackets> King"
>>   email="<[email protected]>"
>>
>> Concatenating them does not produce a valid git author name. Sending the
>> concatenation through fast-import's cleanup function would lose
>> information (namely, the location of the boundary between name and
>> email).
> 
> Right. Unfortunately I'm not aware of any DSCM that does that.
> 
>> Similarly, one might have other structured data (e.g., CVS username)
>> where the structure is a useful hint, but some conversion to name+email
>> is still necessary.
> 
> CVS might be the only one that has such structured data. I think in
> subversion the username has no meaning. A 'felipec' subversion
> username is as bad as a mercurial 'felipec' username.


In subversion, the username has the clearly defined meaning of being a
username on the subversion host. If the host is, e.g., a sourceforge
site then I can easily look up the user profile and convert the username
into a valid e-mail address (<username>@users.sf.net). That is the
advantage that the exporter (together with user knowledge) has over the
importer.

If the initial clone process aborts after every single "unknown" user
it's no fun, of course. On the other hand, if an incremental clone
(fetch) let's commits with unknown author sneak in it's no fun either
(because I may want to fetch in crontab and publish that converted beast
automatically). That is why I proposed neither approach.

Most conveniently, the export side of a remote helper would

- do "obvious" automatic lossless transformations
- use an author map for other names
- For names not covered by the above (or having an empty map entry):
Stop exporting commits but continue parsing commits and amend the author
map with any unknown usernames (empty entry), and warn the user.
(crontab script can notify me based on the return code.)

If the cloning involves a "foreign clone" (like the hg clone behind the
scene) then the runtime of the second pass should be much smaller. In
principle, one could even store all blobs and trees on the first run and
skip that step on the second, but that would rely on immutability on the
foreign side, so I dunno. (And to check the sha1, we have to get the
blob anyways.)

As for the format for incomplete entries (foo <some@where>), a technical
guideline should suffice for those that follow guidelines.

Michael
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFD: fast-import is picky with author names (and maybe it should - but how much so?)

Reply via email to