On Wed, Aug 25, 2010 at 13:03, Max Bowsher <m...@f2s.com> wrote:
> On 25/08/10 09:18, Magnus Hagander wrote:
>> On Wed, Aug 25, 2010 at 07:11, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmh...@gmail.com> writes:
>
>>>> 2. Any non-ASCII characters in, for example, contributor's names show
>>>> up differently in the two repos.  Generally, the original repo is OK
>>>> and the new repo is garbled; although I found one very old example
>>>> that went the other way.
>>>
>>> What it looks like to me is that a Latin1->UTF8 conversion has been
>>> applied to the log text.  Which might be a good idea if it all *was*
>>> Latin1, but a fair-sized percentage isn't.  Applying this conversion to
>>> UTF8 entries results in garbage, of course.  Even if this could be done
>>> reliably, I think this counts as editorializing on the historical
>>> record, and should be switched off if possible.
>>
>> I think the problem is that we have a mix of them :( git requires it to be
>> utf8.
>>
>> cvs2git is configured to try, in order, latin1, utf8 and ascii, and
>> use whichever first returns correct result. In this case it seems it
>> does return saying things are right, because the result is valid utf8
>> - just not the utf8 we expected.
>>
>> I can give it a try the other way around - trying utf8 *before*
>> latin1, to see if that makes it better - utf8 tends to be more strict.
>
> *Every* byte sequence is valid latin1, therefore if you try latin1,
> utf8, ascii in that order, latin1 will always be used.
>
> You most likely want utf8, latin1 (no point also including ascii since
> it's a strict subset of latin1).
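To make the ordering point above concrete, here is a minimal Python sketch
of a "first encoding that decodes cleanly wins" fallback. It is only an
illustration of the idea, not cvs2git's actual code: the helper name
decode_with_fallback and the sample name "José" are made up for the example.

```python
def decode_with_fallback(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw cleanly.

    Hypothetical helper for illustration; not taken from cvs2git.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the input")


# The same (made-up) contributor name stored in two different encodings.
utf8_bytes = "José".encode("utf-8")      # b'Jos\xc3\xa9'
latin1_bytes = "José".encode("latin-1")  # b'Jos\xe9'

# latin1 first: decoding never fails, so UTF-8 input is silently garbled.
print(decode_with_fallback(utf8_bytes, ["latin-1", "utf-8", "ascii"]))
# ('JosÃ©', 'latin-1')

# utf8 first: real UTF-8 is kept, and Latin-1 input falls through to latin1.
print(decode_with_fallback(utf8_bytes, ["utf-8", "latin-1"]))
# ('José', 'utf-8')
print(decode_with_fallback(latin1_bytes, ["utf-8", "latin-1"]))
# ('José', 'latin-1')
```

Trying utf8 first is still a heuristic: a Latin-1 byte sequence that happens
to be valid UTF-8 would be decoded as UTF-8, although that is rare for real
names and log text.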

Yup. I re-ran it with utf8, latin1, ascii and that commit looks better now.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)