Hi Peter, On 5 December 2017 at 11:55, Peter Uhnák <[email protected]> wrote: >> Would you be open to a PR that incorporates Sven's changes suggested >> above? > > Certainly. One thing to keep in mind is that I had to modify > MemoryStore/MemoryHandle so it can process unicode at all, but it also > forces everything to be unicode.
This is probably a good thing. The fast-import documentation says that it is strict about character encoding and only accepts UTF-8. I'll let you know when the PRs are ready. Thanks, Alistair >> I hate having to deal with string encoding, and know as little as >> possible. :-) > > Apparently me taking this approach resulted in the current situation. :-D > > Peter > > On Tue, Dec 5, 2017 at 10:27 AM, Alistair Grant <[email protected]> > wrote: >> >> On Tue, Dec 05, 2017 at 08:56:53AM +0100, Sven Van Caekenberghe wrote: >> > >> > > On 5 Dec 2017, at 08:34, Alistair Grant <[email protected]> wrote: >> > > >> > > On 5 December 2017 at 03:41, Martin Dias <[email protected]> wrote: >> > >> I suspect it's related to the large number of commits in my repo. I >> > >> made >> > >> some tweaks and succeeded to create the fast-import file. But I get: >> > >> >> > >> fatal: Unsupported command: . >> > >> fast-import: dumping crash report to .git/fast_import_crash_10301 >> > >> >> > >> Do you recognize this error? >> > >> I will check my changes tweaking the git-migration tool to see if I >> > >> modified >> > >> some behavior my mistake... >> > > >> > > I had the same error just last night. >> > > >> > > In my case, it turned out to be a non-UTF8 encoded character in one of >> > > the commit messages. >> > > >> > > I tracked it down by looking at the crash report and searching for a >> > > nearby command. I've deleted the crash reports now, but I think it >> > > was the number associated with a mark command that got me near the >> > > problem character in the fast-import file. >> > > >> > > I also modified the code to halt whenever it found a non-UTF8 >> > > character. I'm sure there are better ways to do this, but: >> > > >> > > >> > > GitMigrationCommitInfo>>inlineDataFor: aString >> > > >> > > | theString | >> > > theString := aString. >> > > "Ensure the message has valid UTF-8 encoding (this will raise an >> > > error if it doesn't)" >> > > [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString >> > > asByteArray ] >> > > on: Error >> > > do: [ :ex | self halt: 'Illegal string encoding'. >> > > theString := aString select: [ :each | each asciiValue >> > > between: 32 and: 128 ] ]. >> > > ^ 'data ' , theString size asString , String cr , (theString >> > > ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ]) >> > >> > There is also ByteArray>>#utf8Decoded (as well as >> > String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So >> > you could write it shorter as: >> > >> > aString asByteArray utf8Decoded >> > >> > Instead of Error you could use the more intention revealing >> > ZnCharacterEncodingError. >> >> I agree, this was a hack to get me past the immediate issue and see if >> I could get the migration to work. >> >> >> > Apart from that, and I known you did not create this mess, encoding a >> > String into a String, although it can be done, is so wrong. You encode >> > a String (a collection of Characters, of Unicode code points) into a >> > ByteArray and you decode a ByteArray into a String. >> > >> > Sending #asByteArray to a String in almost always wrong, as is sending >> > #asString to a ByteArray. These are implicit conversions (null >> > conversions, like ZnNullEncoder) that only work for pure ASCII or >> > Latin1 (iso-8859-1), but not for the much richer set of Characters >> > that Pharo supports. So these will eventually fail, one day, in a >> > country far away. >> >> I hate having to deal with string encoding, and know as little as >> possible. :-) >> >> >> >> On Tue, Dec 05, 2017 at 08:59:34AM +0100, Peter Uhn??k wrote: >> > > In my case, it turned out to be a non-UTF8 encoded character in one of >> > > the >> > commit messages. >> > >> > I've ran into this problem in a sister project (tonel-migration), and do >> > not >> > have a proper resolution yet. I was forcing everything to be unicode, so >> > I need >> > a better way to read and write encoded strings. :< >> >> I think this is always going to be a tricky problem since the file may >> simply be corrupt, and the resolution won't always be the same. In my >> case I was happy to simply remove the offending characters. >> >> I do think it would be nice if the migration tool at least stopped and >> warned the user that the input is invalid. >> >> Would you be open to a PR that incorporates Sven's changes suggested >> above? >> >> >> >> On Tue, Dec 05, 2017 at 09:05:24AM +0100, Sven Van Caekenberghe wrote: >> > >> > Maybe ZnCharacterEncoder class>>#detectEncoding: can help, although it >> > is far from perfect. >> >> In this particular case I think it is better to try decoding with UTF8, >> since #detectEncoding: isn't completely reliable and we just want to >> know if it isn't valid UTF8. >> >> >> Cheers, >> Alistair >> >
