Hi Peter,

On 5 December 2017 at 11:55, Peter Uhnák <[email protected]> wrote:
>> Would you be open to a PR that incorporates Sven's changes suggested
>> above?
>
> Certainly. One thing to keep in mind is that I had to modify
> MemoryStore/MemoryHandle so it can process unicode at all, but it also
> forces everything to be unicode.

This is probably a good thing.  The fast-import documentation says
that it is strict about character encoding and only accepts UTF-8.

I'll let you know when the PRs are ready.

Thanks,
Alistair


>> I hate having to deal with string encoding, and know as little as
>> possible. :-)
>
> Apparently me taking this approach resulted in the current situation. :-D
>
> Peter
>
> On Tue, Dec 5, 2017 at 10:27 AM, Alistair Grant <[email protected]>
> wrote:
>>
>> On Tue, Dec 05, 2017 at 08:56:53AM +0100, Sven Van Caekenberghe wrote:
>> >
>> > > On 5 Dec 2017, at 08:34, Alistair Grant <[email protected]> wrote:
>> > >
>> > > On 5 December 2017 at 03:41, Martin Dias <[email protected]> wrote:
>> > >> I suspect it's related to the large number of commits in my repo.
>> > >> I made some tweaks and succeeded in creating the fast-import file.
>> > >> But I get:
>> > >>
>> > >> fatal: Unsupported command: .
>> > >> fast-import: dumping crash report to .git/fast_import_crash_10301
>> > >>
>> > >> Do you recognize this error?
>> > >> I will check my changes tweaking the git-migration tool to see if
>> > >> I modified some behavior by mistake...
>> > >
>> > > I had the same error just last night.
>> > >
>> > > In my case, it turned out to be a non-UTF8 encoded character in one of
>> > > the commit messages.
>> > >
>> > > I tracked it down by looking at the crash report and searching for a
>> > > nearby command.  I've deleted the crash reports now, but I think it
>> > > was the number associated with a mark command that got me near the
>> > > problem character in the fast-import file.
>> > >
>> > > I also modified the code to halt whenever it found a non-UTF8
>> > > character.  I'm sure there are better ways to do this, but:
>> > >
>> > >
>> > > GitMigrationCommitInfo>>inlineDataFor: aString
>> > >
>> > >     | theString |
>> > >     theString := aString.
>> > >     "Ensure the message has valid UTF-8 encoding
>> > >     (this will raise an error if it doesn't)"
>> > >     [ (ZnCharacterEncoder newForEncoding: 'utf8') decodeBytes: aString asByteArray ]
>> > >         on: Error
>> > >         do: [ :ex |
>> > >             self halt: 'Illegal string encoding'.
>> > >             theString := aString select: [ :each | each asciiValue between: 32 and: 128 ] ].
>> > >     ^ 'data ' , theString size asString , String cr
>> > >         , (theString ifEmpty: [ '' ] ifNotEmpty: [ theString , String cr ])
>> >
>> > There is also ByteArray>>#utf8Decoded (as well as
>> > String>>#utf8Encoded), both using the newer ZnCharacterEncoders. So
>> > you could write it shorter as:
>> >
>> >   aString asByteArray utf8Decoded
>> >
>> > Instead of Error you could use the more intention revealing
>> > ZnCharacterEncodingError.
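
With Sven's two suggestions applied, the validation part of the method above could shrink to something like this sketch (it uses only the selectors named in this thread, #utf8Decoded and ZnCharacterEncodingError, and keeps the original halt-on-error behaviour):

```smalltalk
"Sketch only: the shorter validity check from Sven's suggestions."
[ aString asByteArray utf8Decoded ]
	on: ZnCharacterEncodingError
	do: [ :ex | self halt: 'Illegal string encoding' ]
```

One caveat the thread doesn't touch: fast-import's `data <count>` header expects a byte count, so counting Characters with #size only matches the byte length for pure-ASCII content.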
>>
>> I agree, this was a hack to get me past the immediate issue and see if
>> I could get the migration to work.
>>
>>
>> > Apart from that, and I known you did not create this mess, encoding a
>> > String into a String, although it can be done, is so wrong. You encode
>> > a String (a collection of Characters, of Unicode code points) into a
>> > ByteArray and you decode a ByteArray into a String.
>> >
>> > Sending #asByteArray to a String is almost always wrong, as is sending
>> > #asString to a ByteArray. These are implicit conversions (null
>> > conversions, like ZnNullEncoder) that only work for pure ASCII or
>> > Latin1 (iso-8859-1), but not for the much richer set of Characters
>> > that Pharo supports. So these will eventually fail, one day, in a
>> > country far away.
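
The round trip Sven describes (using the #utf8Encoded / #utf8Decoded selectors he mentions above) looks like this in a workspace:

```smalltalk
"Encoding goes String -> ByteArray; decoding goes ByteArray -> String."
| bytes |
bytes := 'Uhnák' utf8Encoded.   "ByteArray of UTF-8 bytes: 6 bytes for 5 Characters"
(bytes utf8Decoded = 'Uhnák')
	ifFalse: [ self error: 'round trip failed' ].
"By contrast, 'Uhnák' asByteArray is a null conversion: it only
preserves information for pure ASCII / Latin-1 strings."
```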
>>
>> I hate having to deal with string encoding, and know as little as
>> possible. :-)
>>
>>
>>
>> On Tue, Dec 05, 2017 at 08:59:34AM +0100, Peter Uhnák wrote:
>> > > In my case, it turned out to be a non-UTF8 encoded character in one
>> > > of the commit messages.
>> >
>> > I've run into this problem in a sister project (tonel-migration), and do
>> > not
>> > have a proper resolution yet. I was forcing everything to be unicode, so
>> > I need
>> > a better way to read and write encoded strings. :<
>>
>> I think this is always going to be a tricky problem since the file may
>> simply be corrupt, and the resolution won't always be the same.  In my
>> case I was happy to simply remove the offending characters.
>>
>> I do think it would be nice if the migration tool at least stopped and
>> warned the user that the input is invalid.
>>
>> Would you be open to a PR that incorporates Sven's changes suggested
>> above?
>>
>>
>>
>> On Tue, Dec 05, 2017 at 09:05:24AM +0100, Sven Van Caekenberghe wrote:
>> >
>> > Maybe ZnCharacterEncoder class>>#detectEncoding: can help, although it
>> > is far from perfect.
>>
>> In this particular case I think it is better to try decoding with UTF8,
>> since #detectEncoding: isn't completely reliable and we just want to
>> know if it isn't valid UTF8.
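
A small predicate along those lines could look like this sketch (the selector name isValidUtf8: is hypothetical; ZnCharacterEncodingError is the exception class named earlier in the thread):

```smalltalk
"Sketch: answer true when the bytes form valid UTF-8, false otherwise."
isValidUtf8: aByteArray
	^ [ aByteArray utf8Decoded. true ]
		on: ZnCharacterEncodingError
		do: [ :ex | false ]
```

Unlike #detectEncoding:, which guesses among many encodings, this asks only the narrow question the migration needs answered.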
>>
>>
>> Cheers,
>> Alistair
>>
>