Re: extending the blame callback

Branko Čibej Sun, 06 Jan 2019 10:37:19 -0800

On 06.01.2019 17:16, Stefan Kueng wrote:
>
>
> On 06.01.2019 15:05, Branko Čibej wrote:
>
>> Sorry about getting into this late, but as Julian said:
>>
>>> * we already have a (char*, len) wrapper, called svn_string_t, so I
>>> would expect it would be neatest to use that;
>>
>> but the patch has:
>>
>>> @@ -758,6 +758,28 @@
>>> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line
>>>
>>>
>>> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false.
>>>
>>>
>>> ·*
>>> +·*·@since·New·in·1.12.
>>> +·*/
>>> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)(
>>> +··void·*baton,
>>> +··svn_revnum_t·start_revnum,
>>> +··svn_revnum_t·end_revnum,
>>> +··apr_int64_t·line_no,
>>> +··svn_revnum_t·revision,
>>> +··apr_hash_t·*rev_props,
>>> +··svn_revnum_t·merged_revision,
>>> +··apr_hash_t·*merged_rev_props,
>>> +··const·char·*merged_path,
>>> +··svn_stringbuf_t·*line,
>>> +··svn_boolean_t·local_change,
>>> +··apr_pool_t·*pool);
>>
>>
>> The svn_stringbuf_t struct is not appropriate here; please use
>> svn_string_t. The former is used as a buffer to construct larger
>> strings, that's why it contains a reference to a pool and a blocksize.
>> the blame receiver gets data that are already allocated from a pool and
>> should be immutable to the receiver; the appropriate declaration here is
>>
>>      const svn_string_t *line;
>
> good point.
>
>> as you only need the data pointer and the length, and very much do _not_
>> need the pool and blocksize, nor do you want the data or length to be
>> modifiable by the callback.
>>
>> That will make the changes in libsvn_client/blame.c a bit more complex
>> since you'll have to create a local svn_string_t object (on stack)
>> before calling the receiver, but it makes the receiver's interface
>> cleaner. Something like this would work:
>
> see attached patch.
>
>> Other than that, I think it's a bad idea to leave the trailing null byte
>> from the UTF-16-LE representation of U+000a at the beginning of the next
>> "line". If we're making this change in a *public* API, we should not
>> expect all users to hack around such a misfeature.
>
> problem is: if we just skip any null char after an lf, then we would
> screw up utf16be encodings (or le? I always get those two confused).



Windows default is UTF-16-LE, at least on x86(_64) and other
little-endian architectures. I'm not sure what they do on ARM but I'd be
surprised if Windows doesn't put it in little-endian mode, given that
decades of legacy software assume little-endian.

A simple check would be:

  * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a
    UTF-16-LE linefeed;
  * else if 0x0a is on an even offset, and the _previous_ byte is 0x00,
    then it's a UTF-16-BE linefeed;
  * otherwise just hope it's a linefeed and move on.


But given your example below of mixing ASCII/UTF-8/UTF-16, it's better
to just leave well enough alone. :(

>> Until we have better support for other Unicode representations, we
>> should detect that null byte and include it as part of the reported
>> blame line.
>
> another situation which unfortunately is not that uncommon: files that
> have different encodings in different lines or even inside the same
> line. I've seen that especially in log files where the logger first
> prints timestamp in ascii, but then the log-text is in utf8

... which isn't a problem since ASCII and UTF-8 are compatible ...

> or even utf16. 

... but this is a rather huge problem.

> Such situations I think would require too much detection logic for the
> blame function and should be left to clients - they might not be able
> to even show such lines so why bother trying to do the detection right
> for that.


You're right about that. I wouldn't dream of supporting such things
within the blame callback itself. However it would still be nice to at
least document what's happening.

The patch itself looks OK to me now.

-- Brane

Re: extending the blame callback

Reply via email to