On 06.01.2019 17:16, Stefan Kueng wrote: > > > On 06.01.2019 15:05, Branko Čibej wrote: > >> Sorry about getting into this late, but as Julian said: >> >>> * we already have a (char*, len) wrapper, called svn_string_t, so I >>> would expect it would be neatest to use that; >> >> but the patch has: >> >>> @@ -758,6 +758,28 @@ >>> ·*·will·be·true·if·the·reason·there·is·no·blame·information·is·that·the·line >>> >>> >>> ·*·was·modified·locally.·In·all·other·cases·@a·local_change·will·be·false. >>> >>> >>> ·* >>> +·*·@since·New·in·1.12. >>> +·*/ >>> +typedef·svn_error_t·*(*svn_client_blame_receiver4_t)( >>> +··void·*baton, >>> +··svn_revnum_t·start_revnum, >>> +··svn_revnum_t·end_revnum, >>> +··apr_int64_t·line_no, >>> +··svn_revnum_t·revision, >>> +··apr_hash_t·*rev_props, >>> +··svn_revnum_t·merged_revision, >>> +··apr_hash_t·*merged_rev_props, >>> +··const·char·*merged_path, >>> +··svn_stringbuf_t·*line, >>> +··svn_boolean_t·local_change, >>> +··apr_pool_t·*pool); >> >> >> The svn_stringbuf_t struct is not appropriate here; please use >> svn_string_t. The former is used as a buffer to construct larger >> strings, that's why it contains a reference to a pool and a blocksize. >> the blame receiver gets data that are already allocated from a pool and >> should be immutable to the receiver; the appropriate declaration here is >> >> const svn_string_t *line; > > good point. > >> as you only need the data pointer and the length, and very much do _not_ >> need the pool and blocksize, nor do you want the data or length to be >> modifiable by the callback. >> >> That will make the changes in libsvn_client/blame.c a bit more complex >> since you'll have to create a local svn_string_t object (on stack) >> before calling the receiver, but it makes the receiver's interface >> cleaner. Something like this would work: > > see attached patch. > >> Other than that, I think it's a bad idea to leave the trailing null byte >> from the UTF-16-LE representation of U+000a at the beginning of the next >> "line". If we're making this change in a *public* API, we should not >> expect all users to hack around such a misfeature. > > problem is: if we just skip any null char after an lf, then we would > screw up utf16be encodings (or le? I always get those two confused).
Windows default is UTF-16-LE, at least on x86(_64) and other little-endian architectures. I'm not sure what they do on ARM but I'd be surprised if Windows doesn't put it in little-endian mode, given that decades of legacy software assume little-endian. A simple check would be: * if 0x0a is on an odd offset, and the next byte is 0x00, then it's a UTF-16-LE linefeed; * else if 0x0a is on an even offset, and the _previous_ byte is 0x00, then it's a UTF-16-BE linefeed; * otherwise just hope it's a linefeed and move on. But given your example below of mixing ASCII/UTF-8/UTF-16, it's better to just leave well enough alone. :( >> Until we have better support for other Unicode representations, we >> should detect that null byte and include it as part of the reported >> blame line. > > another situation which unfortunately is not that uncommon: files that > have different encodings in different lines or even inside the same > line. I've seen that especially in log files where the logger first > prints timestamp in ascii, but then the log-text is in utf8 ... which isn't a problem since ASCII and UTF-8 are compatible ... > or even utf16. ... but this is a rather huge problem. > Such situations I think would require too much detection logic for the > blame function and should be left to clients - they might not be able > to even show such lines so why bother trying to do the detection right > for that. You're right about that. I wouldn't dream of supporting such things within the blame callback itself. However it would still be nice to at least document what's happening. The patch itself looks OK to me now. -- Brane