On 14.01.2019 13:36, Julian Foad wrote:
> Stefan, thanks for taking account of the feedback and updating the doc string 
> in r1851197.
>
> I took a look and thought to rewrite the part about encoding and line 
> splitting like this:
>
>  * Character Encoding and Line Splitting:
>  *
>  * It is up to the client to determine the character encoding. The @a line
>  * content is delivered without any encoding conversion. The line splitting
>  * is designed to work with ASCII-compatible encodings including UTF-8. Any
>  * of the byte sequences LF ("\n"), CR ("\n"), CR LF ("\r\n") ends a line
>  * and is not included in @a line. The @a line content can include all other
>  * byte values including zero (ASCII NUL).
>
> I dropped the reference to svn_subst_stream_translated() because it wasn't 
> much use without saying what parameters it is given, and instead I was able 
> to say exactly what happens overall.
>
> Problem 1: Using this blame function on a 16-bit character encoding is still 
> really ugly: the receiver cannot know which byte sequences were stripped out.
>
> We should address this issue properly by passing a "line splitter" function 
> in to svn_client_blame6().


I started on that then decided that svn_client_blame6 is far too narrow
scope for that. In order to do this right, we have to introduce a line
splitter callback to svn_subst_stream_translated.

My idea was to do something like this:

  * If the line-splitter is NULL, use the current default and check MIME
    type for text/*
  * If it's not NULL, ignore MIME type and provide the callback with
    file props so it can use that to determine encoding if it wants to.
  * Remove the ignore_mime_type flag and use the presence of a line
    splitter to convey the same intent; consequently,
  * Expose the "default" line splitter as a public function.



> Problem 2: Then I noticed that where this splitting algorithm is used, it is 
> in the second pass over the file data, where we associate each "struct blame" 
> item with a line of text.
>
> It looks like a different algorithm may be used in the first pass, when 
> calculating the differences and constructing the blame list. It diffs all 
> repository revisions of the file using svn_diff_file_diff_2() which splits 
> lines according to the "ignore_eol_style" option from 
> svn_client_blame6(diff_options.ignore_eol_style). The effect of 
> "ignore_eol_style" is not entirely clear to me; it is used in 
> svn_diff__normalize_buffer(). In addition, if svn_client_blame6() ends up 
> processing a local WC revision of the file, that is processed through 
> svn_client__get_normalized_stream(normalize_eols=(svn:eol-style == native)) 
> before being diffed.
>
> This is horrible. Surely we should use a consistent "line splitter" 
> everywhere.


Yes, we should introduce the line splitter concept everywhere in the
diff/patch/blame group and really anywhere where we're expected to read
"lines" from files (svn_stream_readline comes to mind). This is why I
stopped working on a splitter for blame ... the problem is much larger
than that and "fixing" blame would really not help by itself.


> I would expect this means it is possible for blame output to go wrong, with 
> revision numbers assigned to the wrong lines.


This has always been possible due to the imprecise nature of diff, which
blame relies on. I wouldn't worry too much about it.


>  I have been unable to construct such a wrong output, so far, after trying 
> for half an hour or so. Possibly there is no code path that produces such a 
> wrong output. But surely this is the sort of issue that leads to problems 
> down the road.
>
> Can we re-write this properly?

As I said, this is not limited to blame. The current change in the blame
API at least makes some sense and is a localised change, that's why I
support it. It doesn't break anything that wasn't broken before.

Changing the way we handle text-like files for diff, blame and patch, on
the other hand, is quite a bit more involved and I'm afraid it'll touch
a lot of code. I wouldn't dream of rejecting svn_client_blame6 just
because it doesn't solve the larger problem.

-- Brane

Reply via email to