Should the --encoding argument to log/show commands make any guarantees about their output?

Jan-Philip Gehrcke Mon, 15 Jun 2015 01:51:16 -0700

Hello,

I was surprised to see that the output of


    git log --encoding=utf-8 "--format=format:%b"

can contain byte sequences that are invalid in UTF-8. Note: I am using git 2.1.4, and the %b format specifier represents the commit message body.


I have seen this with the Linux git repository and the following test:

    git log --encoding=utf-8 "--format=format:%b" | python2 -c \
        'import sys; [l.decode("utf-8") for l in sys.stdin]'

Soon enough, errors like this appear:

    'utf8' codec can't decode byte 0xf6 in position 19
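
For anyone who wants to repeat the check and see which lines are affected, here is a minimal Python 3 sketch of the same test. The function name find_invalid_utf8_lines and the sample byte strings are made up for illustration; the only assumption is that the input is an iterable of raw byte lines.

```python
def find_invalid_utf8_lines(lines):
    """Return (line_number, error) pairs for byte lines that are not valid UTF-8.

    `lines` is any iterable of bytes objects, one per line.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((number, exc))
    return problems


# Illustrative sample data (not taken from the Linux repository):
# the second line is valid UTF-8, the third is the same text in Latin-1.
sample = [b"plain ascii\n", b"Gr\xc3\xbc\xc3\x9fe\n", b"Gr\xf6\xdfe\n"]

for number, exc in find_invalid_utf8_lines(sample):
    print(f"line {number}: {exc.reason} at byte offset {exc.start}")
```

To run it against real output, pipe the git log command shown above into a script that passes sys.stdin.buffer (the raw byte stream) to the function instead of the sample list.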

The help message to the --encoding argument reads:

The commit objects record the encoding used for the log message in
their encoding header; this option can be used to tell the command to
re-code the commit log message in the encoding preferred by the user

I realize that this message does not give any guarantee about the output of the command, in the sense that --encoding=utf-8 produces valid UTF-8 data in all cases.

However, I wonder what --encoding precisely does and if it has the behavior most users would expect.


Let me describe what I think it currently does:

The program attempts to re-code a log message, so it follows the chain

        raw input -> unicode -> raw output

For the first step, knowledge about the input encoding is required. This is retrieved from the encoding header of the commit object if present, or (from the docs) "lack of this header implies that the commit log message is encoded in UTF-8." If this step fails (if the entry contains a byte sequence that is invalid in the specified/assumed input codec), the procedure is aborted and the data is dumped as is (obviously without applying the requested output encoding).
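
To make the guess concrete, the procedure described above can be modelled in a few lines of Python 3. This is only a sketch of the behavior I suspect, not git's actual implementation; the function name recode_as_described and its parameters are hypothetical.

```python
def recode_as_described(raw, declared_encoding=None, output_encoding="utf-8"):
    """Hypothetical model of the suspected re-coding behavior (not git's code).

    raw               -- the commit message as bytes
    declared_encoding -- value of the commit's encoding header, or None if absent
    output_encoding   -- the codec requested via --encoding
    """
    # A missing encoding header implies that the message is UTF-8.
    source_encoding = declared_encoding or "utf-8"
    try:
        # Step 1: raw input -> unicode
        text = raw.decode(source_encoding)
    except UnicodeDecodeError:
        # Step 1 failed: give up and dump the bytes unchanged,
        # without applying the requested output encoding.
        return raw
    # Step 2: unicode -> raw output
    return text.encode(output_encoding)


# A Latin-1 message with a correct encoding header is re-coded to UTF-8.
with_header = recode_as_described(b"Sch\xf6n", declared_encoding="latin-1")

# The same bytes without a header are assumed to be UTF-8, fail to decode,
# and come out unchanged, so the "UTF-8" output contains the byte 0xf6.
without_header = recode_as_described(b"Sch\xf6n")
```

Under this model, any commit whose message is not valid in its declared (or assumed) encoding leaks its original bytes into the output, which would match the decode error shown earlier.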


Is that correct?

From my point of view the most natural abstraction of a log *message* is *text*, not bytes. The same is true for author names. If I want to build a tool chain on top of log/show, this usually means that I want to work with text information. Hence, I want to retrieve text (a sequence of code points) from git show/log. Text must be transported in encoded form, sure, but it must not contain byte sequences that are invalid in this codec. Because otherwise it's just not text anymore.

Hence, from my point of view, the rationale that git show/log should be able to output *text* information means that they should not emit byte sequences that are invalid in the codec specified via the --encoding argument. In the current situation, the work of dealing with invalid byte sequences is just outsourced to software further down the tool chain (at some point a replacement character � should be displayed to the user instead of the invalid raw bytes).
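
As an illustration of what a downstream tool currently has to do on its own, here is a minimal Python 3 sketch that turns the raw output into text by substituting U+FFFD for every invalid byte sequence. The function name decode_log_output is hypothetical; errors="replace" is the standard error handler of Python's bytes.decode.

```python
def decode_log_output(raw, encoding="utf-8"):
    """Decode bytes received from git log/show into text.

    Byte sequences that are invalid in `encoding` become U+FFFD
    (the replacement character) instead of raising an exception.
    """
    return raw.decode(encoding, errors="replace")


# The byte 0xf6 is not valid UTF-8 here, so it is replaced.
text = decode_log_output(b"Sch\xf6n")

# Encoding the result again always yields valid UTF-8.
clean_bytes = text.encode("utf-8")
```

The information about which bytes were originally there is lost at this point, which is one reason a consumer might prefer git itself to either guarantee valid output or document clearly that it does not.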

I am not entirely sure where this discussion should lead. However, I think that if the behavior of the software is not going to change, then the documentation for the --encoding option should be more precise and clarify what actually happens behind the scenes. What do you think?



Cheers,


Jan-Philip Gehrcke

