Re: [PATCH 6/9] pretty: two phase conversion for non utf-8 commits

2012-09-23 Thread Nguyen Thai Ngoc Duy
On Sun, Sep 23, 2012 at 8:54 PM, Robin Rosenberg
 wrote:
>> This of course only works with encodings that are compatible with
>> Unicode.
> Such as? Unicode was defined to encompass all knows encodings.

Just a precaution because I have never read Unicode standard (and it
keeps getting updated, hence "incomplete")

>
>> -static size_t format_commit_one(struct strbuf *sb, const char
>> *placeholder,
>> +static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
>> + const char *placeholder,
>>   void *context)
>>  {
>>   struct format_commit_context *c = context;
>
> Which parameter does the comment apply to? I believe most conventions
> nowadays include parameter documentation in the comment preceding
> the function header.

Yeah. I should have followed that.

>> b/t/t6006/commit-msg.iso8859-1
>> new file mode 100644
>> index 000..f8fe808
>> --- /dev/null
>> +++ b/t/t6006/commit-msg.iso8859-1
>> @@ -0,0 +1,5 @@
>> +Test printing of complex bodies
>> +
>> +This commit message is much longer than the others,
>> +and it will be encoded in iso8859-1. We should therefore
>> +include an iso8859 character: �bueno!
>
> "8859-1" to be exact. Only three 8859 encoding has the
> character.

Yep. But i'll probably need a closer look at t6006. It seems there's a
few "upside down exclamation" (not sure what it's called) in UTF-8 in
that test. I'll fix the text too.
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/9] pretty: two phase conversion for non utf-8 commits

2012-09-23 Thread Robin Rosenberg

A few nitpicks

- Ursprungligt meddelande -
> Always assume format_commit_item() takes an utf-8 string for
> simplicity. If commit message is in non-utf8, or output encoding is
> not, then the commit is first converted to utf-8, processed, then
> output converted to output encoding.
> 
> This of course only works with encodings that are compatible with
> Unicode.
Such as? Unicode was defined to encompass all knows encodings.

> -static size_t format_commit_one(struct strbuf *sb, const char
> *placeholder,
> +static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
> + const char *placeholder,
>   void *context)
>  {
>   struct format_commit_context *c = context;

Which parameter does the comment apply to? I believe most conventions
nowadays include parameter documentation in the comment preceding
the function header.

[...]

> b/t/t6006/commit-msg.iso8859-1
> new file mode 100644
> index 000..f8fe808
> --- /dev/null
> +++ b/t/t6006/commit-msg.iso8859-1
> @@ -0,0 +1,5 @@
> +Test printing of complex bodies
> +
> +This commit message is much longer than the others,
> +and it will be encoded in iso8859-1. We should therefore
> +include an iso8859 character: �bueno!

"8859-1" to be exact. Only three 8859 encoding has the
character.

-- robin
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/9] pretty: two phase conversion for non utf-8 commits

2012-09-23 Thread Nguyễn Thái Ngọc Duy
Always assume format_commit_item() takes an utf-8 string for
simplicity. If commit message is in non-utf8, or output encoding is
not, then the commit is first converted to utf-8, processed, then
output converted to output encoding.

This of course only works with encodings that are compatible with
Unicode.

This also fixes the iso8859-1 test. It's supposed to create an
iso8859-1 commit, but the commit content in t6006-rev-list-format.sh
is in UTF-8. Split the content out in a separate file (so its encoding
won't accidentally be converted) and convert it back to iso8859-1.

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 pretty.c | 29 +++--
 t/t6006-rev-list-format.sh   | 16 +---
 t/t6006/commit-msg.iso8859-1 |  5 +
 3 files changed, 29 insertions(+), 21 deletions(-)
 create mode 100644 t/t6006/commit-msg.iso8859-1

diff --git a/pretty.c b/pretty.c
index 45fe878..f3275a7 100644
--- a/pretty.c
+++ b/pretty.c
@@ -916,7 +916,8 @@ static size_t parse_color_placeholder(struct strbuf *sb,
return 0;
 }
 
-static size_t format_commit_one(struct strbuf *sb, const char *placeholder,
+static size_t format_commit_one(struct strbuf *sb, /* in UTF-8 */
+   const char *placeholder,
void *context)
 {
struct format_commit_context *c = context;
@@ -1121,7 +1122,8 @@ static size_t format_commit_one(struct strbuf *sb, const 
char *placeholder,
return 0;   /* unknown placeholder */
 }
 
-static size_t format_commit_item(struct strbuf *sb, const char *placeholder,
+static size_t format_commit_item(struct strbuf *sb, /* in UTF-8 */
+const char *placeholder,
 void *context)
 {
struct format_commit_context *c = context;
@@ -1205,25 +1207,32 @@ void format_commit_message(const struct commit *commit,
struct format_commit_context context;
static const char utf8[] = "UTF-8";
const char *output_enc = pretty_ctx->output_encoding;
+   char *enc;
 
memset(&context, 0, sizeof(context));
context.commit = commit;
context.pretty_ctx = pretty_ctx;
context.wrap_start = sb->len;
context.message = commit->buffer;
-   if (output_enc) {
-   char *enc = get_header(commit, "encoding");
-   if (strcmp(enc ? enc : utf8, output_enc)) {
-   context.message = logmsg_reencode(commit, output_enc);
-   if (!context.message)
-   context.message = commit->buffer;
-   }
-   free(enc);
+   enc = get_header(commit, "encoding");
+   if (enc && strcmp(utf8, enc)) {
+   context.message = reencode_string(context.message, utf8, enc);
+   if (!context.message)
+   context.message = commit->buffer;
}
+   free(enc);
 
strbuf_expand(sb, format, format_commit_item, &context);
rewrap_message_tail(sb, &context, 0, 0, 0);
 
+   if (output_enc && strcmp(utf8, output_enc)) {
+   char *out = reencode_string(sb->buf, output_enc, utf8);
+   if (out) {
+   int len = strlen(out);
+   strbuf_attach(sb, out, len, len + 1);
+   }
+   }
+
if (context.message != commit->buffer)
free(context.message);
free(context.signature.gpg_output);
diff --git a/t/t6006-rev-list-format.sh b/t/t6006-rev-list-format.sh
index f94f0c4..cd24839 100755
--- a/t/t6006-rev-list-format.sh
+++ b/t/t6006-rev-list-format.sh
@@ -124,27 +124,21 @@ commit 86c75cfd708a0e5868dc876ed5b8bb66c80b4873
 foo
 EOF
 
-cat >commit-msg <<'EOF'
-Test printing of complex bodies
-
-This commit message is much longer than the others,
-and it will be encoded in iso8859-1. We should therefore
-include an iso8859 character: ¡bueno!
-EOF
+cat "$TEST_DIRECTORY/t6006/commit-msg.iso8859-1" >commit-msg
 test_expect_success 'setup complex body' '
 git config i18n.commitencoding iso8859-1 &&
   echo change2 >foo && git commit -a -F commit-msg
 '
 
 test_format complex-encoding %e <<'EOF'
-commit f58db70b055c5718631e5c61528b28b12090cdea
+commit 1ed88da4a5b5ed8c449114ac131efc62178734c3
 iso8859-1
 commit 131a310eb913d107dd3c09a65d1651175898735d
 commit 86c75cfd708a0e5868dc876ed5b8bb66c80b4873
 EOF
 
 test_format complex-subject %s <<'EOF'
-commit f58db70b055c5718631e5c61528b28b12090cdea
+commit 1ed88da4a5b5ed8c449114ac131efc62178734c3
 Test printing of complex bodies
 commit 131a310eb913d107dd3c09a65d1651175898735d
 changed foo
@@ -153,7 +147,7 @@ added foo
 EOF
 
 test_format complex-body %b <<'EOF'
-commit f58db70b055c5718631e5c61528b28b12090cdea
+commit 1ed88da4a5b5ed8c449114ac131efc62178734c3
 This commit message is much longer than the others,
 and it will be encoded in iso8859-1. We should therefore
 include an iso8859 character: ¡bueno!
@@