Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-10 Thread Jakub Narębski
On 09.04.2013 at 23:59, Jürgen Kreileder wrote:
 Jakub Narębski jna...@gmail.com writes:
 
 On 08.04.2013, Junio C Hamano wrote:
 j...@blackdown.de (Jürgen Kreileder) writes:

 Fixes the encoding for several _plain actions and for text/* and */*+xml 
 blobs. 

 Signed-off-by: Jürgen Kreileder j...@blackdown.de

 I see that this patch does (or tries to do) two _independent_ things,
 and should be split into at least two separate commits:
[...]
 First, it extends adding ; charset=$default_text_plain_charset to
 other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
 mimetypes (without changing name of variable... though this is more
 complicated as it is configuration variable and we would want to
 preserve backward compatibility, but at least a comment would be,
 I think, needed).
  
 Originally it applied only to 'text/plain' files, which can be
 displayed inline by web browser, and which need charset in
 'Content-Type:' HTTP header to be displayed correctly, as they
 do not include such information inside the file.
 
 What prompted the change is that some browsers (Chrome and Safari) also
 display other file types inline: text/x-chdr, text/x-java, text/x-objc,
 text/x-sh, ...
 
 At least on my system these files do get served with charset set!
 (ISO-8859-1 even though Apache has AddDefaultCharset UTF-8...)
[...]

I think this (the reason behind this change) should be explained in the
commit message.

 'text/html' and 'application/xhtml+xml' can include such information
 inside them (meta http-equiv for 'text/html' and ?xml ... for
 'application/xhtml+xml').  I don't know what browser does when there
 is conflicting information about charset, i.e. which one wins, but
 it is certainly something to consider.
 
 As shown above, even without my patch, this already can happen!
 
 It might be a good change; I don't know what web browser do when
 serving 'text/css', 'text/javascript', 'text/xml' to client to
 view without media type known.

There are few options on how to handle this.

First, there is an issue of $default_text_plain_charset configuration
variable.  Any changes to its use (or adding new configuration
variables) should be accompanied with changing / adding comment near the
place where the default value is set, and changing / adding to
the documentation, namely gitweb.conf.txt

We can either:
1.) Use $default_text_plain_charset variable both for 'text/plain'
and 'text/*' etc. content types in 'blob_plain' (aka 'raw') view
for backwards compatibility, just add comment that it applies more
than 'text/plain'

2.) Add new configuration variable, e.g. $default_blob_plain_charset
and use it in place of $default_text_plain_charset as above,
initializing it to $default_text_plain_charset.

This practically renames $default_text_plain_charset preserving
backward compatibility.  Documentation would talk now about new
variable name.

3.) Add new configuration variable, e.g. $default_text_other_charset,
or $default_blob_plain_charset, or $default_inline_charset, etc.
which will be used for files other than 'text/plain'.  This would
make it possible to turn this new feature on and off.

3.a.) New feature default to on, i.e. to $default_text_plain_charset
3.b.) New feature default to off, i.e. undef


Second, there is an issue of file types, like HTML or XML (or XHTML,
or sometimes for LaTeX), which have the information about encoding
embedded in file.

We can:

A.) Skip this issue, and always add charset for 'text/*' and
'*/*-xml' files

B.) Threat those few media types in special way, excluding them from
adding charset

C.) Make which types to be added charset configurable.


Anyway I think this feature is much less urgent.  It fixes an annoyance
rather than bug, as you can always choose which charset to use to
display content in a browser.  We can take time to implement it well,
especially that it interacts with config (and backward compatibility
of config variables).

 BTW I have noticed that we do $prevent_xss dance outside
 blob_contenttype(), in it's only caller i.e. git_blob_plain()...
 which for example means that 'text/html' converted to 'text/plain'
 don't get '; charset=...' added.  I guess that it *might* be
 what prompted this part of change... but if it is so, it needs
 to be fixed at source, e.g. by moving $prevent_xss to
 blob_contenttype() subroutine.
 
 Jep.

But I see this is yet another separate issue.

 Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
 that should be abandoned in favor of 'patch' view; but that is
 a separate issue) and 'patch' views so they use binary-safe output.

 Note that in all cases (I think) we use

$cgi-header(
  -type = 'text/plain',
  -charset = 'utf-8',
  ...
);

 promising web browser that output is as whole in 'utf-8' encoding.
 
 Yes.
 
 It is not explained in the commit message what is the reason for this
 change.  Is it better handing of a situation 

Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-09 Thread Jakub Narębski
On 08.04.2013, Junio C Hamano wrote:
 j...@blackdown.de (Jürgen Kreileder) writes:
 
 Fixes the encoding for several _plain actions and for text/* and */*+xml 
 blobs. 

 Signed-off-by: Jürgen Kreileder j...@blackdown.de

I see that this patch does (or tries to do) two _independent_ things,
and should be split into at least two separate commits:

 ---
 
 Thanks, will queue but not hold until I hear something from Jakub.
 
  gitweb/gitweb.perl |8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index 1309196..9cfe5b5 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -3823,7 +3823,7 @@ sub blob_contenttype {
  my ($fd, $file_name, $type) = @_;
  
  $type ||= blob_mimetype($fd, $file_name);
 -if ($type eq 'text/plain'  defined $default_text_plain_charset) {
 +if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
 m!^\w[-\w]*/\w[-\w]*\+xml$!)  defined $default_text_plain_charset) {
  $type .= ; charset=$default_text_plain_charset;
  }

First, it extends adding ; charset=$default_text_plain_charset to
other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
mimetypes (without changing name of variable... though this is more
complicated as it is configuration variable and we would want to
preserve backward compatibility, but at least a comment would be,
I think, needed).
 
Originally it applied only to 'text/plain' files, which can be
displayed inline by web browser, and which need charset in
'Content-Type:' HTTP header to be displayed correctly, as they
do not include such information inside the file.

'text/html' and 'application/xhtml+xml' can include such information
inside them (meta http-equiv for 'text/html' and ?xml ... for
'application/xhtml+xml').  I don't know what browser does when there
is conflicting information about charset, i.e. which one wins, but
it is certainly something to consider.

It might be a good change; I don't know what web browser do when
serving 'text/css', 'text/javascript', 'text/xml' to client to
view without media type known.


BTW I have noticed that we do $prevent_xss dance outside
blob_contenttype(), in it's only caller i.e. git_blob_plain()...
which for example means that 'text/html' converted to 'text/plain'
don't get '; charset=...' added.  I guess that it *might* be
what prompted this part of change... but if it is so, it needs
to be fixed at source, e.g. by moving $prevent_xss to
blob_contenttype() subroutine.


About implementation:

 +if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
 m!^\w[-\w]*/\w[-\w]*\+xml$!)  defined $default_text_plain_charset) {
  $type .= ; charset=$default_text_plain_charset;

would be better with line split

 +if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
 m!^\w[-\w]*/\w[-\w]*\+xml$!) 
 +defined $default_text_plain_charset) {
  $type .= ; charset=$default_text_plain_charset;

Also: do we need to be strict with '\w[-\w]*', or would '.*' suffice?

 @@ -7637,7 +7637,9 @@ sub git_blobdiff {
  last if $line =~ m!^\+\+\+!;
  }
  local $/ = undef;
 +binmode STDOUT, ':raw';
  print $fd;
 +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
  close $fd;
  }
  }
 @@ -7884,12 +7886,16 @@ sub git_commitdiff {
  
  } elsif ($format eq 'plain') {
  local $/ = undef;
 +binmode STDOUT, ':raw';
  print $fd;
 +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
  close $fd
  or print Reading git-diff-tree failed\n;
  } elsif ($format eq 'patch') {
  local $/ = undef;
 +binmode STDOUT, ':raw';
  print $fd;
 +binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
  close $fd
  or print Reading git-format-patch failed\n;
  }

Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
that should be abandoned in favor of 'patch' view; but that is
a separate issue) and 'patch' views so they use binary-safe output.

Note that in all cases (I think) we use

   $cgi-header(
 -type = 'text/plain',
 -charset = 'utf-8',
 ...
   );

promising web browser that output is as whole in 'utf-8' encoding.

It is not explained in the commit message what is the reason for this
change.  Is it better handing of a situation where files being diff-ed
being in other encoding (like for example in commit that changes
encoding of files from legacy encoding such like e.g. iso-8859-2
or cp1250 to utf-8)?  But what about  -charset = 'utf-8'  then?


About implementation: I think after this change common code crosses
threshold for refactoring it into subroutine, for example:

  sub dump_fh_raw {
my $fh = shift;

local $/ = undef;
binmode STDOUT, ':raw';
print $fh;
binmode STDOUT, 

Re: [PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-09 Thread Jürgen Kreileder
Jakub Narębski jna...@gmail.com writes:

 On 08.04.2013, Junio C Hamano wrote:
 j...@blackdown.de (Jürgen Kreileder) writes:
 
 Fixes the encoding for several _plain actions and for text/* and */*+xml 
 blobs. 

 Signed-off-by: Jürgen Kreileder j...@blackdown.de

 I see that this patch does (or tries to do) two _independent_ things,
 and should be split into at least two separate commits:

Agreed.

 ---
 
 Thanks, will queue but not hold until I hear something from Jakub.
 
  gitweb/gitweb.perl |8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

 diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
 index 1309196..9cfe5b5 100755
 --- a/gitweb/gitweb.perl
 +++ b/gitweb/gitweb.perl
 @@ -3823,7 +3823,7 @@ sub blob_contenttype {
 my ($fd, $file_name, $type) = @_;
  
 $type ||= blob_mimetype($fd, $file_name);
 -   if ($type eq 'text/plain'  defined $default_text_plain_charset) {
 +   if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
 m!^\w[-\w]*/\w[-\w]*\+xml$!)  defined $default_text_plain_charset) {
 $type .= ; charset=$default_text_plain_charset;
 }

 First, it extends adding ; charset=$default_text_plain_charset to
 other mimetypes for 'blob_plain' view to all 'text/*' and '*/*+xml'
 mimetypes (without changing name of variable... though this is more
 complicated as it is configuration variable and we would want to
 preserve backward compatibility, but at least a comment would be,
 I think, needed).
  
 Originally it applied only to 'text/plain' files, which can be
 displayed inline by web browser, and which need charset in
 'Content-Type:' HTTP header to be displayed correctly, as they
 do not include such information inside the file.

What prompted the change is that some browsers (Chrome and Safari) also
display other file types inline: text/x-chdr, text/x-java, text/x-objc,
text/x-sh, ...

At least on my system these files do get served with charset set!
(ISO-8859-1 even though Apache has AddDefaultCharset UTF-8...)

$ curl -I 
https://git.blackdown.de/old.cgi?p=contactalbum.git;a=blob_plain;f=Classes/ContactAlbum.h;hb=HEAD;
HTTP/1.1 200 OK
Date: Tue, 09 Apr 2013 21:47:30 GMT
Server: Apache
Content-disposition: inline; filename=Classes/ContactAlbum.h
X-Frame-Options: SAMEORIGIN
Vary: User-Agent
X-UA-Compatible: IE=edge
Strict-Transport-Security: max-age=31536000
Content-Type: text/x-chdr; charset=ISO-8859-1

 'text/html' and 'application/xhtml+xml' can include such information
 inside them (meta http-equiv for 'text/html' and ?xml ... for
 'application/xhtml+xml').  I don't know what browser does when there
 is conflicting information about charset, i.e. which one wins, but
 it is certainly something to consider.

As shown above, even without my patch, this already can happen!

 It might be a good change; I don't know what web browser do when
 serving 'text/css', 'text/javascript', 'text/xml' to client to
 view without media type known.


 BTW I have noticed that we do $prevent_xss dance outside
 blob_contenttype(), in it's only caller i.e. git_blob_plain()...
 which for example means that 'text/html' converted to 'text/plain'
 don't get '; charset=...' added.  I guess that it *might* be
 what prompted this part of change... but if it is so, it needs
 to be fixed at source, e.g. by moving $prevent_xss to
 blob_contenttype() subroutine.

Jep.

[...]

 Second it changes 'blobdiff_plain', 'commitdiff_plain' (which I think
 that should be abandoned in favor of 'patch' view; but that is
 a separate issue) and 'patch' views so they use binary-safe output.

 Note that in all cases (I think) we use

$cgi-header(
  -type = 'text/plain',
  -charset = 'utf-8',
  ...
);

 promising web browser that output is as whole in 'utf-8' encoding.

Yes.

 It is not explained in the commit message what is the reason for this
 change.  Is it better handing of a situation where files being diff-ed
 being in other encoding (like for example in commit that changes
 encoding of files from legacy encoding such like e.g. iso-8859-2
 or cp1250 to utf-8)?

I do see encoding problems when comparing utf8 to utf8 files (i.e. no
encoding change).  For instance:
https://git.blackdown.de/old.cgi?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

I don't claim to be an expert in Perl's utf8 handling but I guess this
because Perl's internal utf8 form differs from the normal utf8 out from
git commands.  Switching to :raw fixes that: We write plain utf8 (and,
as noted above, charset is set to utf8 already)

With the patch: 
https://git.blackdown.de/?p=contactalbum.git;a=commitdiff_plain;h=cc4eaa64c2b399dd9bdbf1f67f6d621aa24df5f8

 But what about -charset = 'utf-8'  then?

That's a good question.  I think I never tried git with anything besides
ISO-8859-1 (rarely), UTF8 (mostly), and UTF16 (some Xcode files).
(UTF16 definitely causes problems for gitweb.)

 About implementation: I think after this change common code 

[PATCH 1/4] gitweb: Fix utf8 encoding for blob_plain, blobdiff_plain, commitdiff_plain, and patch

2013-04-08 Thread Jürgen Kreileder
Fixes the encoding for several _plain actions and for text/* and */*+xml blobs. 

Signed-off-by: Jürgen Kreileder j...@blackdown.de
---
 gitweb/gitweb.perl |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 1309196..9cfe5b5 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -3823,7 +3823,7 @@ sub blob_contenttype {
my ($fd, $file_name, $type) = @_;
 
$type ||= blob_mimetype($fd, $file_name);
-   if ($type eq 'text/plain'  defined $default_text_plain_charset) {
+   if (($type =~ m!^text/\w[-\w]*$! || $type =~ 
m!^\w[-\w]*/\w[-\w]*\+xml$!)  defined $default_text_plain_charset) {
$type .= ; charset=$default_text_plain_charset;
}
 
@@ -7637,7 +7637,9 @@ sub git_blobdiff {
last if $line =~ m!^\+\+\+!;
}
local $/ = undef;
+   binmode STDOUT, ':raw';
print $fd;
+   binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
close $fd;
}
 }
@@ -7884,12 +7886,16 @@ sub git_commitdiff {
 
} elsif ($format eq 'plain') {
local $/ = undef;
+   binmode STDOUT, ':raw';
print $fd;
+   binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
close $fd
or print Reading git-diff-tree failed\n;
} elsif ($format eq 'patch') {
local $/ = undef;
+   binmode STDOUT, ':raw';
print $fd;
+   binmode STDOUT, ':utf8'; # as set at the beginning of gitweb.cgi
close $fd
or print Reading git-format-patch failed\n;
}
-- 
1.7.10.4
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html