Re: [PATCH v3 2/2] gitweb: use highlight's shebang detection
On Sat, Sep 24, 2016, at 09:21 AM, Jakub Narębski wrote: > W dniu 24.09.2016 o 00:15, Jakub Narębski pisze: > > Sidenote: this way of benchmarking of gitweb falls between two ways of > doing a benchmark. > > The first method is to simply run gitweb as a standalone script, passing > its parameters in CGI environment variables; just like the test suite > does it. You would 'time' / 'times' it a few times, drop outliers, and > take average or a median. With this method you don't even need to set > up a web server. > > The second is to use a specialized program to benchmark the server-side > of a web page, for example 'ab' (ApacheBench), httperf, curl-loader > or JMeter. The first one is usually distributed together with Apache > web server, so you probably have it installed already. Those tools > provide timing statistics. Good to know. Thanks.
Re: [PATCH v3 2/2] gitweb: use highlight's shebang detection
On Fri, Sep 23, 2016, at 03:15 PM, Jakub Narębski wrote: > W dniu 23.09.2016 o 11:08, Ian Kelling napisał: > > > The "highlight" binary can, in some cases, determine the language type > > by the means of file contents, for example the shebang in the first line > > for some scripting languages. Make use of this autodetection for files > > which syntax is not known by gitweb. In that case, pass the blob > > contents to "highlight --force"; the parameter is needed to make it > > always generate HTML output (which includes HTML-escaping). > > Right. > > > > > Although we now run highlight on files which do not end up highlighted, > > performance is virtually unaffected because when we call highlight, we > > also call sanitize() instead of esc_html(), which is significantly > > slower. > > This paragraph is a bit unclear, for example it is not obvious what > "..., which is significantly slower" refers to: sanitize() or esc_html(). > > I think it would be better to write: > > Although we now run highlight on files which do not end up highlighted, > performance is virtually unaffected because when we call highlight, it > is used for escaping HTML. In the case that highlight is used, gitweb > calls sanitize() instead of esc_html(), and the latter is significantly > slower (it does more, being roughly a superset of sanitize()). Agree. Done in v4. > > >After curling blob view of unhighlighted large and small text > > files of perl code and license text 100 times each on a local > > Apache/2.4.23 (Debian) instance, it's logs indicate +-1% difference in > > request time for all file types. > > Also, "curling" is not the word I would like to see. I would say: > > Simple benchmark comparing performance of 'blob' view of files without > syntax highlighting in gitweb before and after this change indicates > ±1% difference in request time for all file types. Benchmark was > performed on local instance on Debian, using Apache/2.4.23 web server > and CGI/PSGI/FCGI/mod_perl. > > ^^--- select one > > Or something like that; I'm not sure how detailed this should be. > But it is nice to have such benchmark in the commit message. Sounds good. Used it in v4. > > Anyway I think that adding yet another configuration toggle for selecting > whether to use "highlight" syntax autodetection or not would be just an > unnecessary complication. > > Note that the performance loss might be quite higher on MS Windows, with > its higher cost of fork. But then they probably do not configure > server-side highligher anyway. > > > > > Document the feature and improve syntax highlight documentation, add > > test to ensure gitweb doesn't crash when language detection is used. > > Good. > > > > > Signed-off-by: Ian Kelling > > --- > > Documentation/gitweb.conf.txt | 21 ++--- > > gitweb/gitweb.perl | 10 +- > > t/t9500-gitweb-standalone-no-errors.sh | 8 > > 3 files changed, 27 insertions(+), 12 deletions(-) > > > > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt > > index a79e350..e632089 100644 > > --- a/Documentation/gitweb.conf.txt > > +++ b/Documentation/gitweb.conf.txt > > @@ -246,13 +246,20 @@ $highlight_bin:: > > We should probably say what does it mean to be "highlight"[1] compatible, > but it is outside of scope for this patch, and I think also out of scope > of this series. > > > Note that 'highlight' feature must be set for gitweb to actually > > use syntax highlighting. > > + > > -*NOTE*: if you want to add support for new file type (supported by > > -"highlight" but not used by gitweb), you need to modify `%highlight_ext` > > -or `%highlight_basename`, depending on whether you detect type of file > > -based on extension (for example "sh") or on its basename (for example > > -"Makefile"). The keys of these hashes are extension and basename, > > -respectively, and value for given key is name of syntax to be passed via > > -`--syntax ` to highlighter. > > +*NOTE*: for a file to be highlighted, its syntax type must be detected > > +and that syntax must be supported by "highlight". The default syntax > > +detection is minimal, and there are many supported syntax types with no > > +detection by default. There are three options for adding syntax > > +detection. The first and second priority are `%highlight_basename` and > > +`%highlight_ext`, which detect based on basename (the full filename, for > > +example "Makefile") and extension (for example "sh"). The keys of these > > +hashes are the basename and extension, respectively, and the value for a > > +given key is the name of the syntax to be passed via `--syntax ` > > +to "highlight". The last priority is the "highlight" configuration of > > +`Shebang` regular expressions to detect the language based on the first > > +line in the file, (for example, matching the line "#!/bin/bash"). See > > +the highlight documentation a
Re: [PATCH v3 2/2] gitweb: use highlight's shebang detection
Jakub Narębski writes: >> Also, "curling" is not the word I would like to see. I would say: >> >> Simple benchmark comparing performance of 'blob' view of files without >> syntax highlighting in gitweb before and after this change indicates >> ±1% difference in request time for all file types. Benchmark was >> performed on local instance on Debian, using Apache/2.4.23 web server >> and CGI/PSGI/FCGI/mod_perl. >> >> ^^--- select one or state that all of them produced similar results ;-) >> Or something like that; I'm not sure how detailed this should be. >> But it is nice to have such benchmark in the commit message. > > Sidenote: this way of benchmarking of gitweb falls between two ways of > doing a benchmark. All good comments. Thanks.
Re: [PATCH v3 2/2] gitweb: use highlight's shebang detection
W dniu 24.09.2016 o 00:15, Jakub Narębski pisze: > W dniu 23.09.2016 o 11:08, Ian Kelling napisał: >>After curling blob view of unhighlighted large and small text >> files of perl code and license text 100 times each on a local >> Apache/2.4.23 (Debian) instance, it's logs indicate +-1% difference in >> request time for all file types. > > Also, "curling" is not the word I would like to see. I would say: > > Simple benchmark comparing performance of 'blob' view of files without > syntax highlighting in gitweb before and after this change indicates > ±1% difference in request time for all file types. Benchmark was > performed on local instance on Debian, using Apache/2.4.23 web server > and CGI/PSGI/FCGI/mod_perl. > > ^^--- select one > > Or something like that; I'm not sure how detailed this should be. > But it is nice to have such benchmark in the commit message. Sidenote: this way of benchmarking of gitweb falls between two ways of doing a benchmark. The first method is to simply run gitweb as a standalone script, passing its parameters in CGI environment variables; just like the test suite does it. You would 'time' / 'times' it a few times, drop outliers, and take average or a median. With this method you don't even need to set up a web server. The second is to use a specialized program to benchmark the server-side of a web page, for example 'ab' (ApacheBench), httperf, curl-loader or JMeter. The first one is usually distributed together with Apache web server, so you probably have it installed already. Those tools provide timing statistics. [...] > Note that the performance loss might be quite higher on MS Windows, with > its higher cost of fork. But then they probably do not configure > server-side highligher anyway.
Re: [PATCH v3 2/2] gitweb: use highlight's shebang detection
W dniu 23.09.2016 o 11:08, Ian Kelling napisał: > The "highlight" binary can, in some cases, determine the language type > by the means of file contents, for example the shebang in the first line > for some scripting languages. Make use of this autodetection for files > which syntax is not known by gitweb. In that case, pass the blob > contents to "highlight --force"; the parameter is needed to make it > always generate HTML output (which includes HTML-escaping). Right. > > Although we now run highlight on files which do not end up highlighted, > performance is virtually unaffected because when we call highlight, we > also call sanitize() instead of esc_html(), which is significantly > slower. This paragraph is a bit unclear, for example it is not obvious what "..., which is significantly slower" refers to: sanitize() or esc_html(). I think it would be better to write: Although we now run highlight on files which do not end up highlighted, performance is virtually unaffected because when we call highlight, it is used for escaping HTML. In the case that highlight is used, gitweb calls sanitize() instead of esc_html(), and the latter is significantly slower (it does more, being roughly a superset of sanitize()). >After curling blob view of unhighlighted large and small text > files of perl code and license text 100 times each on a local > Apache/2.4.23 (Debian) instance, it's logs indicate +-1% difference in > request time for all file types. Also, "curling" is not the word I would like to see. I would say: Simple benchmark comparing performance of 'blob' view of files without syntax highlighting in gitweb before and after this change indicates ±1% difference in request time for all file types. Benchmark was performed on local instance on Debian, using Apache/2.4.23 web server and CGI/PSGI/FCGI/mod_perl. ^^--- select one Or something like that; I'm not sure how detailed this should be. But it is nice to have such benchmark in the commit message. Anyway I think that adding yet another configuration toggle for selecting whether to use "highlight" syntax autodetection or not would be just an unnecessary complication. Note that the performance loss might be quite higher on MS Windows, with its higher cost of fork. But then they probably do not configure server-side highligher anyway. > > Document the feature and improve syntax highlight documentation, add > test to ensure gitweb doesn't crash when language detection is used. Good. > > Signed-off-by: Ian Kelling > --- > Documentation/gitweb.conf.txt | 21 ++--- > gitweb/gitweb.perl | 10 +- > t/t9500-gitweb-standalone-no-errors.sh | 8 > 3 files changed, 27 insertions(+), 12 deletions(-) > > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt > index a79e350..e632089 100644 > --- a/Documentation/gitweb.conf.txt > +++ b/Documentation/gitweb.conf.txt > @@ -246,13 +246,20 @@ $highlight_bin:: We should probably say what does it mean to be "highlight"[1] compatible, but it is outside of scope for this patch, and I think also out of scope of this series. > Note that 'highlight' feature must be set for gitweb to actually > use syntax highlighting. > + > -*NOTE*: if you want to add support for new file type (supported by > -"highlight" but not used by gitweb), you need to modify `%highlight_ext` > -or `%highlight_basename`, depending on whether you detect type of file > -based on extension (for example "sh") or on its basename (for example > -"Makefile"). The keys of these hashes are extension and basename, > -respectively, and value for given key is name of syntax to be passed via > -`--syntax ` to highlighter. > +*NOTE*: for a file to be highlighted, its syntax type must be detected > +and that syntax must be supported by "highlight". The default syntax > +detection is minimal, and there are many supported syntax types with no > +detection by default. There are three options for adding syntax > +detection. The first and second priority are `%highlight_basename` and > +`%highlight_ext`, which detect based on basename (the full filename, for > +example "Makefile") and extension (for example "sh"). The keys of these > +hashes are the basename and extension, respectively, and the value for a > +given key is the name of the syntax to be passed via `--syntax ` > +to "highlight". The last priority is the "highlight" configuration of > +`Shebang` regular expressions to detect the language based on the first > +line in the file, (for example, matching the line "#!/bin/bash"). See > +the highlight documentation and the default config at > +/etc/highlight/filetypes.conf for more details. All right. I guess /etc/highlight/filetypes.conf is the standard location? > + > For example if repositories you are hosting use "phtml" extension for > PHP files, and you want to have correct syntax-high
[PATCH v3 2/2] gitweb: use highlight's shebang detection
The "highlight" binary can, in some cases, determine the language type by the means of file contents, for example the shebang in the first line for some scripting languages. Make use of this autodetection for files which syntax is not known by gitweb. In that case, pass the blob contents to "highlight --force"; the parameter is needed to make it always generate HTML output (which includes HTML-escaping). Although we now run highlight on files which do not end up highlighted, performance is virtually unaffected because when we call highlight, we also call sanitize() instead of esc_html(), which is significantly slower. After curling blob view of unhighlighted large and small text files of perl code and license text 100 times each on a local Apache/2.4.23 (Debian) instance, it's logs indicate +-1% difference in request time for all file types. Document the feature and improve syntax highlight documentation, add test to ensure gitweb doesn't crash when language detection is used. Signed-off-by: Ian Kelling --- Documentation/gitweb.conf.txt | 21 ++--- gitweb/gitweb.perl | 10 +- t/t9500-gitweb-standalone-no-errors.sh | 8 3 files changed, 27 insertions(+), 12 deletions(-) diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt index a79e350..e632089 100644 --- a/Documentation/gitweb.conf.txt +++ b/Documentation/gitweb.conf.txt @@ -246,13 +246,20 @@ $highlight_bin:: Note that 'highlight' feature must be set for gitweb to actually use syntax highlighting. + -*NOTE*: if you want to add support for new file type (supported by -"highlight" but not used by gitweb), you need to modify `%highlight_ext` -or `%highlight_basename`, depending on whether you detect type of file -based on extension (for example "sh") or on its basename (for example -"Makefile"). The keys of these hashes are extension and basename, -respectively, and value for given key is name of syntax to be passed via -`--syntax ` to highlighter. +*NOTE*: for a file to be highlighted, its syntax type must be detected +and that syntax must be supported by "highlight". The default syntax +detection is minimal, and there are many supported syntax types with no +detection by default. There are three options for adding syntax +detection. The first and second priority are `%highlight_basename` and +`%highlight_ext`, which detect based on basename (the full filename, for +example "Makefile") and extension (for example "sh"). The keys of these +hashes are the basename and extension, respectively, and the value for a +given key is the name of the syntax to be passed via `--syntax ` +to "highlight". The last priority is the "highlight" configuration of +`Shebang` regular expressions to detect the language based on the first +line in the file, (for example, matching the line "#!/bin/bash"). See +the highlight documentation and the default config at +/etc/highlight/filetypes.conf for more details. + For example if repositories you are hosting use "phtml" extension for PHP files, and you want to have correct syntax-highlighting for those diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl index 6cb4280..44094f4 100755 --- a/gitweb/gitweb.perl +++ b/gitweb/gitweb.perl @@ -3931,15 +3931,16 @@ sub guess_file_syntax { # or return original FD if no highlighting sub run_highlighter { my ($fd, $highlight, $syntax) = @_; - return $fd unless ($highlight && defined $syntax); + return $fd unless ($highlight); close $fd; + my $syntax_arg = (defined $syntax) ? "--syntax $syntax" : "--force"; open $fd, quote_command(git_cmd(), "cat-file", "blob", $hash)." | ". quote_command($^X, '-CO', '-MEncode=decode,FB_DEFAULT', '-pse', '$_ = decode($fe, $_, FB_DEFAULT) if !utf8::decode($_);', '--', "-fe=$fallback_encoding")." | ". quote_command($highlight_bin). - " --replace-tabs=8 --fragment --syntax $syntax |" + " --replace-tabs=8 --fragment $syntax_arg |" or die_error(500, "Couldn't open file or run syntax highlighter"); return $fd; } @@ -7063,8 +7064,7 @@ sub git_blob { my $highlight = gitweb_check_feature('highlight'); my $syntax = guess_file_syntax($highlight, $file_name); - $fd = run_highlighter($fd, $highlight, $syntax) - if $syntax; + $fd = run_highlighter($fd, $highlight, $syntax); git_header_html(undef, $expires); my $formats_nav = ''; @@ -7117,7 +7117,7 @@ sub git_blob { $line = untabify($line); printf qq!%4i %s\n!, $nr, esc_attr(href(-replay => 1)), $nr, $nr, - $syntax ? sanitize($line) : esc_html($line, -nbsp=>1); + $highlight ? sanitize($line) : esc_html($line, -nbsp=>1);