Hmm, I'm not sure if I submitted this correctly because it is showing up on https://forum.nginx.org/read.php?29,271801 from "Anonymous user". I'll see if I can figure out what I did wrong. Sorry, I haven't submitted patches via a mailing list before.
*Joshua Spence* *mobile:* +61 422 990 263 *email:* [email protected] *web:* http://www.joshuaspence.com On 3 January 2017 at 17:24, <[email protected]> wrote: > # HG changeset patch > # User Joshua Spence <[email protected]> > # Date 1483420178 -39600 > # Tue Jan 03 16:09:38 2017 +1100 > # Node ID 8e734d01f026c4346b46006ee9b558031439be61 > # Parent a42afc225e98afe1f7c3b428c81c8af7eba362c0 > Escape strings produced by the `%V` format specification > > Currently nginx error logs loosely adhere to the following pattern: > > ``` > %{TIMESTAMP} [%{SEVERITY}] %{PROCESS_ID}#%{THREAD_ID}:( > *%{CONNECTION_ID})?%{MESSAGE}(, %{KEY}: "%{VALUE}")* > ``` > > In our environment we are using Filebeat to ship nginx error logs into > Logstash. Our Logstash filter for parsing nginx error logs is as follows: > > ``` > filter { > grok { > match => { > "message" => "(?<timestamp>%{YEAR}/%{MONTHNUM2}/%{MONTHDAY} > %{TIME}) \[%{LOGLEVEL:severity}\] > %{POSINT:process_id:int}#%{NONNEGINT:thread_id:int}:(?: > \*%{NONNEGINT:connection_id:int})? (?<message>[^,]*)(?:, > %{GREEDYDATA:nginx_keyvalue_data})?" > } > overwrite => ["message"] > } > > date { > match => ["timestamp", "yyyy/MM/dd HH:mm:ss"] > remove_field => ["timestamp"] > } > > kv { > field_split => ", " > remove_field => "nginx_keyvalue_data" > source => "nginx_keyvalue_data" > value_split => ": " > } > } > ``` > > The problem, however, is that it is difficult to unambigiously parse the > key-value data from the nginx error log format because of a lack of data > sanitization. Specifically, consider the following entry from our nginx > error logs: > > ``` > 2016/11/15 03:45:51 [error] 23770#23770: *21535000 open() > "/REDACTED/+"BauGutachter"++41+2016+.ch&tbo=1" failed (2: No such file or > directory), client: REDACTED, server: REDACTED, request: "GET > /job/+%22BauGutachter%22++41+2016+.ch&tbo=1 HTTP/1.1", host: "REDACTED", > referrer: "https://www.google.com.au/search?hl=en&q="BauGutachter" +41 > 2016 .ch&num=100&start=0" > ``` > > Specifically, the problem in the above sample is the `referrer` field, the > value of which contains double quotes. I think that it makes sense to > escape `\` and `"` characters before they are written to the error log. > > diff -r a42afc225e98 -r 8e734d01f026 src/core/ngx_string.c > --- a/src/core/ngx_string.c Tue Dec 27 17:23:08 2016 +0300 > +++ b/src/core/ngx_string.c Tue Jan 03 16:09:38 2017 +1100 > @@ -15,7 +15,7 @@ > const u_char *basis, ngx_uint_t padding); > static ngx_int_t ngx_decode_base64_internal(ngx_str_t *dst, ngx_str_t > *src, > const u_char *basis); > - > +static uintptr_t ngx_escape_string(u_char *dst, u_char *src, size_t size); > > void > ngx_strlow(u_char *dst, u_char *src, size_t n) > @@ -235,7 +235,7 @@ > v = va_arg(args, ngx_str_t *); > > len = ngx_min(((size_t) (last - buf)), v->len); > - buf = ngx_cpymem(buf, v->data, len); > + buf = (u_char *) ngx_escape_string(buf, v->data, len); > fmt++; > > continue; > @@ -1844,6 +1844,43 @@ > } > > > +static uintptr_t > +ngx_escape_string(u_char *dst, u_char *src, size_t size) > +{ > + u_char ch; > + ngx_uint_t len; > + > + if (dst == NULL) { > + len = 0; > + > + while (size) { > + ch = *src++; > + > + if (ch == '\\' || ch == '"') { > + len++; > + } > + > + size--; > + } > + > + return (uintptr_t) len; > + } > + > + while (size) { > + ch = *src++; > + > + if (ch == '\\' || ch == '"') { > + *dst++ = '\\'; > + } > + > + *dst++ = ch; > + size--; > + } > + > + return (uintptr_t) dst; > +} > + > + > void > ngx_str_rbtree_insert_value(ngx_rbtree_node_t *temp, > ngx_rbtree_node_t *node, ngx_rbtree_node_t *sentinel) >
_______________________________________________ nginx-devel mailing list [email protected] http://mailman.nginx.org/mailman/listinfo/nginx-devel
