Thanks for the details!

Contextual auto-sanitization is definitely interesting for the
project, but I don't yet know what approach will be the ideal. Some of
my thoughts/concerns follows.

1) I think that for a template engine it's useful to know that we have
a HTML or CSS in a string-like value for other reasons than security
too. For me, these are like types in a type system, where String
simply means plain text.

2) In the solution you show it's assumed that the literals that the
template author has entered needs no escaping. I understand that from
security standpoint, but not so much as far as correctness and
understandability goes. When the template author enters a literal he
has a concrete format in mind. Like he enters the "R&D" literal, he
meant it to be plain text, and so, it has to be escaped when it's
inserted into HTML context, or else the output will be broken (a
correctness problem, even if not a security problem). Of course we
can't read the mind of the template author, so FM assumes what I
believe is the easiest to grasp and produces the least dangerous
template bugs when misunderstood: string literals are plain text,
because other strings that come from outside (like business data from
the data base) are also usually plain text. If the literal (or any
other string) is not plain text, you have to be explicit about it.
Another example is when you pass in a literal string message to show
to a macro. The macro will have to print the message HTML-escaped or
not, based on if the message is plain text or HTML, for correctness.
It would be anti-intuitive to decide that based on if the message was
specified as a literal or as some variable. I mean, programmers expect
that the literal "foo", and myVar whose value is "foo", are two
indistinguishable things after you have passed the value over to
something. I wouldn't break that natural expectation. In 2.3.24, by
"My <b>message</b>"?noEsc, you have explicitly marked the string as
HTML (if the static output format is HTML at that point), otherwise
it's plain text. The reason why the first value behaves differently is
simply that it has a different type (?isString gives false for it,
?upperCase will give error, etc... it's not a string,)

3) Substring level tracking seems to be problematic for FM3+, because
I consider switching from adapter pattern (called ObjectWrapper in FM)
to Meta-Object-Protocol eventually. If we do that, we will only have
String-s, which can't carry the necessary extra information. But, when
the value is converted to a "richer" format, like HTML, CSS, CSSName,
CSSValue, JavaScriptStringPart, URLPart, etc., the value is not a
string anymore, and so its Java class is provided by FreeMarker, not
the standard Java API. So if you concatenate together such richer
values, no information has to be lost. The way the user ends up having
these rich values is usually by calling some "escape" and "no-escape"
operations, which should always give "rich" types. Like foo?escUrl
shouldn't give a String, but a value of type URLPart. Then we can say
that String is always just simple plain text, and so we don't have to
care what other plain text pieces it was put together from, because
plain text needs to be fully escaped in any non-plain-text context
anyway. Not only for security reasons, but because we have two
different formats and so a conversion is needed (correctness).

4) The solution you show tracks the context kind on runtime. I'm not
yet sure if doing that statically is unfeasible. And indeed Closure
Templates goes for the static approach too. When you write a piece of
template, you know what kind of context it meant to appear in (like
inside <script>, or between HTML tags, or inside a HTML tag as
attributes, etc.). So that information can be attached on parsing time
(manually where it's not possible to auto-detect). If then, on
runtime, your piece of template is invoked from another kind of
static(!) context, which isn't know to be compatible with static
context kind of the invoked part, that's an error. Like, you have a
piece of template that generates HTML, and it's invoked from inside
<script>...</script>, that's an error. Even if we could escape it to
be safe, it doesn't make sense to call it from there (correctness
again).

4) Regarding non-String types in FreeMarker. Currently, before they
are printed, they are converted to String (or to
TemplateMarkupOutputModel since 2.3.24) by the configured formatter.
Thus string auto-escaping is applied as usual. So inserting a number
or date is no less safe than inserting a string.

-- 
Thanks,
 Daniel Dekany

Thursday, March 17, 2016, 6:56:15 PM, Matthew Van Gundy wrote:

> On 3/16/16 5:33 AM, Daniel Dekany wrote:
>> I would like to figure out if forking is indeed needed for this to
>> work. And if so, if that could be avoided in the case of FM 3.
>
> Agreed.
>
>> The way it works in 2.3.24 is simply that anything that's not a
>> TemplateMarkupOutputModel (which might should be called
>> TemplateSanitizedContantModel) is untrusted (to use your terms), and
>> so will be escaped. So strings (TemplateScalarModel-s) are always
>> untrusted. When you concatenate trusted and untrusted values, the
>> untrusted part is escaped, so the result is trusted. So that's how it
>> works in 2.3.24.
>
> The way our modifications work is that any escaping that is not
> explicitly requested is delayed until the object is actually being
> written out to the output stream because the correct escaping to apply
> isn't known until the output context is known.  In order to accomplish
> this, we support labeling substrings down to the character level.  Only
> data that is not already escaped correctly for the current output
> context will be escaped.  For example if the data model specifies:
>
>   data = "javascript:/<script>1/(alert(1337))//</script>"
>
> then rendering the following template:
>
>   <#assign title = "<h1>Hello ${data}</h1>">
>   <#assign link = "<a href='${data}'></a>">
>   ${title}
>   ${link}
>
> will result in:
>
>   <h1>Hello
> javascript:/&lt;script&gt;1/(alert(1337))//&lt;/script&gt;</h1>
>   <a href='#zSoyz'></a>
>
> This illustrates both precise data flow tracking and context-sensitive
> sanitization.  First, because the <h1> and <a> tags are static strings
> specified by the programmer, they are automatically labeled as
> ALWAYS_SAFE, allowing them to be used in any context without further
> escaping.  Because "${data}" comes from the data model, it is UNSAFE
> (unsanitized) by default.  In memory, the variables are represented as:
>
> title = "<h1>Hello ":ALWAYS_SAFE +
>         "javascript:/<script>...</script>":UNSAFE + "</h1>":ALWAYS_SAFE
>
> link = "<a href='":ALWAYS_SAFE +
>        "javascript:/<script>...</script>":UNSAFE + "'></a>":ALWAYS_SAFE
>
> When the template is rendered and DollarVariable writes its value to the
> output Writer, our custom writer class iterates through each
> distinctly-labeled substring in the resulting SimpleScalar.
>
> When rendering title, in output context HTML_PCDATA, "<h1>Hello
> ":ALWAYS_SAFE is output without sanitization because it is safe for any
> context (ALWAYS_SAFE).  It is parsed to determine the new context of the
> output buffer, HTML_PCDATA in this case.
> "javascript:/<script>...</script>":UNSAFE is HTML-escaped to make it
> safe for HTML_PCDATA context, leaving the output buffer context
> unchanged (HTML_PCDATA).  And, "</h1>":ALWAYS_SAFE is output without
> sanitization because it is safe for any context.  After parsing, the the
> output buffer is determined to be in HTML_PCDATA context once again.
>
> Moving on to "${link}", "<a href='":ALWAYS_SAFE is output without
> sanitization because it is safe for any context.  After parsing the
> string, the output buffer context is determined to be
> NORMAL_URI_SINGLE_QUOTE_START.  In NORMAL_URI_SINGLE_QUOTE_START
> context, if the protocol scheme is not known to be safe, the value is
> transformed into an innocuous URI.  Otherwise, any characters which
> could allow the value to break out of the attribute are HTML-escaped.
> Because "javascript:/<script>...</script>":UNSAFE does not begin with a
> known-safe protocol scheme, it is transformed to the innocuous URI
> "#zSoyz".  After parsing the result, the output buffer is determined to
> be in NORMAL_URI_SINGLE_QUOTE_FRAGMENT context.  Lastly,
"'>></a>":ALWAYS_SAFE is output without escaping because it is always
> safe, parsing determines that the output buffer in HTML_PCDATA context.
>
> By delaying sanitization until output, we can ensure that any data that
> has not been sanitized correctly for the context where it appears can be
> sanitized specifically for that context.
>
>> What marks a value to be trusted or untrusted in your solution?
>
> By default, all values appearing in any template (i.e. static values
> written by a template author) and all dynamic values computed only from
> static values are considered trusted.  They are labeled ALWAYS_SAFE and
> will not be sanitized.
>
> All values coming from the data model are considered untrusted because
> they may have been specified by an attacker.  They are labeled UNSAFE by
> default.  If values in the data model have been sanitized in some way by
> the application or are known to be trusted, when the value is added to
> the data model it can be labeled explicitly.  Conversely, the template
> author can downgrade the label on any static data.  We've also modified
> all escaping built-in functions to label their output appropriately.
> For example, if x = "<a>": x?html -> "&lt;a&gt;":SAFE(HTML).
>
>> Especially, why do you track such state for non-string values? After
>> all, no hacker can put HTML tags and such into a number or boolean or
>> date. Is labelling all type of values is indeed unavoidable, or we
>> could just store more detailed information in
>> TemplateMarkupOutputModel and that could solve this problem?
>
> Initially we labeled all data types because characters present in a
> variety of types have special meaning in various contexts.  For instance
> in "<div data-value=${datetime}:${result}/>" a space in the datetime
> value formatted as a string could break out of the data-value attribute
> and cause ${result} to be treated as an attribute rather than an
> attribute value.  Similarly, in "<script>result =
> value.match(/${double}/)</script>", the decimal point in a double will
> be interpreted as a wildcard if it is not escaped correctly.
>
> In hindsight, I think I might argue that non-string data types like
> floats, dates, etc. should _never_ be permitted to change the context
> and should always be escaped appropriately for the enclosing context.
> In that case, we could probably stop tracking data flow for non-string
> types.
>
>> 
>> How do you track the output browser's parser state and then where does
>> the final sanitization happen? My first idea is passing a Writer to
>> Template.process that parses what's written into it, and then ${} has
>> to call some special Writer methods to which it can pass the
>> sanitization information. The problem with such a solution is that if
>> you do output capturing (<#assign x>...</#assign>) or apply any kind
>> of filter directive, as they are working with "unlabelled" character
>> streams, there will be no browser's state. In FM3, I think we will use
>> some kind of FreeMarkerWriter (a term you have used too!) that treats
>> the output stream more than just a stream of characters, and the plain
>> Writer will be only the final stage. My initial motivation there was
>> better white-space handling, but the same thing can also be useful for
>> sanitization. That means that we dumb down the output to a mere stream
>> of characters as late as possible. Instead, you have a stream of
>> objects where some objects are TemplateMarkupOutputModel-s. Just an
>> idea anyway.
>
> This is essentially what we do.  When I mentioned FreeMarkerWriter, I
> forgot that was a class that we added to make the Writer API aware of
> FreeMarker TemplateModels.  If I recall correctly, Environment renders
> the template by traversing the AST and for each statement that generates
> output, it converts it to a string and writes it to its java.io.Writer.
>  The vanilla FreeMarkerWriter which just wraps java.io.Writer and
> converts labeled TemplateScalarModels to java.lang.String and writes them.
>
> Since our labeled strings already capture the degree to which each
> substring has been sanitized, when auto-sanitization is enabled, the
> vanilla writer gets replaced with a sanitizing writer that essentially does:
>
>     public void write(TemplateScalarModel node) throws IOException {
>         SimpleScalar ss = null;
>         try {
>             ss = node.getAsSimpleScalar();
>         } catch(TemplateModelException e) {
>             throw new IOException(e);
>         }
>         // Separately sanitize each discretely-labeled substring
>         for( SimpleScalar s : ss.substrings() ) {
>             String escaped = sanitizeFor(s, context);
>             context = parse(escaped, context);
>             m_writer.write(escaped);
>         }
>     }
>
> As for handling <#assign x>...</#assign>, in case it's not clear from
> the foregoing, a labeled TemplateScalarModel will be assigned to x and
> sanitization will only happen when x is actually output (because at that
> point, the context *is* defined).
>
>> 
>> BTW, what's your fork is based on? If on 2.3.23, as opposed to on the
>> 2.3-gae head from a no more than a few months ago, you will find that
>> there were a lot of changes due to code formatting clean up...
>> 
>
> Unfortunately, it's based on 2.3.20.  (We did the bulk of the work in
> late 2013, early 2014 but it's taken a while to borrow enough time to
> navigate the open source contribution process.)  Obviously, we'll have
> to spend some time bringing it up to date with the current release if it
> were going to be a parallel branch of the current 2.3 release.
>
> If you think it's something that FreeMarker might potentially be
> interested in, even for 3.0, I think that gives me enough information to
> push along our contribution request.  At that point, we would be able to
> share the code and have concrete discussions about design and
> implementation rather than the more hand-wavy stuff above.
>
> Thanks,
> Matt
>

Reply via email to