Hi everyone,

tl;dr: too often, HTML is created with ad-hoc string manipulation.
This is about an attempt to promote better practices, particularly in
the Ruby world.


I've been speaking at a couple of Ruby conferences in recent months
about using a more structured approach to HTML generation. I wanted to
bring this to your attention because the idea has been greatly
influenced by Langsec. Basically it's taking some of the principles
for secure parsers and applying them to language generators.

The main selling point is increased security, especially preventing
XSS. The current practice is to use a combination of template
languages and "helpers", functions that return strings representing
HTML fragments. Because of this, the semantics of a string are unclear:
it can be mere textual data or a serialized fragment of an HTML
document. The programmer needs to constantly indicate this difference
by manually adding calls to escape the HTML, and this is error-prone.
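
To make the ambiguity concrete, here is a minimal sketch in Ruby. The
helper names (bold, greeting_unsafe, greeting_escaped) are hypothetical
and invented for illustration; only ERB::Util.html_escape comes from
Ruby's standard library.

```ruby
require "erb"

# Hypothetical helper in the common style: it returns a String that is
# meant to be read as an HTML fragment.
def bold(text)
  "<b>#{text}</b>"
end

# The escape call was forgotten, so user input is treated as markup.
def greeting_unsafe(name)
  "<p>Hello, " + bold(name) + "</p>"
end

# The programmer remembered to escape at the one spot where it matters.
def greeting_escaped(name)
  "<p>Hello, " + bold(ERB::Util.html_escape(name)) + "</p>"
end

user_input = "<script>alert(1)</script>"
puts greeting_unsafe(user_input)   # script tag survives as markup
puts greeting_escaped(user_input)  # script tag is rendered as text
```

Both methods return a plain String, so nothing in the type tells the
caller whether the value is text or markup. The only difference is one
manually placed escape call.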

What's worse, it's possible to generate all kinds of badly structured,
invalid documents, and you might not know until your app is running.
Or never at all, because browsers are so forgiving about these things.
But all of this adds up to make your app more vulnerable to injection
attacks in your HTML.

The suggested alternative is to first create a syntax tree of the
document you want to transmit, and only generate serialized HTML right
before it gets sent over the wire. What is called "Full Recognition
Before Processing" for parsers can be rephrased as "Full Generation
After Processing" for generators. That is, decouple your program logic
from the language generator.
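
As a rough illustration of the idea (this is not Hexp's API), the
sketch below builds a tree first and serializes it in one place. Node,
el and to_html are hypothetical names chosen for this example; only
ERB::Util.html_escape is from the standard library.

```ruby
require "erb"

# Hypothetical minimal node type, for illustration only.
Node = Struct.new(:tag, :attrs, :children)

# Build an element node; children are Nodes or plain Strings (text).
def el(tag, attrs = {}, *children)
  Node.new(tag, attrs, children)
end

# Serialization is the single place where escaping happens.
def to_html(node)
  case node
  when String
    ERB::Util.html_escape(node)
  when Node
    attrs = node.attrs.map { |k, v| %( #{k}="#{ERB::Util.html_escape(v)}") }.join
    inner = node.children.map { |child| to_html(child) }.join
    "<#{node.tag}#{attrs}>#{inner}</#{node.tag}>"
  else
    raise ArgumentError, "unexpected node: #{node.inspect}"
  end
end

# Program logic only builds data; it never concatenates markup.
def greeting(name)
  el(:p, { class: "greeting" }, "Hello, ", el(:b, {}, name))
end

html = to_html(greeting("<script>alert(1)</script>"))
puts html
```

Here a String in the tree always means text, so the serializer can
escape it unconditionally. The sketch leaves out things a real
generator would need, such as validating tag and attribute names and
handling void elements.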

Because a dedicated component now handles the language generation,
its output can be made strictly context-free HTML at all times, thus
sticking to the good part of Postel's principle, "Be conservative in
what you send", and making it easier to reason about how browsers will
handle this input.

I'm focusing on Ruby because it's the language I use and love, and
because I want the Ruby web community to do better. But as I said, the
syntax tree approach is far removed from established practices. And
while I'm convinced that using a data structure/syntax tree approach
is inherently more expressive and more productive for developers, as
well as more secure, the fact of the matter is that you lose the huge
ecosystem of string-based tools and libraries for Ruby that are out
there. So I'm trying to do some foundational work to bridge this gap.

I started a project called Hexp [1], which is an API for easily and
efficiently creating and manipulating HTML syntax trees. It's already
quite usable; I'm using it on several smaller projects.
Hopefully this can form the groundwork for a new collection of tools.

I'm still waiting for the videos of my talks to come on-line. The
slides of my latest talk at Eurucamp, Berlin are available at [2]. The
slides of my previous talk at Rulu, Lyon are less concise but cover
more theory [3]. (Hit space to cycle through the slides.)

I know Ruby isn't the only language where this pattern exists.
Hopefully we can get some of these ideas across to more application
developers out there.

Thanks for reading this far :),
Arne


[1] http://github.com/plexus
[2] http://arnebrasseur.net/talks/eurucamp2013/presentation.html
[3] http://arnebrasseur.net/talks/rulu2013/index.html

----
Arne Brasseur
Twitter/Github : @plexus
_______________________________________________
langsec-discuss mailing list
[email protected]
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss
