Hello all,

I talk about parsers regularly with developers, and some of them
objected that with JSON, they never need to worry about safe parsing,
since a lot of libraries take care of that for them. I would not bet
that most JSON parsers are correct, but there is a more interesting
point here: even if the underlying encoding layer were correct, the
way the data is integrated is part of the schema.

In fact, a lot of vulnerabilities happen even after the data has been
correctly deserialized, because of the way the resulting data
structure is organized.

See for example:
- PHP unserialize bugs: https://www.owasp.org/index.php/PHP_Object_Injection
- Ruby mass assignment vulnerabilities:
https://github.com/rails/rails/issues/5228
- http://ronin-ruby.github.io/blog/2013/01/28/new-rails-poc.html

This kind of bug happens mostly because mapping serialized data to
internal data structures is a pain, so the natural developer reaction
is to deserialize automatically and access the needed fields directly,
without caring about any additional data.
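To make that shortcut concrete, here is a hypothetical Python sketch
(the User class and update_from_request helper are invented for
illustration) of the mass-assignment pattern behind the Rails issue
linked above:

```python
import json

class User:
    def __init__(self):
        self.name = ""
        self.admin = False

def update_from_request(user, body):
    # The tempting shortcut: copy every field from the decoded JSON
    # onto the object, without caring about additional data.
    for key, value in json.loads(body).items():
        setattr(user, key, value)

u = User()
# An attacker adds a field the endpoint never meant to expose:
update_from_request(u, '{"name": "mallory", "admin": true}')
assert u.admin is True  # mass assignment succeeded
```

The JSON parser did its job perfectly here; the bug lives entirely in
how the decoded data is mapped onto the internal structure.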

So a lot of recent serialization libraries were built not on the
model "give me an instance of class A from this buffer", but on the
model "give me an object from this buffer, and I'll believe it is of
class A, since I am in the right part of the code". The encoding
layer below is now tasked with interpreting the data without any
context on its usage.
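Python's pickle is an extreme case of that second model; a small
sketch (the Config class is invented for illustration):

```python
import pickle

class Config:
    def __init__(self, debug=False):
        self.debug = debug

# The decoder gets no context on usage: it rebuilds whatever object
# the buffer describes, not the class the caller expects.
buf = pickle.dumps(["not", "a", "Config"])
obj = pickle.loads(buf)

# The calling code "believes it is of class A"; the only guard is an
# explicit check after the fact.
print(isinstance(obj, Config))  # False
```

With a typed scheme, a buffer that does not describe a Config would
be rejected at the parsing boundary instead of deep inside the code.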

So, now, we have incomplete and unsafe solutions to automatically
handle data in APIs. In the past, the main practice was to describe
the API in one document, and from there generate clients and servers,
in any language, with complete data validation and error codes. But
we developers do not like SOAP ;)

Fortunately, the approach of generating parsers for APIs is coming
back with tools like Protobuf, Thrift or Avro.

But I do not know if they are enough. Could someone point me to
interesting works on data validation? I have a few questions that
worry me right now:
- should schemas be obtained automatically? There were a lot of bugs
with this in XML. I now see an interesting approach: a schema
directory from which parsers can fetch new schemas to deserialize to
the same data structures (so you can update a protocol without
updating the code)
- are solutions like JSON Schema
(http://json-schema.org/examples.html) enough?
- even with correct deserialization and schema validation, can we
catch bugs in the way the data structure is organized? Example (seen
somewhere in production): a boolean indicating whether another field
is in one format or another. This seems silly, but without validation
tools, people often come up with that kind of solution, where parts
of the data are interdependent.
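As a sketch of that last point, here is a hand-rolled check in Python
(the field names "legacy" and "payload" are invented for
illustration); JSON Schema draft-07 can express the same dependency
with its if/then/else keywords:

```python
import json

# Hypothetical message, seen-in-production style: "legacy" tells the
# reader which format "payload" uses, so the two fields are
# interdependent -- a per-field type check alone cannot catch a
# mismatch between them.
def validate(msg):
    if not isinstance(msg.get("legacy"), bool):
        return False
    if msg["legacy"]:
        # legacy payloads are opaque strings
        return isinstance(msg.get("payload"), str)
    # new-style payloads are structured objects
    return isinstance(msg.get("payload"), dict)

assert validate(json.loads('{"legacy": true, "payload": "v1:abc"}'))
assert not validate(json.loads('{"legacy": true, "payload": {"v": 2}}'))
```

The point is that the validity of one field depends on the value of
another, which is exactly the kind of constraint that generic
deserialization does not check for you.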

So, here are a few problems we will have to worry about once parsing
gets better :)

Best regards,

Geoffroy Couprie

-- 
http://geoffroycouprie.com
_______________________________________________
langsec-discuss mailing list
langsec-discuss@mail.langsec.org
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss