Hello all,

I talk about parsers regularly with developers, and some of them objected that with JSON, they never need to worry about safe parsing, since plenty of libraries take care of that for them. I would not bet that most JSON parsers are correct, but there is a more interesting point here: even if the underlying encoding format is handled correctly, the way the data is integrated into the application is part of the schema.
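To make that point concrete, here is a tiny sketch (the `user` document and its fields are made up for illustration) using Python's standard json module: both inputs are perfectly valid JSON, so the parser accepts them, yet one of them silently violates the schema the application has in its head.

```python
import json

# Both documents are syntactically valid JSON: the parser accepts both.
good = json.loads('{"name": "alice", "roles": ["admin"]}')
bad = json.loads('{"name": "alice", "roles": "admin"}')

def role_count(user):
    # Implicit, unchecked schema assumption: "roles" is a list of strings.
    return len(user["roles"])

print(role_count(good))  # 1
print(role_count(bad))   # 5 -- len() of the string "admin", silently wrong
```

The parser did its job in both cases; the bug lives in the unstated contract between the data and the code that consumes it.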
In fact, a lot of vulnerabilities happen because, even after the data is correctly deserialized, the way the resulting data structure is organized causes bugs. See for example:

- PHP unserialize bugs: https://www.owasp.org/index.php/PHP_Object_Injection
- Ruby mass assignment vulnerabilities: https://github.com/rails/rails/issues/5228 and http://ronin-ruby.github.io/blog/2013/01/28/new-rails-poc.html

This kind of bug happens mostly because mapping serialized data to internal data structures is a pain, so the natural developer reaction is to deserialize automatically and access the needed fields directly, ignoring any additional data. As a result, a lot of recent serialization libraries were not built on the model "give me an instance of class A from this buffer", but on the model "give me an object from this buffer, and I'll believe it is of class A, since I am in the right part of the code". The encoding layer below is now tasked with interpreting the data without any context on its usage.

So we now have incomplete and unsafe solutions for automatically handling data in APIs, while in the past, the main practice was to describe the API in one document and, from there, generate clients and servers in any language, with complete data validation and error codes. But we developers do not like SOAP ;)

Fortunately, the approach of generating parsers for APIs is coming back with tools like Protobuf, Thrift or Avro. But I do not know if they are enough. Could someone point me to interesting work on data validation? I have a few questions that worry me right now:

- should schemas be obtained automatically? There were a lot of bugs with this in XML. I now see an interesting approach: a schema directory from which parsers can fetch new schemas that deserialize to the same data structures (so you can update a protocol without updating the code)
- are solutions like JSON Schema ( http://json-schema.org/examples.html ) enough?
- even with correct deserialization and schema validation, can we catch bugs in the way the data structure is organized? Example (seen somewhere in production): a boolean indicating whether another field is in one format or another. This seems silly, but without validation tools, people often come up with that kind of solution, where parts of the data are interdependent.

So, here are a few problems we will have to worry about once parsing gets better :)

Best regards,
Geoffroy Couprie
--
http://geoffroycouprie.com

_______________________________________________
langsec-discuss mailing list
langsec-discuss@mail.langsec.org
https://mail.langsec.org/cgi-bin/mailman/listinfo/langsec-discuss