GitHub user elprans commented on a diff in the pull request:
https://github.com/apache/thrift/pull/1274#discussion_r147559605
--- Diff: lib/py/src/ext/protocol.tcc ---
@@ -419,18 +419,30 @@ bool ProtocolBase<Impl>::encodeValue(PyObject* value, TType type, PyObject* type
       case T_STRING: {
         ScopedPyObject nval;
    +    Py_ssize_t len;
    +    char *str;
         if (PyUnicode_Check(value)) {
           nval.reset(PyUnicode_AsUTF8String(value));
           if (!nval) {
             return false;
           }
         } else {
    +      if (isUtf8(typeargs)) {
    +        if (PyBytes_AsStringAndSize(value, &str, &len) < 0) {
    +          return false;
    +        }
    +        // Check that input is a valid UTF-8 string.
    +        nval.reset(PyUnicode_DecodeUTF8(str, len, 0));
    +        if (!nval) {
    +          return false;
    +        }
    +      }
--- End diff --
I see no other way to check the validity without breaking backwards
compatibility. On the other hand, this removes a real hazard of garbage
data being written into Unicode fields.
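
For illustration only, here is a rough Python-level sketch of what the added branch appears to check: a bytes value destined for a UTF-8 field is decoded strictly, and a decoding failure rejects the value. The helper name `validate_utf8_bytes` is hypothetical and not part of Thrift; the sketch assumes the `0` passed as the `errors` argument of `PyUnicode_DecodeUTF8` selects strict error handling, which matches the default of `bytes.decode`.

```python
def validate_utf8_bytes(value: bytes) -> bytes:
    """Hypothetical helper: return `value` unchanged if it is valid UTF-8.

    Raises UnicodeDecodeError otherwise, roughly mirroring the
    `return false` path in the diff when decoding fails.
    """
    # Strict decoding is the default; the decoded string is discarded
    # because only the validity of the input matters here.
    value.decode("utf-8")
    return value


# Valid UTF-8 bytes pass through untouched.
ok = validate_utf8_bytes("caf\u00e9".encode("utf-8"))

# Invalid byte sequences are rejected instead of being written out.
try:
    validate_utf8_bytes(b"\xff\xfe")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The cost is one extra decode per bytes value, which is the trade-off the comment describes: existing callers that pass bytes keep working, but invalid data no longer reaches the wire.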
---