Yasuhito FUTATSUKI wrote on Tue, Dec 11, 2018 at 01:18:19 +0900:
> Hi,
>
> I hear that property values may not be valid UTF-8 strings if they are
> set with skipping validation.

Let's be more precise.  In general, a property's name must satisfy the
conditions documented in svn_prop_name_is_valid()'s docstring and
a property's value is an opaque binary blob, just like file contents (when
svn:eol-style and svn:keywords are unset).

However, some specific properties have additional requirements on their
values.  If a property passes svn_prop_is_boolean() then its value MUST be
"*", and if a property is an svn:* property then its value MUST be UTF-8
with LF line endings; the latter is enforced by svn_repos__validate_prop().

> (https://twitter.com/jun66j5/status/1067295499907084288 (written in
> Japanese language); https://trac.edgewall.org/ticket/4321)
>
> However, current swig-py3 typemaps to return those values use
> PyStr_FromStringAndSize() to convert svn_string_t into Python's str object,
> which raises UnicodeDecodeError for invalid UTF-8 octet sequences, so there
> is no way to get those strict values.
>
> To resolve this issue, there seem to be some options:
>
> (1). Those API always return str (Unicode) with 'strict' conversion.
>      If an error occurred, abandon getting these values. (current
>      implementation)
> (2). Those API always return str with 'surrogateescape' conversion.
>      If an application wants to detect irregular data, search for
>      U+DC00-U+DCFF characters in values.
> (3). Those API always return bytes.  If applications want to handle them
>      as str, decode them on the application side.
> (4). Those API return str for valid data, and return bytes for invalid data
>      to avoid missing a way to get the data.
> (5). other (I have no idea..., though)
>
> I think (2) or (3) is appropriate, but I don't have confidence.
> Any ideas?

Generic APIs that work on any property should return bytes.  More specific
APIs that work on properties that have further restrictions (svn:needs-lock,
svn:date), or even a structured value (svn:externals, svn:mergeinfo), may
use appropriately more specific data types.

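For concreteness, here is a minimal pure-Python sketch of how options (1) to (3) behave on a value that is not valid UTF-8.  It does not touch the bindings at all; the sample value and the helper has_escaped_bytes() are made up for illustration.

```python
# Hypothetical property value: Latin-1 encoded bytes, which are not valid UTF-8.
raw = b"caf\xe9"

# Option (1): strict decoding (the default) raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option (2): surrogateescape maps each undecodable byte 0xNN to U+DCNN.
escaped = raw.decode("utf-8", errors="surrogateescape")


def has_escaped_bytes(s):
    """Return True if s contains lone surrogates left by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in s)


# The original bytes are recoverable by encoding with the same error handler.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

# Option (3): keep bytes, and let the caller decode once it knows the encoding.
as_latin1 = raw.decode("latin-1")
```

Note that the str produced by option (2) cannot be encoded with the default 'strict' handler, so every consumer has to know about the escaping; with bytes, that decision stays visibly with the caller.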
In general, if an API has an input of type bytes and needs to return str,
I would have it throw an exception if the conversion fails, so the caller is
forced to deal with the failure mode explicitly.  It'd be fine to use
surrogateescape or return bytes *if the caller has explicitly requested
that*, but the default postcondition should be as simple as possible: "This
function either returns str or raises an exception".

Makes sense?

Cheers,

Daniel
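
To illustrate the postcondition described above, here is a rough sketch of such a conversion helper.  The name prop_value_to_str() and its `errors` parameter are hypothetical and are not part of the Subversion bindings; the only point is that the lenient behaviour is opt-in.

```python
def prop_value_to_str(value, errors="strict"):
    """Decode a property value (bytes) as UTF-8 and return str.

    Hypothetical helper, not an existing bindings API.  By default it
    either returns str or raises UnicodeDecodeError; a caller who wants
    lenient behaviour must request it, e.g. errors="surrogateescape".
    """
    if not isinstance(value, bytes):
        raise TypeError("property value must be bytes")
    return value.decode("utf-8", errors=errors)


# Default: invalid input surfaces as an exception the caller must handle.
try:
    prop_value_to_str(b"\xff\xfe")
    default_raised = False
except UnicodeDecodeError:
    default_raised = True

# Explicit opt-in: invalid bytes become lone surrogates instead.
lenient = prop_value_to_str(b"\xff\xfe", errors="surrogateescape")
```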