Jan Cholasta wrote:
On 29.2.2012 15:45, Rob Crittenden wrote:
Jan Cholasta wrote:
On 28.2.2012 18:58, Rob Crittenden wrote:
Jan Cholasta wrote:
On 28.2.2012 18:02, Petr Viktorin wrote:
On 02/28/2012 04:45 PM, Rob Crittenden wrote:
Petr Viktorin wrote:
On 02/28/2012 04:02 AM, Rob Crittenden wrote:
Petr Viktorin wrote:
On 02/27/2012 05:10 PM, Rob Crittenden wrote:
Rob Crittenden wrote:
Simo Sorce wrote:
On Mon, 2012-02-27 at 09:44 -0500, Rob Crittenden wrote:
We are pretty trusting that the data coming out of LDAP matches its
schema, but it is possible to stuff non-printable characters into an
attribute.

I've added a sanity checker to keep a value as a python str
(treated as binary internally). This will result in a base64
blob being returned to the client.

I don't like the idea of having arbitrary binary data where unicode
strings are expected. It might cause some unexpected errors (I have a
feeling that --addattr and/or --delattr and possibly some plugins might
not handle this very well). Wouldn't it be better to just throw away the
value if it's invalid and warn the user?

This isn't for user input, it is for data stored in LDAP. Users are
going to have no way to provide binary data to us unless they use the
API themselves, in which case they have to follow our rules.

Well my point was that --addattr and --delattr cause an LDAP search for
the given attribute and plugins might get the result of a LDAP search in
their post_callback and I'm not sure if they can cope with binary data.

It wouldn't be any different than if we had the value as a unicode.

Let's see what happens if the mail attribute of a user contains invalid
UTF-8 (ff 62 30 72 6b 65 64):

$ ipa user-find jdoe
1 user matched
User login: jdoe
First name: John
Last name: Doe
Home directory: /home/jdoe
Login shell: /bin/sh
Email address: /2IwcmtlZA==
UID: 526
GID: 526
Account disabled: False
Password: False
Kerberos keys available: False
Number of entries returned 1
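
(Incidentally, the garbled Email address shown above is exactly the base64 form of those raw bytes; a quick Python 3 sketch, with bytes standing in for Python 2's str, reproduces it:)

```python
import base64

# Raw mail attribute value from the example: ff 62 30 72 6b 65 64
raw = bytes([0xff, 0x62, 0x30, 0x72, 0x6b, 0x65, 0x64])

# Binary values are base64-encoded for transport, and that encoded
# form is what the CLI ends up printing.
print(base64.b64encode(raw).decode('ascii'))  # -> /2IwcmtlZA==
```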

$ ipa user-mod jdoe --addattr mail=j...@example.com
ipa: ERROR: an internal error has occurred

The internal error is:
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ipaserver/rpcserver.py", line
302, in wsgi_execute
result = self.Command[name](*args, **options)
File "/usr/lib/python2.7/site-packages/ipalib/frontend.py", line 438, in
ret = self.run(*args, **options)
File "/usr/lib/python2.7/site-packages/ipalib/frontend.py", line 696, in
return self.execute(*args, **options)
File "/usr/lib/python2.7/site-packages/ipalib/plugins/baseldap.py", line
1217, in execute
ldap, dn, entry_attrs, attrs_list, *keys, **options
File "/usr/lib/python2.7/site-packages/ipalib/plugins/user.py", line
532, in pre_callback
entry_attrs['mail'] = self.obj._normalize_email(entry_attrs['mail'])
File "/usr/lib/python2.7/site-packages/ipalib/plugins/user.py", line
338, in _normalize_email
norm_email.append(m + u'@' + config['ipadefaultemaildomain'][0])
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0:
invalid start byte

$ ipa user-mod jdoe --delattr mail=/2IwcmtlZA==
ipa: ERROR: mail does not contain '/2IwcmtlZA=='

$ ipa user-mod jdoe --delattr mail=`echo 'ff 62 30 72 6b 65 64' | xxd -r -p`
ipa: ERROR: UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in
position 5: invalid start byte
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1242, in run
File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1024, in run
kw = self.parse(cmd, argv)
File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1049, in parse
return dict(self.parse_iter(cmd, kw))
File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 1058, in
yield (key, self.Backend.textui.decode(value))
File "/usr/lib/python2.7/site-packages/ipalib/cli.py", line 136, in decode
return value.decode(encoding)
File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 5:
invalid start byte
ipa: ERROR: an internal error has occurred

I'm sure there are a lot more places in the code where things will break
when you feed them arbitrary data.
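
(The internal errors above all come down to the same root cause; a minimal Python 3 sketch, with bytes standing in for Python 2's str:)

```python
# The stored value is not valid UTF-8, so any code path that
# implicitly decodes it raises UnicodeDecodeError.
raw = b'\xffb0rked'  # the bytes ff 62 30 72 6b 65 64

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError as e:
    text = None
    print(e)  # 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
```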

We treat the python type str as binary data. Anything that is a str gets
base64 encoded before json or xml-rpc transmission.

The type unicode is considered a "string" and goes in the clear.

We determine what this type should be from the schema, not from the
data. This is a big assumption. Hopefully this answers Petr's point
as well.

We decided long ago that str means Binary and unicode means String. It
is a bit clumsy perhaps, but python handles it well. It will be clearer
when we switch to Python 3.0 and have bytes and str instead.
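
(In Python 3 terms the convention can be sketched like this; a hypothetical illustration only, where bytes plays the role of Python 2's str, and the `__base64__` wrapper is made up here, not the real wire format, which depends on the transport:)

```python
import base64

def encode_for_transport(value):
    """Sketch of the convention: bytes (binary) is base64-wrapped,
    str (textual) is passed through in the clear."""
    if isinstance(value, bytes):
        return {'__base64__': base64.b64encode(value).decode('ascii')}
    if isinstance(value, str):
        return value
    raise TypeError('unsupported type: %r' % type(value))

print(encode_for_transport(b'\xffb0rked'))       # {'__base64__': '/2IwcmtlZA=='}
print(encode_for_transport('jdoe@example.com'))  # jdoe@example.com
```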

Well, this is all super-obvious and I'm not really sure why you bring
it up here...

My concern is that with your approach, you can no longer be sure that an
attribute value is a valid unicode string. This creates a
maintainability burden, as you *always* have to keep in mind that you
first have to check whether the value is valid or not before touching
it. You can easily forget to include the check in new code, and there
are a lot of places in old code that need to be changed because of this (as
you can see in the mail example above).

Let me quote John Dennis:

I'm very interested in this work. I too have recognized that we have a
very real structural problem with how we handle the types we use
internally, especially when it corresponds to an external type and
requires conversion. In fact we do a remarkably bad job in this area, we
have conversions which are done ad hoc all over the place. There is no
well established place for the conversions to occur and it's difficult
to know at any point in the code what type to expect.

Your patch adds yet another occurrence of "it's difficult to know at any
point in the code what type to expect". IMO this is very bad and we
should not do this at all in new code.

I'm not sure if just dropping bad values (as I have suggested before) is
the best idea, but IMHO anything is better than letting these bad values
into the bowels of IPA.

We are trusting that the data in LDAP matches its schema. This is just
belt and suspenders verifying that it is the case.

Sure, but I still think we should allow any valid unicode data to come
from LDAP, not just what is valid in XML-RPC.

This won't affect data coming out of LDAP, only the way it is
transmitted to the client.

It will affect the data flowing through IPA, after we get it from LDAP,
before it is transmitted through XML-RPC.

In other words, I think it is more correct to do this:

LDAP -> check unicode validity -> (rest of IPA) -> check XML-RPC
validity -> XML-RPC

than this:

LDAP -> check unicode validity -> check XML-RPC validity -> (rest of
IPA) -> XML-RPC

(Let's keep things isolated please.)



I'm using a much narrower scope. I'm not trying to make it easy to manage non-printable characters, just not blow things up if they exist. Limiting to the XML-RPC supported set is for convenience; these are unprintable characters in any context. This is just a two-fer.
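
The check in the patch below boils down to this (a standalone Python 3 sketch of the same regular expression):

```python
import re

# Inverse of the XML-RPC allowed character set:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
non_printable = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')

def contains_non_printable(val):
    """Return True if val contains a character XML-RPC cannot carry."""
    return non_printable.search(val) is not None

print(contains_non_printable('pre \x00 post'))  # True
print(contains_non_printable('plain text'))     # False
```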

Petr was right, I need to decode to unicode before doing this comparison in order to catch invalid unicode characters, so I moved that first.

I added a very basic unit test.

If you're wondering where this data might come from, two ways are via AD sync and migration.

Yes, the user will be left in a situation where they'll need to use --setattr or ldapmodify to manage the data in the field. The UI doesn't show the value at all but instead shows [object Object] (no errors in console, not sure where this is coming from). It is possible to highlight this and insert a new value though.

From 7838e68226584b050d55d8d2f5983f427ea5639b Mon Sep 17 00:00:00 2001
From: Rob Crittenden <rcrit...@redhat.com>
Date: Wed, 28 Mar 2012 18:08:54 -0400
Subject: [PATCH] Detect non-printable characters in strings when decoding.

We use the LDAP schema to decide whether a value should be treated
as binary or not. This doesn't account for a user that somehow manages
to get binary data stuffed into a non-binary attribute though.

This has the potential to break either XML-RPC or the client trying to
display binary data as a string.

Internally anything that is a python str type is considered binary and
unicode is considered a string.

 ipalib/encoder.py                 |   33 ++++++++++++++++++++++++++++++---
 tests/test_ipalib/test_encoder.py |   11 +++++++++++
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/ipalib/encoder.py b/ipalib/encoder.py
index 8d59bd3161466e5d96ff03894a9b43871b8bb19e..17799529e389fbbee0745572c856fc0f799c1139 100644
--- a/ipalib/encoder.py
+++ b/ipalib/encoder.py
@@ -21,6 +21,11 @@ Encoding capabilities.
 from decimal import Decimal
+import re
+from ipapython.ipa_log_manager import *
+# Declaring globally so we only have to compile this once
+non_printable = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
 class EncoderSettings(object):
@@ -65,6 +70,20 @@ class Encoder(object):
             return val
         return self.decode(val)
+    def contains_non_printable(self, val):
+        """
+        The XML-RPC spec defines the following characters as allowed:
+         #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
+        The regular expression for the inverse of this is smaller so use that
+        to find those not allowed.
+        Returns True if any invalid characters are found, False if the string
+        is ok.
+        """
+        invalid = re.search(non_printable, val)
+        return invalid is not None
     def encode(self, var):
         Encode any python built-in python type variable into `self.encode_to`.
@@ -130,9 +149,17 @@ class Encoder(object):
         if isinstance(var, unicode):
             return var
         elif isinstance(var, str):
-            return self.encoder_settings.decode_postprocessor(
-                var.decode(self.encoder_settings.decode_from)
-            )
+            try:
+                val = self.encoder_settings.decode_postprocessor(
+                    var.decode(self.encoder_settings.decode_from)
+                )
+                if self.contains_non_printable(val):
+                    # return original value if not printable
+                    return var
+                return val
+            except UnicodeDecodeError, e:
+                root_logger.error('Error decoding Unicode string %s: %s' % (var, str(e)))
+                return var
         elif isinstance(var, (bool, float, Decimal, int, long)):
             return var
         elif isinstance(var, list):
diff --git a/tests/test_ipalib/test_encoder.py b/tests/test_ipalib/test_encoder.py
index c43cea212571770a45661ef7d3af7b67d90143f7..78e41315aaf1f9ca93fad6b3b5d06b7fa1de4b9e 100644
--- a/tests/test_ipalib/test_encoder.py
+++ b/tests/test_ipalib/test_encoder.py
@@ -147,3 +147,14 @@ class test_Encoder(ClassChecker):
         o.encoder_settings.decode_none = True
         assert_equal(o.decode(None), str(None).decode(decode_from))
+    def test_non_printable(self):
+        """
+        Test the `ipalib.encoder.Encoder.contains_non_printable` method.
+        """
+        o = self.cls()
+        for bad in [u'\x00', u'\x00 post', u'pre\x0b', u'pre \x00 post',
+                    u'pre \x0e post', u'pre \uFFFE post', u'pre \x1b post']:
+            assert(o.contains_non_printable(bad))
+        for good in [u'\x32 post', u'pre\x0a', u'plain', u'pre \uFFFD post']:
            assert o.contains_non_printable(good) is False
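
Put together, the patched decode path behaves roughly like this (a Python 3 paraphrase for illustration, not the actual ipalib code):

```python
import re

non_printable = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]')

def decode_value(var, encoding='utf-8'):
    """Decode bytes to text, but fall back to the raw bytes when the
    value is not valid UTF-8 or contains non-printable characters;
    callers then treat it as binary and base64-encode it for transport."""
    try:
        val = var.decode(encoding)
    except UnicodeDecodeError:
        return var  # undecodable: keep as binary
    if non_printable.search(val):
        return var  # unprintable: keep as binary
    return val

print(type(decode_value(b'jdoe')).__name__)        # str
print(type(decode_value(b'\xffb0rked')).__name__)  # bytes
print(type(decode_value(b'a \x00 b')).__name__)    # bytes
```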

Freeipa-devel mailing list
