On 2016-10-17, eryk sun wrote:

> On Mon, Oct 17, 2016 at 2:20 PM, Adam Funk <a24...@ducksburg.com> wrote:
>> I'm using IDLE 3 (with python 3.5.2) to work interactively with
>> Twitter data, which of course contains emojis.  Whenever the running
>> program tries to print the text of a tweet with an emoji, it barfs
>> this & stops running:
>>
>>   UnicodeEncodeError: 'UCS-2' codec can't encode characters in
>>   position 102-102: Non-BMP character not supported in Tk
>>
>> Is there any way to set IDLE to ignore these characters (either drop
>> them or replace them with something else) instead of throwing the
>> exception?
>>
>> If not, what's the best way to strip them out of the string before
>> printing?
>
> You can patch print() to transcode non-BMP characters as surrogate
> pairs. For example:
>
>     import builtins
>
>     def print_ucs2(*args, print=builtins.print, **kwds):
>         args2 = []
>         for a in args:
>             a = str(a)
>             if max(a) > '\uffff':
>                 b = a.encode('utf-16le', 'surrogatepass')
>                 chars = [b[i:i+2].decode('utf-16le', 'surrogatepass')
>                          for i in range(0, len(b), 2)]
>                 a = ''.join(chars)
>             args2.append(a)
>         print(*args2, **kwds)
>
>     builtins._print = builtins.print
>     builtins.print = print_ucs2
>
> On Windows this should allow printing non-BMP characters such as
> emojis (e.g. U+0001F44C). On Linux it prints a non-BMP character as a
> pair of empty boxes. If you're not using Windows you can modify this
> to print something else for non-BMP characters, such as a replacement
> character or \U literals.
Clever, thanks.  (I'm actually using Linux.)

-- 
Consistently separating words by spaces became a general custom about
the tenth century A. D., and lasted until about 1957, when FORTRAN
abandoned the practice.            --- Sun FORTRAN Reference Manual
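
For the Linux case, the "print something else" variant mentioned in the
quoted reply might look roughly like the sketch below. It is only an
illustration, not the code from the thread: the names replace_non_bmp and
print_bmp and the mode parameter are made up here. Each character above
U+FFFF is either replaced with U+FFFD, written as a \U literal, or
dropped. Iterating character by character also copes with empty strings,
where max(a) in the quoted version would raise ValueError.

```python
import builtins


def replace_non_bmp(text, mode="replace"):
    """Return text with every character above U+FFFF substituted.

    mode is a hypothetical switch for this sketch:
      "replace" -> U+FFFD REPLACEMENT CHARACTER
      "escape"  -> a \\UXXXXXXXX literal
      "drop"    -> nothing at all
    """
    out = []
    for ch in text:
        if ord(ch) <= 0xFFFF:
            out.append(ch)              # BMP character: Tk can display it
        elif mode == "escape":
            out.append("\\U{:08x}".format(ord(ch)))
        elif mode == "drop":
            continue
        else:
            out.append("\ufffd")
    return "".join(out)


def print_bmp(*args, print=builtins.print, mode="replace", **kwds):
    """print() wrapper that sanitises each argument before printing."""
    print(*(replace_non_bmp(str(a), mode) for a in args), **kwds)


# To install it for an interactive session (optional, as in the quoted post):
#     builtins._print = builtins.print
#     builtins.print = print_bmp
```

Calling print_bmp(tweet_text, mode="escape") keeps the code point visible
in the output, which can be handy when you still want to know which emoji
was there.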