On Apr 1, 5:15 pm, Roy Smith <r...@panix.com> wrote:
> In article <515941d8$0$29967$c3e8da3$54964...@news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>
> > [...]
> > >> OK, that leads to the next question. Is there anyway I can (in Python
> > >> 2.7) detect when a string is not entirely in the BMP? If I could find
> > >> all the non-BMP characters, I could replace them with U+FFFD
> > >> (REPLACEMENT CHARACTER) and life would be good (enough).
> >
> > Of course you can do this, but you should not. If your input data
> > includes character C, you should deal with character C and not just throw
> > it away unnecessarily. That would be rude, and in Python 3.3 it should be
> > unnecessary.
>
> The import job isn't done yet, but so far we've processed 116 million
> records and had to clean up four of them. I can live with that.
> Sometimes practicality trumps correctness.

That works out to roughly 0.000003% of the records (4 out of 116 million). Of course, I assume it is US-only data. Still, it's good to know how skewed the distribution is.
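
For anyone who does want the cleanup described in the quoted question, here is a rough sketch of detecting non-BMP characters and replacing them with U+FFFD. The function names has_non_bmp and replace_non_bmp are made up for illustration. It assumes Python 3.3+ or a "wide" Python 2.7 build, where every character above U+FFFF is a single code point; on a narrow 2.7 build such characters show up as surrogate pairs, and the pattern would likely need to match those pairs instead.

```python
import re

# Character class covering every code point outside the BMP (above U+FFFF).
# Assumption: strings hold whole code points (Python 3.3+ or a wide 2.7 build).
_NON_BMP = re.compile(u"[\U00010000-\U0010FFFF]")


def has_non_bmp(text):
    """Return True if text contains at least one non-BMP character."""
    return _NON_BMP.search(text) is not None


def replace_non_bmp(text, replacement=u"\uFFFD"):
    """Return text with each non-BMP character swapped for replacement."""
    return _NON_BMP.sub(replacement, text)
```

Checking sys.maxunicode (0xFFFF on a narrow build, 0x10FFFF otherwise) is one way to find out which case applies before relying on this.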