On Thu, 12 Sep 2013 16:01:20 -0700, stephen.boulet wrote: > I have an excel file. When I select cells, copy from excel, and then use > win32clipboard to get the contents of the clipboard, I have a 131071 > character string.
How exactly are you using win32clipboard, and what exact result are you getting? Start with a smaller selection of cells, say, two cells, so the amount of data is manageable. I doubt very much you are getting 131071 *characters*. Perhaps you are getting that many *bytes*. Or perhaps you are getting HTML-formatted text. Without seeing what is in the clipboard, who can tell? Remember that the clipboard can contain multiple versions of the same data. > When I save the file as a text file, and use the python 3.3 open command > to read its contents, I only have 80684 characters. How exactly are you using the open function? The result you get may differ drastically depending on what you do. > Excel (and other programs too) appends a bunch of b'\x00' (or similar) > characters. Append to what? The .xls file? The text file? What tool are you using to see this? > Is there a pythonic way to strip these out? It's hard to tell from the limited description, but my guess is that you are misinterpreting what you are seeing. Part of the evidence for this is that you are conflating bytes and characters, e.g. above where you refer to the NUL *byte* b'\x00' as a character. Bytes are not characters. Scrub that out of your brain. They never were, not even back in the old ASCII days (may they soon be forgotten), despite what many people think. My guess is that when you save the spreadsheet as text, either Excel by default, or possibly because you have ticked a checkbox, have set it to save using the UTF-16 encoding. That's a good thing (not ideal, ideally they ought to use UTF-8, but UTF-16 is not a bad choice). But when you open the file in Python, Python defaults to UTF-8, which means you get an bunch of extraneous NULs when opening the file in text mode, or b'\x00' null bytes in binary mode. For example: py> with open('/tmp/rubbish', 'w', encoding='utf-16be') as f: ... f.write('hello world blah blah blah\n') ... 27 py> with open('/tmp/rubbish', 'r') as f: # default encoding is UTF-8 ... f.read() ... '\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00b\x00l \x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00\n' If you look carefully, you will see that every character appears to be preceded by the NUL control character, \x00. But that's because you've used the wrong encoding to decode the bytes in the file back to characters. The right way to do this is: py> with open('/tmp/rubbish', 'r', encoding='utf-16be') as f: ... f.read() ... 'hello world blah blah blah\n' Which encoding should you use? Unfortunately, there is no clean way for text files to record which encoding to use inside the file itself, so it's often impossible to know for sure. This is why everyone should move towards using UTF-8 everywhere. But I digress. The fact that you show the NULs as *bytes* b'\x00' rather than characters '\x00' suggests that you might be reading the file in binary mode. Presumably you did this because you got an error when trying to read it in text mode. If you got this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte then you should go back to text mode, but use 'utf-16' as the codec: open(filename, 'w', encoding='utf-16') Otherwise, if the NULs appear before the letters, as in my example, use 'utf-16be'. If they appear after the letters, use 'utf-16le'. The existence of three oh-so-very-slightly different versions of UTF-16 is why it is a sub-optimal encoding, and it is oh-so-bloody-typical that Microsoft screwed it up for everyone by picking it as their default implementation. If they had sensibly used UTF-8, you wouldn't be having this problem. Of course, it is possible I've misdiagnosed your problem. I've had to guess a lot because you didn't show us what you actually did to get the results you say you got. -- Steven -- https://mail.python.org/mailman/listinfo/python-list