I am trying to process a CSV file using Python 3.5 (CPython tip as of a week or so ago). According to chardet[1], the file is encoded as utf-8:
    >>> s = open("data/meets-usms.csv", "rb").read()
    >>> len(s)
    562272
    >>> import chardet
    >>> chardet.detect(s)
    {'encoding': 'utf-8', 'confidence': 0.99}

so I created the reader like so:

    rdr = csv.DictReader(open(csvfile, encoding="utf-8"))

This seems to work. The rows are read and records added to a SQLite3 database. When I go into sqlite3, I get what looks to be raw utf-8 on output:

    % LANG=en_US.UTF-8 sqlite3 topten.db
    SQLite version 3.8.5 2014-08-15 22:37:57
    Enter ".help" for usage hints.
    sqlite> select * from swimmeet where meetname like '%Barracuda%';
    sqlite> select count(*) from swimmeet;
    0
    sqlite> select count(*) from swimmeet;
    4171
    sqlite> select meetname from swimmeet where meetname like '%Barracuda%Patrick%';
    Anderson Barracudas St. Patrick's Day Swim Meet
    Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
    Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
    Anderson Barracuda Masters St. Patrick's Day Meet
    Anderson Barracuda Masters St. Patrick's Day Meet 2014
    Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet

Note the wacky three bytes where the apostrophe in "St. Patrick's" should be. The data came to me as an XLSX spreadsheet, which I dumped to CSV using LibreOffice. That's how the character was encoded at that point.

I tweaked my CSV-to-SQLite script to print the meet name and id for those meets with "Barracuda" and "Patrick" in their name:

    if dry_run or verbose:
        if ("Barracuda" in row["MeetTitle"] and "Patrick" in row["MeetTitle"]):
            print("Insert", n, row["MeetTitle"], row["MeetID"])

When I run it, I see raw bytes instead of a properly rendered apostrophe:

    % LANG=en_US.utf-8 python3.5 src/usmsmeets2db.py -v data/meets-usms.csv topten.db
    Insert 1173 Anderson Barracudas St. Patrick's Day Swim Meet 20090321ABMSTPY
    Insert 1559 Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet 20100320CUDASY
    Insert 1995 Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet 20110319ANDERSY
    Insert 3012 Anderson Barracuda Masters St. Patrick's Day Meet 20130316AndersY
    Insert 3562 Anderson Barracuda Masters St. Patrick's Day Meet 2014 20140315ANDERSY
    Insert 4114 Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet 20150321AndersY
    Read 4962 rows, inserted 4171 records

Why am I not seeing what I believe to be a non-ASCII apostrophe of some sort properly printed? This is running on a Mac (Yosemite) in its Terminal app, with its encoding preference set to utf-8. It appears just as shown above, "a" with a caret, the Euro symbol, then the "TM" symbol. Have I perhaps lost the properly encoded bytes somewhere, and now it's just spewing the bogus bytes (mojibake)?

Thanks,

Skip

[1] https://pypi.python.org/pypi/chardet/2.3.0
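
One way to test the mojibake hypothesis is sketched below. It rests on a guess that nothing above confirms: that somewhere before the CSV was written, the UTF-8 bytes of a right single quotation mark (U+2019) were decoded as Windows-1252 (cp1252) and then saved again as UTF-8. If so, the file is valid UTF-8 that happens to contain the three characters "a" with a circumflex, Euro sign, and trademark sign, which would be consistent with both the chardet result and the printed output. The helper names make_mojibake and repair_mojibake are made up for illustration.

```python
def make_mojibake(text):
    """Simulate the suspected damage: UTF-8 bytes misread as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text):
    """Try to undo the suspected damage; return text unchanged on failure.

    Assumption: the damage was exactly one UTF-8 -> cp1252 misdecoding.
    Strings that do not round-trip cleanly are left alone.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# U+2019 is the right single quotation mark; its UTF-8 form is E2 80 99.
damaged = make_mojibake("St. Patrick\u2019s Day")

# ascii() keeps the demo printable regardless of the terminal encoding.
print(ascii(damaged))
print(ascii(repair_mojibake(damaged)))
```

If the three stray characters in a title turn back into a single U+2019 under repair_mojibake, the damage was most likely already present in the spreadsheet or introduced during the CSV export, rather than by the csv module, SQLite, or the terminal.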