On Fri, Jun 24, 2016 at 12:03 PM, Scott Robison <sc...@casaderobison.com> wrote:
> On Windows, when you get a string of characters, you either get an ANSI
> string using some code page, or you get a wide character string.
>
> When you get an ANSI string, it is just a sequence of 8 bit bytes. UTF-8
> is also a sequence of 8 bit bytes. The meaning / encoding of those 8 bit
> bytes are very different.
>
> SQLite will allow you to write any 8 bit byte sequence you want as a
> string. It does not attempt to validate the bytes. It will read the bytes
> back exactly as written. So if you wrote an ANSI string to the database
> instead of a UTF-8 string, you will get back the ANSI string.
>
> This all assumes you're using the UTF-8 functions, which might be more
> accurately described as byte functions. SQLite databases have an encoding.
> They store either UTF-8 text or UTF-16 text. If your database is UTF-8 and
> you use the char/byte based interface, SQLite won't interpret the bytes. If
> your database is UTF-16 and you use the wide character based interface,
> SQLite won't interpret the wide characters. It assumes you've given it
> valid data and will use it as is. This is particularly convenient when
> dealing with variant columns.
>
> If, however, your database is UTF-8 and you use the UTF-16 interface
> functions, SQLite will attempt to convert the data between UTF-8 & UTF-16.
> If your database is UTF-16 and you use the UTF-8 interface functions,
> SQLite will attempt to convert the data. In these cases, it is important to
> have valid UTF-whatever in the database.
>
> It looks to me like, in your case, some program wrote a byte sequence to
> the database that was not UTF-8. You later read that string back out of the
> database, and attempt to convert it to a wstring with your C++ code. The
> byte sequence was not UTF-8, hence the failure.
>
> I seem to recall a recent discussion on the list about the shell and
> console input / output and it not being treated 100% accurately as
> UTF-whatever.
> Library internals are, but the IO layer in the shell, not so
> much.
>
> Thus you cannot depend on the shell to translate non-ASCII characters on
> Windows and write them as UTF-whatever. If using the shell is essential to
> your process, you can't currently get there from here.
>
> Though maybe ... instead of typing ALT+225, try typing ALT+195 ALT+159. In
> your windows console, that would give you the equivalent byte sequence for
> that character, compensating for the fact that SQLite doesn't (I believe)
> transform console input to UTF-8. If I am mistaken on that point, I
> apologize.
>
> If the two alt-code byte sequences create data your C++ code can then
> process (because it's valid UTF-8), you'll know for certain that the SQLite
> shell on Windows does not process UTF-8 for console IO, just internally to
> the database layer.
>

Okay, rather than guessing, I just did a test from a Windows 10 command
prompt. I am getting appropriate UTF-8 sequences. Here is my experiment:

I opened a memory database and issued the following commands:

create table test(a text);
-- for the four values I typed ALT+225, then ALT+223, then ALT+0225,
-- then ALT+0223
insert into test values('ß'),('▀'),('á'),('ß');
select a, hex(a) from test;

Which resulted in four rows of output:

ß|C39F
▀|E29680
á|C3A1
ß|C39F

I'm hoping all these extended characters are handled properly by Gmail and
whatever email program you use.

Windows supports legacy ALT+### codes that map to the legacy code page. It
also supports ALT+0### which map to Unicode code points. This allows people
who are accustomed to the ALT+### format to still see the character they
expect, but translated to the equivalent Unicode code point.

Again, this is with Windows 10. Perhaps you could try a similar sequence to
what I typed above on your SQLite shell and Windows command prompt version
and see what you get back.
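For anyone who wants to repeat the check without depending on console input, here is a minimal sketch in Python. It is not what I ran above; it uses Python's sqlite3 module in place of the shell. It assumes the legacy ALT+### codes are looked up in code page 437 and the ALT+0### codes in Windows-1252, which is typical for a US/Western European Windows setup but may differ on yours. The names ALT_CODES, alt_code_to_char and utf8_hex_via_sqlite are made up for illustration.

```python
import sqlite3

# (label, byte value, code page). The code pages are ASSUMPTIONS:
# cp437 for legacy ALT+### codes, cp1252 for ALT+0### codes.
ALT_CODES = [
    ("ALT+225", 225, "cp437"),
    ("ALT+223", 223, "cp437"),
    ("ALT+0225", 225, "cp1252"),
    ("ALT+0223", 223, "cp1252"),
]


def alt_code_to_char(value, codepage):
    """Decode one byte in the given code page to a Unicode character."""
    return bytes([value]).decode(codepage)


def utf8_hex_via_sqlite(chars):
    """Insert each character into an in-memory table and return (a, hex(a)) rows."""
    con = sqlite3.connect(":memory:")  # new databases default to UTF-8 encoding
    con.execute("create table test(a text)")
    con.executemany("insert into test values(?)", [(c,) for c in chars])
    rows = con.execute("select a, hex(a) from test order by rowid").fetchall()
    con.close()
    return rows


chars = [alt_code_to_char(value, cp) for _, value, cp in ALT_CODES]
rows = utf8_hex_via_sqlite(chars)

# Print in the same a|hex(a) layout the shell uses.
for (label, _, _), (a, h) in zip(ALT_CODES, rows):
    print(f"{label}: {a}|{h}")
```

Because the strings are passed to SQLite as proper Unicode text, hex(a) shows the UTF-8 bytes that end up in the database. If the shell on your machine reports different bytes for the same characters, the difference is coming from the console input path rather than from the database layer.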
--
Scott Robison
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users