[habari-dev] A Brief History of Habari Databases and Their Character Sets

Geoffrey Sneddon Mon, 15 Dec 2008 11:04:48 -0800

Hi,

Following last week's discussions about character sets and arthus's
install breaking, it's probably time to look back at all the various
states Habari MySQL databases might be in before we try to write
anything to fix them.


Now, on to the history:

The beginning:

The Habari tables and the database connection followed whatever the
default of the database was. We (naïvely) assumed everything that we
received was UTF-8. This meant that, to function correctly, the
character set had to be either UTF-8 or an SBCS (single-byte character
set; i.e., every character is represented by a single byte; e.g., all
ISO-8859 character sets), in which UTF-8 could be stored as binary data.
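
To illustrate why an SBCS "worked" at this stage, here is a minimal Python sketch. It is an illustration only, not Habari code: Python's `latin-1` codec stands in for the database's single-byte character set (MySQL's own `latin1` is not byte-for-byte identical to ISO-8859-1), and the function names are hypothetical. The point is that when the connection and the column agree on an SBCS, every byte of the UTF-8 input survives unchanged, even though the database believes it is holding different characters.

```python
# Hypothetical illustration: Python's "latin-1" codec stands in for a
# single-byte database character set. None of these names exist in Habari.

def store_as_sbcs(utf8_bytes: bytes) -> str:
    """What the column 'sees': each incoming byte becomes one SBCS character."""
    return utf8_bytes.decode("latin-1")

def fetch_from_sbcs(stored: str) -> bytes:
    """A connection in the same SBCS hands the bytes back untouched."""
    return stored.encode("latin-1")

original = "naïve café"
utf8_bytes = original.encode("utf-8")

stored = store_as_sbcs(utf8_bytes)
round_tripped = fetch_from_sbcs(stored).decode("utf-8")

# The database holds one "character" per byte, so it sees more characters
# than the author typed, but the application gets its UTF-8 back intact.
print(len(original), len(stored), round_tripped == original)
```

The stored value is wrong from the database's point of view (sorting, length, and searching operate on the wrong characters), but the application never notices as long as nothing transcodes the bytes.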

r1377:

This changed how we interact with the database: the connection now
calls `SET NAMES utf8;`. This broke every blog except those that were
already using UTF-8 or that used only characters in the intersection
between the database character set and UTF-8.

The database could then be in three states:
- UTF-8,
- Only characters used in the intersection between the database  
character set and UTF-8 (normally ASCII only in an ASCII-superset such  
as ISO-8859-1);
- Fresh installs are stored in whatever the default database character  
set is (this could be something completely different like UCS-2 which  
isn't even an ASCII-superset).

Regardless of what the content is stored as in the database, it is now  
passed to PHP from MySQL as UTF-8.
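
The breakage described for r1377 can be sketched in Python. This is a simplified model under stated assumptions, not the actual MySQL or Habari behaviour in every detail: `latin-1` stands in for the column's SBCS, and `fetch_with_set_names_utf8` is a hypothetical helper that mimics the server transcoding from the column character set to UTF-8 before sending data to the client.

```python
# Simplified model of reading a column after `SET NAMES utf8;`.
# The helper name is hypothetical; "latin-1" stands in for the column's SBCS.

def fetch_with_set_names_utf8(stored_bytes: bytes, column_charset: str) -> str:
    # The server interprets the stored bytes according to the column charset...
    text_as_server_sees_it = stored_bytes.decode(column_charset)
    # ...transcodes that text to UTF-8 for the connection...
    wire_bytes = text_as_server_sees_it.encode("utf-8")
    # ...and the application decodes what it receives as UTF-8.
    return wire_bytes.decode("utf-8")

# UTF-8 bytes that were stored as binary data in a latin-1 column:
stored = "café".encode("utf-8")
broken = fetch_with_set_names_utf8(stored, "latin-1")

# Content restricted to the intersection (ASCII here) is unaffected:
ascii_only = fetch_with_set_names_utf8("hello".encode("utf-8"), "latin-1")

# Content in a table that really is UTF-8 is also unaffected:
real_utf8 = fetch_with_set_names_utf8("café".encode("utf-8"), "utf-8")

print(repr(broken), repr(ascii_only), repr(real_utf8))
```

In this model only the first case is damaged: each byte of the multi-byte UTF-8 sequence is treated as a separate SBCS character and then re-encoded, which is why blogs outside the two "safe" states broke.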

r1530:

This converted all installs to UTF-8 tables, and in the process broke
every install except those that already used UTF-8 or that used only
the intersection between the character set in the database and UTF-8.

This brought us down to two states:
- UTF-8;
- Fresh installs are stored in whatever the default database character  
set is (this could be something completely different like UCS-2 which  
isn't even an ASCII-superset).

r2909:

This made new installs use UTF-8. This also tried to move all existing  
installs to UTF-8, but failed (see arthus's breakage). This upgrade  
script was the same as in r1530 (this was wrong as we're coming from a  
different state).

This resulted in everything being UTF-8, but broke anything that was
installed between r1530 and r2908 where the default database character
set was not UTF-8 (unless it used only the intersection between the
database character set and UTF-8).

r2927:

This replaced the upgrade script added in r2909. This should be the  
upgrade script we want.

This brought us down to knowing the database is UTF-8.

r2932:

This reverted r2927. Both Matthias and I thought the patch was
wrong, as the linked IRC discussion shows. This brings us back to the
same undesirable state that r2909 left us in.

This brings us to the present.


Now, to get us out of this hole, the upgrade script in r2927 should be
re-added and the r2909 one removed. Matt and I were wrong because
we did not realize that the r1530 upgrade script would prevent UTF-8
stored in an SBCS from ever reaching this upgrade script. If anyone
thinks this is wrong, please do say.


--
Geoffrey Sneddon
<http://gsnedders.com/>


