Committed by David Christensen <[email protected]>
Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM
---
TODO.utf8 | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 161 insertions(+), 0 deletions(-)
diff --git a/TODO.utf8 b/TODO.utf8
new file mode 100644
index 0000000..5260bac
--- /dev/null
+++ b/TODO.utf8
@@ -0,0 +1,161 @@
+Summary of design changes from discussions with GSM and DWC re: utf-8 in
+DBD::Pg
+================================================================================
+
+Behavior of the pg_unicode/pg_utf8_strings connection attribute
+---------------------------------------------------------------
+We will use a connect attribute (enabled by default) to control
+whether we issue an immediate SET client_encoding at connect time.
+The current name for this is "pg_utf8_strings", but DWC prefers
+something non-encoding-specific; examples wanted, but "pg_unicode" or
+"pg_internal" seem best.
+
+If the "pg_internal" attribute is explicitly provided in the DBI
+connect attributes it will be one of (0, 1), to enable/disable the
+pg_internal behavior explicitly. If not provided, we check the
+initial "server_encoding" and "client_encoding" settings.
+
+The logic for setting "pg_internal" when unspecified is:
+
+ - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
+
+ - If "client_encoding" <> "server_encoding", or perhaps better yet if
+ the pg_setting("client_encoding") returns a different value than
+ the default version for that setting, then we assuming that the
+ client encoding choice is *explicit* and the user will be wanting
+ to get raw octets back from DBI, thus set pg_internal to 0.
+
+ - Otherwise set pg_internal to 1.
+
+Immediately after connection initialization completes, we check
+whether the pg_internal flag ended up set; if so, we issue a "SET
+client_encoding TO 'utf-8'" and commit.
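+
+A rough C-level sketch of that decision, using the parameters libpq
+reports (pg_internal_default is our name, not existing DBD::Pg code;
+it implements the simpler encoding comparison, not the pg_settings
+default lookup):
+
+    #include <libpq-fe.h>
+    #include <string.h>
+
+    /* Hypothetical helper: pick the default for pg_internal when the
+     * user did not pass it in the DBI connect attributes. */
+    static int
+    pg_internal_default(PGconn *conn)
+    {
+        const char *server_enc = PQparameterStatus(conn, "server_encoding");
+        const char *client_enc = PQparameterStatus(conn, "client_encoding");
+
+        /* SQL_ASCII makes no encoding promises; hand back raw octets. */
+        if (server_enc && strcmp(server_enc, "SQL_ASCII") == 0)
+            return 0;
+
+        /* A client_encoding differing from the server's looks like an
+         * explicit user choice, so stay out of the way. */
+        if (server_enc && client_enc && strcmp(server_enc, client_enc) != 0)
+            return 0;
+
+        return 1;
+    }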
+
+
+Proposal for an "encoding" DBD attribute interface
+--------------------------------------------------
+
+DWC suggested a DBD::db handle attribute, tentatively called
+"encoding", which when set would effectively pass through to the
+underlying "SET client_encoding = $blah" and *disable* the
+pg_internal flag. Specifically, by setting the encoding attribute,
+you are indicating that you want the data from PostgreSQL back as raw
+octets in that encoding.
+
+If such a mechanism *were* instituted, we could utilize `pg_encoding =>
+'blah'` as the connection-level attribute and tie the underlying
+implementation of the pg_internal mechanism to it, by having a
+keyword ('internal') as the special-case encoding, which could be
+enabled/disabled via $dbh->{pg_encoding} = 'internal';
+
+This would allow us to pass utf-8 through *without* setting the SvUTF8
+flag, by setting $dbh->{pg_encoding} = 'utf-8'.
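+
+A sketch of how that attribute could map onto libpq at the C level
+(set_pg_encoding and pg_internal_flag are hypothetical names):
+
+    #include <libpq-fe.h>
+    #include <string.h>
+
+    static int pg_internal_flag;   /* hypothetical per-handle flag */
+
+    static int
+    set_pg_encoding(PGconn *conn, const char *value)
+    {
+        if (strcmp(value, "internal") == 0) {
+            /* the special-case keyword: force UTF-8 and keep the
+             * SvUTF8-setting behavior enabled */
+            pg_internal_flag = 1;
+            return PQsetClientEncoding(conn, "UTF8");
+        }
+        /* any explicit encoding disables pg_internal: the user gets
+         * raw octets back and SvUTF8 stays off */
+        pg_internal_flag = 0;
+        return PQsetClientEncoding(conn, value);
+    }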
+
+
+Behavior changes if pg_internal is set
+--------------------------------------
+
+There are two distinct changes that need to take place: one on the
+output side (result sets) and one on the input side (data from the
+user).
+
+When processing the result sets returned by the server, if pg_internal
+is set, we can either take it on faith that "client_encoding" is still
+the UTF-8 we set at connection time, or verify that libpq's notion of
+the result set's charset/encoding is still UTF-8. I believe this is
+available as an int (libpq's PQclientEncoding), which could be cached
+when we do the original "SET client_encoding" and/or initial setup
+tests; that should prevent accidental corruption. (A sketch follows
+this list.)
+
+ - if we decide to go this route and detect a charset change, we can
+   issue a notice/warning from DBD::Pg that the client_encoding has
+   changed, and then turn off the pg_internal flag.
+
+ - if everything checks out, we use the usual dequote_* methods and
+   set the SvUTF8 flag either on all text-based datums, or only on
+   those that actually contain hi-bit (non-ASCII) bytes.
+
+ - a possible option to benchmark would be to directly use the
+   "utf8::upgrade" method from the perl internals (or some SV-creation
+   method based on a (char*)) to take advantage of any perl-specific
+   enhancements already in place. This may be just as fast, since perl
+   already needs to copy the (char*) contents into the SV, and may
+   already have fast-tracked code paths for this type of operation,
+   given that we know the data will be valid UTF-8.
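+
+A sketch of the cached-int check plus the output-side flag setting
+(force_utf8, decode_field, and utf8_enc_id are our names, not existing
+DBD::Pg internals; error handling is elided):
+
+    #include "EXTERN.h"
+    #include "perl.h"
+    #include <libpq-fe.h>
+
+    static int utf8_enc_id = -1;   /* the cached encoding ID */
+
+    static void
+    force_utf8(PGconn *conn)
+    {
+        PGresult *res = PQexec(conn, "SET client_encoding TO 'UTF8'");
+        PQclear(res);
+        utf8_enc_id = PQclientEncoding(conn);
+    }
+
+    static SV *
+    decode_field(pTHX_ PGconn *conn, const char *val, STRLEN len,
+                 int *pg_internal)
+    {
+        SV *sv = newSVpvn(val, len);
+
+        if (*pg_internal && PQclientEncoding(conn) != utf8_enc_id) {
+            /* someone changed client_encoding behind our back */
+            warn("DBD::Pg: client_encoding changed; disabling pg_internal");
+            *pg_internal = 0;
+        }
+        if (*pg_internal)
+            SvUTF8_on(sv);   /* or only when a hi-bit byte was seen */
+        return sv;
+    }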
+
+When processing data coming *in* from the user, i.e., an (SV*), we
+consider the following (a sketch follows this list):
+
+ - if pg_internal is 0, pass through the normal methods unchanged.
+
+ - if pg_internal is 1 and the incoming SV's UTF8 flag is 1, we do
+   nothing; the underlying (char*) will already contain UTF-8 data.
+
+ - if pg_internal is 1 and the incoming SV's UTF8 flag is 0, we need
+   special consideration for hi-bit characters; since we've
+   effectively co-opted the expected client_encoding and forced UTF8,
+   we need to treat the raw data as octets. We have two choices:
+
+ - treat as latin-1/perl raw. This may be a good default choice,
+ but I'm not 100% convinced; in any case we would need to
+ convert from raw to utf-8 using utf8::upgrade.
+
+     - treat as original client_encoding. This may match the user's
+       expectations most closely, but requires us to either:
+
+       a) switch client_encoding back to the original value for the
+         query, while somehow still retaining the utf-8 client
+         encoding for result-set retrieval, or,
+
+       b) actually use Encode to transcode from the original
+         client_encoding to UTF-8. I think GSM is particularly
+         against bringing Encode into the picture, due to the
+         additional complexity.
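+
+A sketch of the input-side decision in perl-API terms, taking the
+latin-1/raw option for the hi-bit case (param_octets is our name, not
+an existing DBD::Pg function):
+
+    #include "EXTERN.h"
+    #include "perl.h"
+
+    static char *
+    param_octets(pTHX_ SV *sv, STRLEN *lenp, int pg_internal)
+    {
+        /* pg_internal off, or UTF8 flag already on: pass the bytes
+         * through unchanged */
+        if (!pg_internal || SvUTF8(sv))
+            return SvPV(sv, *lenp);
+
+        /* UTF8 flag off under pg_internal: treat the bytes as
+         * latin-1/raw and upgrade so the server sees valid UTF-8.
+         * This mutates the SV in place; a defensive sv_mortalcopy()
+         * may be wanted for read-only values. */
+        sv_utf8_upgrade(sv);
+        return SvPV(sv, *lenp);
+    }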
+
+
+Implementation considerations/ideas
+-----------------------------------
+
+DWC feels strongly that we should avoid setting the SvUTF8 flag on any
+retrieved/created SV which does not require it; as such, an operation
+that can quickly check whether there are any hi-bit characters in a
+given (char*) would need to be weighed against the possible
+inconvenience of *always* setting the SvUTF8 flag on eligible strings,
+regardless of whether they are pure ASCII.
+
+Considering that we already utilize strlen(), which traverses the
+entire string, even a naïve replacement that simply counts the hi-bit
+chars encountered while traversing for that length may have low
+enough overhead that detecting this situation is not an undue burden.
+
+We can also take advantage of two algorithmic enhancements if we know
+two things:
+
+- firstly, if we somehow already have the length of the initial
+  string/structure (via the libpq structures), we can unroll the
+  hi-bit detection into the largest-supported unit-sized words; i.e.,
+  on a 64-bit machine we could check 8 bytes at a time for the
+  presence of high bits, simply by testing the (char*) contents
+  against the mask 0x8080808080808080; if that result is non-zero
+  then we have at least one high-bit character in the batch. (We can
+  do compile-time checks to determine the largest word size and have
+  different versions of the loop depending on said determination.)
+  We would look at the (length % word size) and utilize a different
+  mask (presumably from a local static LUT) on the remainder for the
+  final detection. This makes the number of operations in the
+  worst-case scenario O(n/wordsize + 1).
+
+- if we don't care about the *total* number of hi-bit chars (which we
+  may or may not), we can short-circuit the custom strlen at the
+  first hi-bit and return (current_length) + the system
+  strlen(currentp); i.e., this would detect cases where a hi-bit is
+  present while falling back to the (presumed) faster system-level
+  strlen on the rest of the buffer.
+
+We will look at updating/benchmarking various implementations of a
+combined strlen()/has_high_bit function to use in determining the
+auto-upgrade-to-utf8 behavior. This will allow us to skip an SV copy
+for data that is potentially modified in place.
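+
+A starting point for that benchmarking (strlen_hibit and
+has_hibit_len are our names; the tail here uses a plain byte loop
+rather than the masked LUT described above):
+
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    #define HIBIT_MASK ((uint64_t)0x8080808080808080ULL)
+
+    /* word-at-a-time scan for when the length is already known,
+     * e.g. from libpq's PQgetlength() */
+    static int
+    has_hibit_len(const char *s, size_t len)
+    {
+        size_t i = 0;
+        for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
+            uint64_t w;
+            memcpy(&w, s + i, sizeof w);   /* portable unaligned load */
+            if (w & HIBIT_MASK)
+                return 1;
+        }
+        for (; i < len; i++)               /* remainder, byte by byte */
+            if ((unsigned char)s[i] & 0x80)
+                return 1;
+        return 0;
+    }
+
+    /* short-circuiting combined strlen/hi-bit detector */
+    static size_t
+    strlen_hibit(const char *s, int *has_hibit)
+    {
+        const char *p = s;
+        *has_hibit = 0;
+        for (; *p; p++) {
+            if ((unsigned char)*p & 0x80) {
+                *has_hibit = 1;
+                /* hand the rest off to the (presumably faster) libc */
+                return (size_t)(p - s) + strlen(p);
+            }
+        }
+        return (size_t)(p - s);
+    }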
+
+Which is better here depends on what is more expensive: this hi-bit
+check, or the unavoidable copy of the SV data in the case where we
+have hi-bit data that needs to be utf8::upgrade'd.
+
--
1.7.0.5