Committed by David Christensen <[email protected]>
Subject: [DBD::Pg 2/2] Commit UTF-8 design notes/discussion between DWC/GSM
---
TODO.utf8 | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 161 insertions(+), 0 deletions(-)
diff --git a/TODO.utf8 b/TODO.utf8
new file mode 100644
index 0000000..5260bac
--- /dev/null
+++ b/TODO.utf8
@@ -0,0 +1,161 @@
+Summary of design changes from discussions with GSM and DWC re: utf-8 in
+DBD::Pg
+================================================================================
+
+Behavior of the pg_unicode/pg_utf8_strings connection attribute
+---------------------------------------------------------------
+We will use a connect attribute (enabled by default) to control
+whether we issue an immediate SET client_encoding at connect time.
+The current name for this is "pg_utf8_strings", but DWC prefers
+something non-encoding-specific; examples wanted, but "pg_unicode" or
+"pg_internal" seem best.
+
+If the "pg_internal" attribute is explicitly provided in the DBI
+connect attributes it will be one of (0, 1), to enable/disable the
+pg_internal behavior explicitly. If not provided, we check the
+initial "server_encoding" and "client_encoding" settings.
+
+The logic for setting "pg_internal" when unspecified is:
+
+ - If "server_encoding" is "SQL_ASCII" set pg_internal to 0.
+
+ - If "client_encoding" <> "server_encoding", or perhaps better yet if
+ the pg_setting("client_encoding") returns a different value than
+ the default version for that setting, then we assuming that the
+ client encoding choice is *explicit* and the user will be wanting
+ to get raw octets back from DBI, thus set pg_internal to 0.
+
+ - Otherwise set pg_internal to 1.
+
+Immediately after connection initialization completes, we check
+whether the pg_internal flag ended up set; if so, we issue a "SET
+client_encoding TO 'utf-8'" and commit.
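+
+A rough C-level sketch of that decision, using the parameters libpq
+reports (pg_internal_default is our name, not existing DBD::Pg code;
+it implements the simpler encoding comparison, not the pg_settings
+default lookup):
+
+    #include <libpq-fe.h>
+    #include <string.h>
+
+    /* Hypothetical helper: pick the default for pg_internal when the
+     * user did not pass it in the DBI connect attributes. */
+    static int
+    pg_internal_default(PGconn *conn)
+    {
+        const char *server_enc = PQparameterStatus(conn, "server_encoding");
+        const char *client_enc = PQparameterStatus(conn, "client_encoding");
+
+        /* SQL_ASCII makes no encoding promises; hand back raw octets. */
+        if (server_enc && strcmp(server_enc, "SQL_ASCII") == 0)
+            return 0;
+
+        /* A client_encoding differing from the server's looks like an
+         * explicit user choice, so stay out of the way. */
+        if (server_enc && client_enc && strcmp(server_enc, client_enc) != 0)
+            return 0;
+
+        return 1;
+    }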
+
+
+Proposal for an "encoding" DBD attribute interface
+--------------------------------------------------
+
+DWC suggested a DBD::db handle attribute, tentatively called
+"encoding", which when set would effectively pass through to the
+underlying "SET client_encoding = $blah" and *disable* the
+pg_internal flag. Specifically, by setting the encoding attribute,
+you are indicating that you want the data from PostgreSQL back as raw
+octets in that encoding.
+
+If such a mechanism *were* instituted, we could utilize `pg_encoding =>
+'blah'` as the connection-level attribute and tie the underlying
+implementation of the pg_internal mechanism to it, by having a
+keyword ('internal') as the special-case encoding, which could be
+enabled/disabled via $dbh->{pg_encoding} = 'internal';
+
+This would allow us to pass utf-8 through *without* setting the SvUTF8
+flag, by setting $dbh->{pg_encoding} = 'utf-8'.
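+
+A sketch of how that attribute could map onto libpq at the C level
+(set_pg_encoding and pg_internal_flag are hypothetical names):
+
+    #include <libpq-fe.h>
+    #include <string.h>
+
+    static int pg_internal_flag;   /* hypothetical per-handle flag */
+
+    static int
+    set_pg_encoding(PGconn *conn, const char *value)
+    {
+        if (strcmp(value, "internal") == 0) {
+            /* the special-case keyword: force UTF-8 and keep the
+             * SvUTF8-setting behavior enabled */
+            pg_internal_flag = 1;
+            return PQsetClientEncoding(conn, "UTF8");
+        }
+        /* any explicit encoding disables pg_internal: the user gets
+         * raw octets back and SvUTF8 stays off */
+        pg_internal_flag = 0;
+        return PQsetClientEncoding(conn, value);
+    }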
+
+
+Behavior changes if pg_internal is set
+--------------------------------------
+
+There are two distinct changes that need to take place: one on the
+output side (result sets) and one on the input side (data from the
+user).
+
+When processing the result sets returned by the server, if pg_internal
+is set, we can either take it on faith that "client_encoding" is still
+the UTF-8 we set at connection time, or verify that libpq's notion of
+the result set's charset/encoding is still UTF-8. I believe this is
+available as an int (libpq's PQclientEncoding), which could be cached
+when we do the original "SET client_encoding" and/or initial setup
+tests; that should prevent accidental corruption. (A sketch follows
+this list.)
+
+ - if we decide to go this route and detect a charset change, we can
+   issue a notice/warning from DBD::Pg that the client_encoding has
+   changed, and then turn off the pg_internal flag.
+
+ - if everything checks out, we use the usual dequote_* methods and
+   set the SvUTF8 flag either on all text-based datums, or only on
+   those that actually contain hi-bit (non-ASCII) bytes.
+
+ - a possible option to benchmark would be to directly use the
+   "utf8::upgrade" method from the perl internals (or some SV-creation
+   method based on a (char*)) to take advantage of any perl-specific
+   enhancements already in place. This may be just as fast, since perl
+   already needs to copy the (char*) contents into the SV, and may
+   already have fast-tracked code paths for this type of operation,
+   given that we know the data will be valid UTF-8.
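+
+A sketch of the cached-int check plus the output-side flag setting
+(force_utf8, decode_field, and utf8_enc_id are our names, not existing
+DBD::Pg internals; error handling is elided):
+
+    #include "EXTERN.h"
+    #include "perl.h"
+    #include <libpq-fe.h>
+
+    static int utf8_enc_id = -1;   /* the cached encoding ID */
+
+    static void
+    force_utf8(PGconn *conn)
+    {
+        PGresult *res = PQexec(conn, "SET client_encoding TO 'UTF8'");
+        PQclear(res);
+        utf8_enc_id = PQclientEncoding(conn);
+    }
+
+    static SV *
+    decode_field(pTHX_ PGconn *conn, const char *val, STRLEN len,
+                 int *pg_internal)
+    {
+        SV *sv = newSVpvn(val, len);
+
+        if (*pg_internal && PQclientEncoding(conn) != utf8_enc_id) {
+            /* someone changed client_encoding behind our back */
+            warn("DBD::Pg: client_encoding changed; disabling pg_internal");
+            *pg_internal = 0;
+        }
+        if (*pg_internal)
+            SvUTF8_on(sv);   /* or only when a hi-bit byte was seen */
+        return sv;
+    }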
+
+When processing data coming *in* from the user, i.e., an (SV*), we
+consider the following (a sketch follows this list):
+
+ - if pg_internal is 0, pass through the normal methods unchanged.
+
+ - if pg_internal is 1 and the incoming SV's UTF8 flag is 1, we do
+   nothing; the underlying (char*) will already contain UTF-8 data.
+
+ - if pg_internal is 1 and the incoming SV's UTF8 flag is 0, we need
+   special consideration for hi-bit characters; since we've
+   effectively co-opted the expected client_encoding and forced UTF8,
+   we need to treat the raw data as octets. We have two choices:
+
+ - treat as latin-1/perl raw. This may be a good default choice,
+ but I'm not 100% convinced; in any case we would need to
+ convert from raw to utf-8 using utf8::upgrade.
+
+     - treat as original client_encoding. This may match the user's
+       expectations most closely, but requires us to either:
+
+       a) switch client_encoding back to the original value for the
+         query, while somehow still retaining the utf-8 client
+         encoding for result-set retrieval, or,
+
+       b) actually use Encode to transcode from the original
+         client_encoding to UTF-8. I think GSM is particularly
+         against bringing Encode into the picture, due to the
+         additional complexity.
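+
+A sketch of the input-side decision in perl-API terms, taking the
+latin-1/raw option for the hi-bit case (param_octets is our name, not
+an existing DBD::Pg function):
+
+    #include "EXTERN.h"
+    #include "perl.h"
+
+    static char *
+    param_octets(pTHX_ SV *sv, STRLEN *lenp, int pg_internal)
+    {
+        /* pg_internal off, or UTF8 flag already on: pass the bytes
+         * through unchanged */
+        if (!pg_internal || SvUTF8(sv))
+            return SvPV(sv, *lenp);
+
+        /* UTF8 flag off under pg_internal: treat the bytes as
+         * latin-1/raw and upgrade so the server sees valid UTF-8.
+         * This mutates the SV in place; a defensive sv_mortalcopy()
+         * may be wanted for read-only values. */
+        sv_utf8_upgrade(sv);
+        return SvPV(sv, *lenp);
+    }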
+
+
+Implementation considerations/ideas
+-----------------------------------
+
+DWC feels strongly that we should avoid setting the SvUTF8 flag on any
+retrieved/created SV which does not require it; as such, an operation
+that can quickly check whether there are any hi-bit characters in a
+given (char*) would need to be weighed against the possible
+inconvenience of *always* setting the SvUTF8 flag on eligible strings,
+regardless of whether they are pure ASCII.
+
+Considering that we already utilize strlen(), which traverses the
+entire string, even a naïve replacement that simply counts the hi-bit
+chars encountered while traversing for that length may have low
+enough overhead that detecting this situation is not an undue burden.
+
+We can also take advantage of two algorithmic enhancements if we know
+two things:
+
+- firstly, if we somehow already have the length of the initial
+  string/structure (via the libpq structures), we can unroll the
+  hi-bit detection into the largest-supported unit-sized words; i.e.,
+  on a 64-bit machine we could check 8 bytes at a time for the
+  presence of high bits, simply by testing the (char*) contents
+  against the mask 0x8080808080808080; if that result is non-zero
+  then we have at least one high-bit character in the batch. (We can
+  do compile-time checks to determine the largest word size and have
+  different versions of the loop depending on said determination.)
+  We would look at the (length % word size) and utilize a different
+  mask (presumably from a local static LUT) on the remainder for the
+  final detection. This makes the number of operations in the
+  worst-case scenario O(n/wordsize + 1).
+
+- if we don't care about the *total* number of hi-bit chars (which we
+  may or may not), we can short-circuit the custom strlen at the
+  first hi-bit and return (current_length) + the system
+  strlen(currentp); i.e., this would detect cases where a hi-bit is
+  present while falling back to the (presumed) faster system-level
+  strlen on the rest of the buffer.
+
+We will look at updating/benchmarking various implementations of a
+combined strlen()/has_high_bit function to use in determining the
+auto-upgrade-to-utf8 behavior. This will allow us to skip an SV copy
+for data that is potentially modified in place.
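+
+A starting point for that benchmarking (strlen_hibit and
+has_hibit_len are our names; the tail here uses a plain byte loop
+rather than the masked LUT described above):
+
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    #define HIBIT_MASK ((uint64_t)0x8080808080808080ULL)
+
+    /* word-at-a-time scan for when the length is already known,
+     * e.g. from libpq's PQgetlength() */
+    static int
+    has_hibit_len(const char *s, size_t len)
+    {
+        size_t i = 0;
+        for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
+            uint64_t w;
+            memcpy(&w, s + i, sizeof w);   /* portable unaligned load */
+            if (w & HIBIT_MASK)
+                return 1;
+        }
+        for (; i < len; i++)               /* remainder, byte by byte */
+            if ((unsigned char)s[i] & 0x80)
+                return 1;
+        return 0;
+    }
+
+    /* short-circuiting combined strlen/hi-bit detector */
+    static size_t
+    strlen_hibit(const char *s, int *has_hibit)
+    {
+        const char *p = s;
+        *has_hibit = 0;
+        for (; *p; p++) {
+            if ((unsigned char)*p & 0x80) {
+                *has_hibit = 1;
+                /* hand the rest off to the (presumably faster) libc */
+                return (size_t)(p - s) + strlen(p);
+            }
+        }
+        return (size_t)(p - s);
+    }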
+
+Which is better here depends on what is more expensive: this hi-bit
+check, or the unavoidable copy of the SV data in the case where we
+have hi-bit data that needs to be utf8::upgrade'd.
+
--
1.7.0.5