Best practice for updating a string?

Erland Sommarskog Thu, 01 May 2008 10:53:29 -0700

I have in my XS module a routine that converts the charcaters of an SV
from one code page to another. The conversion is done "in-place". That is,
I retrieve the text pointer, and then I rewrite the area pointed to. For 
reference, the full code for the routine is included at the end of this
post.


The module is an API for communicating with MS SQL Server from Perl. One 
feature is that the caller can request that data sent to/from the server 
should be converted between different code pages. When sending data to the
server, my module needs to copy the caller's data, or else the caller 
would find his variables to have changed. 

The caller may pass data in a hash, where the hash keys corresponds to 
columns. Metadata is also subject to conversion, thus also the hash keys 
needs to be converted. 

The actual code is a bit contrived, but here is a distilled script that 
shows the essence of the problem: 

   use strict;
   use Win32::SqlServer;
   
   my $X = new Win32::SqlServer;
   my $datarows = [{k => 1, 'ÅÄÖ' => 'åäö'},
                   {k => 2, 'ÅÄÖ' => 'çéü'},
                   {k => 3, 'ÅÄÖ' => '$£'},
                   {k => 4, 'ÅÄÖ' => '§§§'}];
   
   my %copy;
   foreach my $key (keys %{$$datarows[0]}) {
      my $val = $$datarows[0]{$key};
      $X->codepage_convert($val, '1252', '850');
      $X->codepage_convert($key, '1252', '850');
      $copy{$key} = $val;
   }
   
   # This should print the original array.
   foreach my $row (@$datarows) {
     foreach my $key (keys %$row) {
        print "Key: $key, Value '$$row{$key}' !!! ";
     }
     print "\n";
   }

This naïve implementation does not work on neither Perl 5.8 or 5.10. 
The output is:

   Key: k, Value '1' !!! Key: , Value 'åäö' !!! 
   Key: k, Value '2' !!! Key: , Value 'çéü' !!! 
   Key: k, Value '3' !!! Key: , Value '$£' !!! 
   Key: k, Value '4' !!! Key: , Value '§§§' !!! 

All four keys are changed, not only the one for $$datarows[0]. Apparently 
there is an optimisation in how hash keys are stored. But if I change the
loop to: 

   my $val = $$datarows[0]{$key};
   my $keycopy = $key;
   $X->codepage_convert($val, '1252', '850');
   $X->codepage_convert($keycopy, '1252', '850');
   $copy{$keycopy} = $val;

$$datarows survive the ordeal in Perl 5.8. But not so in 5.10! Apparently 
there is one more optimisation, so that the copy share the area with the
original. 

Now is my question, what is the best practice to deal with this? A simple 
workaround is this change: 

     my $keycopy = $key . '';
   
But maybe it's my XS code that is bad? That is, changing the area directly 
from C++ is bad thing that I should not do. 

But in such case, what should I do? I cannot just allocate a new area and 
forget the old, because then I have a memory leak. And if deallocate the 
area, I will apparently pull the rug for other SVs that uses the same 
area. Safefree? Does Safefree keep a reference count? The Perl
documentation is very terse, I essentially only find this: "The
XSUB-writer's interface to the C free function." And as far as I know,
free does not do reference counting. 

Or should I decrease the refcount on this SV and then return a new one?

Certainly, changing the area in place is the most efficient, although
efficiency on this level does not matter that much in a DB API.

Below is the code for codepage_convert.

void codepage_convert(SV     * olle_ptr,
                      SV     * sv,
                      UINT     from_cp,
                      UINT     to_cp)

{  int      widelen;
   int      ret;
   DWORD    err;
   BSTR     bstr;
   STRLEN   sv_len;
   char   * sv_text = (char *) SvPV(sv, sv_len);
   STRLEN   outlen;

   if (sv_len > 0) {
      // If the input string is UTF_8, we should ignore from_cp.
      if (SvUTF8(sv)) {
         from_cp = CP_UTF8;
      }

      // First find out how long the Unicode string will be, by calling
      // MultiByteToWideChar without a buffer. Not that we always set flags to
      // 0 here, since it works with all code pages.
      widelen = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, NULL, 0);

      if (widelen > 0) {
         // Allocate Unicode string and convert to Unicode.
         bstr = SysAllocStringLen(NULL, widelen);
         ret = MultiByteToWideChar(from_cp, 0, sv_text, sv_len, bstr, widelen);
      }
      else {
         ret = 0;
      }

      // Check for errors.
      if (ret == 0) {
         err = GetLastError();
         if (err == ERROR_INVALID_PARAMETER) {
            olle_croak(olle_ptr,
                       "Conversion from codepage %d to Unicode failed. Maybe 
you are using an non-existing code-page?",
                       from_cp);
         }
         else {
            olle_croak(olle_ptr,
                       "Conversion from codepage %d to Unicode failed with 
error %d",
                       from_cp, err);
         }
      }

      // Now determine the length for the string in the receiving code page.
      outlen = WideCharToMultiByte(to_cp, 0, bstr, widelen, NULL, 0, NULL, 
NULL);

      if (outlen > 0) {
         // Note that with some code pages the new string could be shorter or
         // longer.
         if (outlen > sv_len) {
            sv_text = SvGROW(sv, outlen);
         }
         SvCUR_set(sv, outlen);
         sv_text = (char *) SvPV(sv, sv_len);

         // Convert to target.
         ret = WideCharToMultiByte(to_cp, 0, bstr, widelen, sv_text, outlen, 
NULL, NULL);
      }
      else {
         ret = 0;
      }

      if (ret == 0) {
         err = GetLastError();
         if (err == ERROR_INVALID_PARAMETER) {
            olle_croak(olle_ptr,
                       "Conversion to codepage %d from Unicode failed. Maybe 
you are using an non-existing code-page?",
                       to_cp);
         }
         else {
            olle_croak(olle_ptr,
                       "Conversion to codepage %d from Unicode failed with 
error %d",
                       to_cp, err);
         }
      }

      // Get rid of the bstr.
      SysFreeString(bstr);

      // Set or unset the UTF8 flag depending on target charset.
      if (to_cp == CP_UTF8) {
         SvUTF8_on(sv);
      }
      else {
         SvUTF8_off(sv);
      }
   }
}



-- 
Erland Sommarskog, Stockholm, [EMAIL PROTECTED]

Best practice for updating a string?

Reply via email to