Some further analysis...

---------- Forwarded message ----------
From: William A Rowe Jr <wr...@rowe-clan.net>
Date: Wed, Nov 25, 2015 at 9:44 PM
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
To: httpd <d...@httpd.apache.org>


On Wed, Nov 25, 2015 at 6:45 PM, William A Rowe Jr <wr...@rowe-clan.net>
wrote:

> On Nov 25, 2015 4:19 PM, "Mikhail T." <mi+t...@aldan.algebra.com> wrote:
> >
> > Thus, I contend, using C-library will not cause invalid results, and the
> only reason to have Apache's own implementation is performance, but not
> correctness.
>
> Well, almost, but wrong...
>
> The pure char-based ß processing produced no case change in my reviews of
> tolower/toupper in the de_DE codesets. If you were to examine string
> comparison, though, the collation order changes substantially.
>
> And more to the point, if tolower()/toupper() could handle not only mbcs
> but multicharacter transliteration, your results would have varied.  1:1
> character translations have their intrinsic limits.
>
> That said, I'm working up a comprehensive audit, and other codeset/language
> combinations absolutely do produce case changes.  Code and results
> forthcoming shortly.
>
As promised, here's a quick review based on the sbcs and utf8 code pages in
the very limited single-byte scope on my machine.

I did not touch the mbcs locales listed below as untested, because they
require 'shift-state' to toggle into and out of specific characters, and that
implies a lot of calculated fuzzing that I didn't have time for this week.
(Since mod_ftp explicit TLS is still broken, I had no time for any of this,
either ;-)  I also haven't yet evaluated the wide chars that fall into the
traditional POSIX/C ASCII range, which I still mean to do, and I haven't yet
repeated this exercise on win32 or OS X, only on a somewhat multinational
configuration of Fedora 22.

The source code is pretty rudimentary.  I used iconv to shove all of the
resulting text evaluation into UTF-8 for the console/file output; iconv
plays no part in the locale equation itself.  The code can be adapted for
similar testing on an EBCDIC box with a bit of clever coding I never got to.
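
For anyone who only wants the ASCII-range check without the iconv display
logic, it might be reduced to something like the sketch below.  The helper
name count_ascii_surprises() is made up for illustration and is not part of
the attached program; it assumes an ASCII-based platform and changes the
process-wide LC_CTYPE as a side effect.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper, not part of the attached program.
 * Switches LC_CTYPE to the named locale and counts the code points in
 * 0..127 whose tolower()/toupper() results differ from the plain ASCII
 * mapping.  Returns -1 if setlocale() rejects the locale name.
 * Assumes an ASCII-based platform; leaves the process locale changed.
 */
int count_ascii_surprises(const char *name)
{
    int ch, count = 0;

    assert(name != NULL);
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;

    for (ch = 0; ch < 128; ++ch) {
        /* What the "C" locale would produce for this code point */
        int want_lc = (ch >= 'A' && ch <= 'Z') ? ch + 32 : ch;
        int want_uc = (ch >= 'a' && ch <= 'z') ? ch - 32 : ch;

        if (tolower(ch) != want_lc || toupper(ch) != want_uc)
            ++count;
    }
    return count;
}
```

Called with "C" this returns 0.  Judging by the results in this message,
the flagged locales (tr_TR and friends) would be expected to report 2, for
'I' and 'i', provided they are installed.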

Untested: ja_JP.eucjp ja_JP.ujis japanese.euc ko_KR.euckr korean.euc
zh_CN.gb18030 zh_CN.gb2312 zh_CN.gbk zh_HK.big5hkscs zh_SG.gb2312 zh_SG.gbk
zh_TW.big5 zh_TW.euctw

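Tested, with exceptional results noted (source code attached).  For each
flagged block of 64 characters, the '=' row shows the characters themselves,
'^' the toupper() results, 'v' the tolower() results, and '?' marks each
position: '.' where tolower() changes it, an apostrophe where toupper()
changes it, and '*' where the mapping is a surprise: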

LANG="aa_DJ.iso88591";
        no surprises
LANG="af_ZA.iso88591";
        no surprises
LANG="an_ES.iso885915";
        no surprises
LANG="ar_AE.iso88596";
        no surprises
LANG="ar_BH.iso88596";
        no surprises
LANG="ar_DZ.iso88596";
        no surprises
LANG="ar_EG.iso88596";
        no surprises
LANG="ar_IQ.iso88596";
        no surprises
LANG="ar_JO.iso88596";
        no surprises
LANG="ar_KW.iso88596";
        no surprises
LANG="ar_LB.iso88596";
        no surprises
LANG="ar_LY.iso88596";
        no surprises
LANG="ar_MA.iso88596";
        no surprises
LANG="ar_OM.iso88596";
        no surprises
LANG="ar_QA.iso88596";
        no surprises
LANG="ar_SA.iso88596";
        no surprises
LANG="ar_SD.iso88596";
        no surprises
LANG="ar_SY.iso88596";
        no surprises
LANG="ar_TN.iso88596";
        no surprises
LANG="ar_YE.iso88596";
        no surprises
LANG="ast_ES.iso885915";
        no surprises
LANG="be_BY.cp1251";
        no surprises
LANG="bg_BG.cp1251";
        no surprises
LANG="br_FR.iso88591";
        no surprises
LANG="br_FR.iso885915@euro";
        no surprises
LANG="bs_BA.iso88592";
        no surprises
LANG="ca_AD.iso885915";
        no surprises
LANG="ca_ES.iso88591";
        no surprises
LANG="ca_ES.iso885915@euro";
        no surprises
LANG="ca_FR.iso885915";
        no surprises
LANG="ca_IT.iso885915";
        no surprises
LANG="cs_CZ.iso88592";
        no surprises
LANG="cy_GB.iso885914";
        no surprises
LANG="da_DK.iso88591";
        no surprises
LANG="da_DK.iso885915";
        no surprises
LANG="de_AT.iso88591";
        no surprises
LANG="de_AT.iso885915@euro";
        no surprises
LANG="de_BE.iso88591";
        no surprises
LANG="de_BE.iso885915@euro";
        no surprises
LANG="de_CH.iso88591";
        no surprises
LANG="de_DE.iso88591";
        no surprises
LANG="de_DE.iso885915@euro";
        no surprises
LANG="de_LU.iso88591";
        no surprises
LANG="de_LU.iso885915@euro";
        no surprises
LANG="el_CY.iso88597";
        no surprises
LANG="el_GR.iso88597";
        no surprises
LANG="en_AU.iso88591";
        no surprises
LANG="en_BW.iso88591";
        no surprises
LANG="en_CA.iso88591";
        no surprises
LANG="en_DK.iso88591";
        no surprises
LANG="en_GB.iso88591";
        no surprises
LANG="en_GB.iso885915";
        no surprises
LANG="en_HK.iso88591";
        no surprises
LANG="en_IE.iso88591";
        no surprises
LANG="en_IE.iso885915@euro";
        no surprises
LANG="en_NZ.iso88591";
        no surprises
LANG="en_PH.iso88591";
        no surprises
LANG="en_SG.iso88591";
        no surprises
LANG="en_US.iso88591";
        no surprises
LANG="en_US.iso885915";
        no surprises
LANG="en_ZA.iso88591";
        no surprises
LANG="en_ZW.iso88591";
        no surprises
LANG="es_AR.iso88591";
        no surprises
LANG="es_BO.iso88591";
        no surprises
LANG="es_CL.iso88591";
        no surprises
LANG="es_CO.iso88591";
        no surprises
LANG="es_CR.iso88591";
        no surprises
LANG="es_DO.iso88591";
        no surprises
LANG="es_EC.iso88591";
        no surprises
LANG="es_ES.iso88591";
        no surprises
LANG="es_ES.iso885915@euro";
        no surprises
LANG="es_GT.iso88591";
        no surprises
LANG="es_HN.iso88591";
        no surprises
LANG="es_MX.iso88591";
        no surprises
LANG="es_NI.iso88591";
        no surprises
LANG="es_PA.iso88591";
        no surprises
LANG="es_PE.iso88591";
        no surprises
LANG="es_PR.iso88591";
        no surprises
LANG="es_PY.iso88591";
        no surprises
LANG="es_SV.iso88591";
        no surprises
LANG="es_US.iso88591";
        no surprises
LANG="es_UY.iso88591";
        no surprises
LANG="es_VE.iso88591";
        no surprises
LANG="et_EE.iso88591";
        no surprises
LANG="et_EE.iso885915";
        no surprises
LANG="eu_ES.iso88591";
        no surprises
LANG="eu_ES.iso885915@euro";
        no surprises
LANG="fi_FI.iso88591";
        no surprises
LANG="fi_FI.iso885915@euro";
        no surprises
LANG="fo_FO.iso88591";
        no surprises
LANG="fr_BE.iso88591";
        no surprises
LANG="fr_BE.iso885915@euro";
        no surprises
LANG="fr_CA.iso88591";
        no surprises
LANG="fr_CH.iso88591";
        no surprises
LANG="fr_FR.iso88591";
        no surprises
LANG="fr_FR.iso885915@euro";
        no surprises
LANG="fr_LU.iso88591";
        no surprises
LANG="fr_LU.iso885915@euro";
        no surprises
LANG="ga_IE.iso88591";
        no surprises
LANG="ga_IE.iso885915@euro";
        no surprises
LANG="gd_GB.iso885915";
        no surprises
LANG="gl_ES.iso88591";
        no surprises
LANG="gl_ES.iso885915@euro";
        no surprises
LANG="gv_GB.iso88591";
        no surprises
LANG="he_IL.iso88598";
        no surprises
LANG="hr_HR.iso88592";
        no surprises
LANG="hsb_DE.iso88592";
        no surprises
LANG="hu_HU.iso88592";
        no surprises
LANG="hy_AM.armscii8";
        no surprises
LANG="id_ID.iso88591";
        no surprises
LANG="is_IS.iso88591";
        no surprises
LANG="it_CH.iso88591";
        no surprises
LANG="it_IT.iso88591";
        no surprises
LANG="it_IT.iso885915@euro";
        no surprises
LANG="iw_IL.iso88598";
        no surprises
LANG="ka_GE.georgianps";
        no surprises
LANG="kk_KZ.pt154";
        no surprises
LANG="kl_GL.iso88591";
        no surprises
LANG="ku_TR.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="kw_GB.iso88591";
        no surprises
LANG="lg_UG.iso885910";
        no surprises
LANG="lt_LT.iso885913";
        no surprises
LANG="lv_LV.iso885913";
        no surprises
LANG="mg_MG.iso885915";
        no surprises
LANG="mi_NZ.iso885913";
        no surprises
LANG="mk_MK.iso88595";
        no surprises
LANG="ms_MY.iso88591";
        no surprises
LANG="mt_MT.iso88593";
  128 =                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°ħ²³´µĥ·¸ışğĵ½ ż
      ^                                  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°Ħ²³´µĤ·¸IŞĞĴ½ Ż
      v                                  ħ˘£¤ ĥ§¨işğĵ­ ż°ħ²³´µĥ·¸ışğĵ½ ż
      ?                                  .    .  *...  . '    '  *'''  '
LANG="nb_NO.iso88591";
        no surprises
LANG="nl_BE.iso88591";
        no surprises
LANG="nl_BE.iso885915@euro";
        no surprises
LANG="nl_NL.iso88591";
        no surprises
LANG="nl_NL.iso885915@euro";
        no surprises
LANG="nn_NO.iso88591";
        no surprises
LANG="(null)";
        no surprises
LANG="oc_FR.iso88591";
        no surprises
LANG="om_KE.iso88591";
        no surprises
LANG="pl_PL.iso88592";
        no surprises
LANG="pt_BR.iso88591";
        no surprises
LANG="pt_PT.iso88591";
        no surprises
LANG="pt_PT.iso885915@euro";
        no surprises
LANG="ro_RO.iso88592";
        no surprises
LANG="ru_RU.iso88595";
        no surprises
LANG="ru_RU.koi8r";
        no surprises
LANG="ru_UA.koi8u";
        no surprises
LANG="sk_SK.iso88592";
        no surprises
LANG="sl_SI.iso88592";
        no surprises
LANG="so_DJ.iso88591";
        no surprises
LANG="so_KE.iso88591";
        no surprises
LANG="so_SO.iso88591";
        no surprises
LANG="sq_AL.iso88591";
        no surprises
LANG="st_ZA.iso88591";
        no surprises
LANG="sv_FI.iso88591";
        no surprises
LANG="sv_FI.iso885915@euro";
        no surprises
LANG="sv_SE.iso88591";
        no surprises
LANG="sv_SE.iso885915";
        no surprises
LANG="tg_TJ.koi8t";
        no surprises
LANG="th_TH.tis620";
        no surprises
LANG="tl_PH.iso88591";
        no surprises
LANG="tr_CY.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="tr_TR.iso88599";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
  192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
      v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
      ? ....................... .....*. ''''''''''''''''''''''' '''''*'
LANG="uk_UA.koi8u";
        no surprises
LANG="uz_UZ.iso88591";
        no surprises
LANG="wa_BE.iso88591";
        no surprises
LANG="wa_BE.iso885915@euro";
        no surprises
LANG="xh_ZA.iso88591";
        no surprises
LANG="yi_US.cp1255";
        no surprises
LANG="zu_ZA.iso88591";
        no surprises
LANG="aa_DJ.utf8";
        no surprises
LANG="aa_ER.utf8";
        no surprises
LANG="aa_ER.utf8@saaho";
        no surprises
LANG="aa_ET.utf8";
        no surprises
LANG="af_ZA.utf8";
        no surprises
LANG="ak_GH.utf8";
        no surprises
LANG="am_ET.utf8";
        no surprises
LANG="an_ES.utf8";
        no surprises
LANG="anp_IN.utf8";
        no surprises
LANG="ar_AE.utf8";
        no surprises
LANG="ar_BH.utf8";
        no surprises
LANG="ar_DZ.utf8";
        no surprises
LANG="ar_EG.utf8";
        no surprises
LANG="ar_IN.utf8";
        no surprises
LANG="ar_IQ.utf8";
        no surprises
LANG="ar_JO.utf8";
        no surprises
LANG="ar_KW.utf8";
        no surprises
LANG="ar_LB.utf8";
        no surprises
LANG="ar_LY.utf8";
        no surprises
LANG="ar_MA.utf8";
        no surprises
LANG="ar_OM.utf8";
        no surprises
LANG="ar_QA.utf8";
        no surprises
LANG="ar_SA.utf8";
        no surprises
LANG="ar_SD.utf8";
        no surprises
LANG="ar_SS.utf8";
        no surprises
LANG="ar_SY.utf8";
        no surprises
LANG="ar_TN.utf8";
        no surprises
LANG="ar_YE.utf8";
        no surprises
LANG="as_IN.utf8";
        no surprises
LANG="ast_ES.utf8";
        no surprises
LANG="ayc_PE.utf8";
        no surprises
LANG="az_AZ.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="be_BY.utf8";
        no surprises
LANG="be_BY.utf8@latin";
        no surprises
LANG="bem_ZM.utf8";
        no surprises
LANG="ber_DZ.utf8";
        no surprises
LANG="ber_MA.utf8";
        no surprises
LANG="bg_BG.utf8";
        no surprises
LANG="bh_IN.utf8";
        no surprises
LANG="bho_IN.utf8";
        no surprises
LANG="bn_BD.utf8";
        no surprises
LANG="bn_IN.utf8";
        no surprises
LANG="bo_CN.utf8";
        no surprises
LANG="bo_IN.utf8";
        no surprises
LANG="br_FR.utf8";
        no surprises
LANG="brx_IN.utf8";
        no surprises
LANG="bs_BA.utf8";
        no surprises
LANG="byn_ER.utf8";
        no surprises
LANG="ca_AD.utf8";
        no surprises
LANG="ca_ES.utf8";
        no surprises
LANG="ca_FR.utf8";
        no surprises
LANG="ca_IT.utf8";
        no surprises
LANG="ce_RU.utf8";
        no surprises
LANG="cmn_TW.utf8";
        no surprises
LANG="crh_UA.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="csb_PL.utf8";
        no surprises
LANG="cs_CZ.utf8";
        no surprises
LANG="cv_RU.utf8";
        no surprises
LANG="cy_GB.utf8";
        no surprises
LANG="da_DK.utf8";
        no surprises
LANG="de_AT.utf8";
        no surprises
LANG="de_BE.utf8";
        no surprises
LANG="de_CH.utf8";
        no surprises
LANG="de_DE.utf8";
        no surprises
LANG="de_LU.utf8";
        no surprises
LANG="doi_IN.utf8";
        no surprises
LANG="dv_MV.utf8";
        no surprises
LANG="dz_BT.utf8";
        no surprises
LANG="el_CY.utf8";
        no surprises
LANG="el_GR.utf8";
        no surprises
LANG="en_AG.utf8";
        no surprises
LANG="en_AU.utf8";
        no surprises
LANG="en_BW.utf8";
        no surprises
LANG="en_CA.utf8";
        no surprises
LANG="en_DK.utf8";
        no surprises
LANG="en_GB.utf8";
        no surprises
LANG="en_HK.utf8";
        no surprises
LANG="en_IE.utf8";
        no surprises
LANG="en_IN.utf8";
        no surprises
LANG="en_NG.utf8";
        no surprises
LANG="en_NZ.utf8";
        no surprises
LANG="en_PH.utf8";
        no surprises
LANG="en_SG.utf8";
        no surprises
LANG="en_US.utf8";
        no surprises
LANG="en_ZA.utf8";
        no surprises
LANG="en_ZM.utf8";
        no surprises
LANG="en_ZW.utf8";
        no surprises
LANG="es_AR.utf8";
        no surprises
LANG="es_BO.utf8";
        no surprises
LANG="es_CL.utf8";
        no surprises
LANG="es_CO.utf8";
        no surprises
LANG="es_CR.utf8";
        no surprises
LANG="es_CU.utf8";
        no surprises
LANG="es_DO.utf8";
        no surprises
LANG="es_EC.utf8";
        no surprises
LANG="es_ES.utf8";
        no surprises
LANG="es_GT.utf8";
        no surprises
LANG="es_HN.utf8";
        no surprises
LANG="es_MX.utf8";
        no surprises
LANG="es_NI.utf8";
        no surprises
LANG="es_PA.utf8";
        no surprises
LANG="es_PE.utf8";
        no surprises
LANG="es_PR.utf8";
        no surprises
LANG="es_PY.utf8";
        no surprises
LANG="es_SV.utf8";
        no surprises
LANG="es_US.utf8";
        no surprises
LANG="es_UY.utf8";
        no surprises
LANG="es_VE.utf8";
        no surprises
LANG="et_EE.utf8";
        no surprises
LANG="eu_ES.utf8";
        no surprises
LANG="fa_IR.utf8";
        no surprises
LANG="ff_SN.utf8";
        no surprises
LANG="fi_FI.utf8";
        no surprises
LANG="fil_PH.utf8";
        no surprises
LANG="fo_FO.utf8";
        no surprises
LANG="fr_BE.utf8";
        no surprises
LANG="fr_CA.utf8";
        no surprises
LANG="fr_CH.utf8";
        no surprises
LANG="fr_FR.utf8";
        no surprises
LANG="fr_LU.utf8";
        no surprises
LANG="fur_IT.utf8";
        no surprises
LANG="fy_DE.utf8";
        no surprises
LANG="fy_NL.utf8";
        no surprises
LANG="ga_IE.utf8";
        no surprises
LANG="gd_GB.utf8";
        no surprises
LANG="gez_ER.utf8";
        no surprises
LANG="gez_ER.utf8@abegede";
        no surprises
LANG="gez_ET.utf8";
        no surprises
LANG="gez_ET.utf8@abegede";
        no surprises
LANG="gl_ES.utf8";
        no surprises
LANG="gu_IN.utf8";
        no surprises
LANG="gv_GB.utf8";
        no surprises
LANG="hak_TW.utf8";
        no surprises
LANG="ha_NG.utf8";
        no surprises
LANG="he_IL.utf8";
        no surprises
LANG="hi_IN.utf8";
        no surprises
LANG="hne_IN.utf8";
        no surprises
LANG="hr_HR.utf8";
        no surprises
LANG="hsb_DE.utf8";
        no surprises
LANG="ht_HT.utf8";
        no surprises
LANG="hu_HU.utf8";
        no surprises
LANG="hy_AM.utf8";
        no surprises
LANG="ia_FR.utf8";
        no surprises
LANG="id_ID.utf8";
        no surprises
LANG="ig_NG.utf8";
        no surprises
LANG="ik_CA.utf8";
        no surprises
LANG="is_IS.utf8";
        no surprises
LANG="it_CH.utf8";
        no surprises
LANG="it_IT.utf8";
        no surprises
LANG="iu_CA.utf8";
        no surprises
LANG="iw_IL.utf8";
        no surprises
LANG="ja_JP.utf8";
        no surprises
LANG="ka_GE.utf8";
        no surprises
LANG="kk_KZ.utf8";
        no surprises
LANG="kl_GL.utf8";
        no surprises
LANG="km_KH.utf8";
        no surprises
LANG="kn_IN.utf8";
        no surprises
LANG="kok_IN.utf8";
        no surprises
LANG="ko_KR.utf8";
        no surprises
LANG="ks_IN.utf8";
        no surprises
LANG="ks_IN.utf8@devanagari";
        no surprises
LANG="ku_TR.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="kw_GB.utf8";
        no surprises
LANG="ky_KG.utf8";
        no surprises
LANG="lb_LU.utf8";
        no surprises
LANG="lg_UG.utf8";
        no surprises
LANG="li_BE.utf8";
        no surprises
LANG="lij_IT.utf8";
        no surprises
LANG="li_NL.utf8";
        no surprises
LANG="lo_LA.utf8";
        no surprises
LANG="lt_LT.utf8";
        no surprises
LANG="lv_LV.utf8";
        no surprises
LANG="lzh_TW.utf8";
        no surprises
LANG="mag_IN.utf8";
        no surprises
LANG="mai_IN.utf8";
        no surprises
LANG="mg_MG.utf8";
        no surprises
LANG="mhr_RU.utf8";
        no surprises
LANG="mi_NZ.utf8";
        no surprises
LANG="mk_MK.utf8";
        no surprises
LANG="ml_IN.utf8";
        no surprises
LANG="mni_IN.utf8";
        no surprises
LANG="mn_MN.utf8";
        no surprises
LANG="mr_IN.utf8";
        no surprises
LANG="ms_MY.utf8";
        no surprises
LANG="mt_MT.utf8";
        no surprises
LANG="my_MM.utf8";
        no surprises
LANG="nan_TW.utf8";
        no surprises
LANG="nan_TW.utf8@latin";
        no surprises
LANG="nb_NO.utf8";
        no surprises
LANG="nds_DE.utf8";
        no surprises
LANG="nds_NL.utf8";
        no surprises
LANG="ne_NP.utf8";
        no surprises
LANG="nhn_MX.utf8";
        no surprises
LANG="niu_NU.utf8";
        no surprises
LANG="niu_NZ.utf8";
        no surprises
LANG="nl_AW.utf8";
        no surprises
LANG="nl_BE.utf8";
        no surprises
LANG="nl_NL.utf8";
        no surprises
LANG="nn_NO.utf8";
        no surprises
LANG="nr_ZA.utf8";
        no surprises
LANG="nso_ZA.utf8";
        no surprises
LANG="oc_FR.utf8";
        no surprises
LANG="om_ET.utf8";
        no surprises
LANG="om_KE.utf8";
        no surprises
LANG="or_IN.utf8";
        no surprises
LANG="os_RU.utf8";
        no surprises
LANG="pa_IN.utf8";
        no surprises
LANG="pap_AN.utf8";
        no surprises
LANG="pap_AW.utf8";
        no surprises
LANG="pap_CW.utf8";
        no surprises
LANG="pa_PK.utf8";
        no surprises
LANG="pl_PL.utf8";
        no surprises
LANG="ps_AF.utf8";
        no surprises
LANG="pt_BR.utf8";
        no surprises
LANG="pt_PT.utf8";
        no surprises
LANG="quz_PE.utf8";
        no surprises
LANG="raj_IN.utf8";
        no surprises
LANG="ro_RO.utf8";
        no surprises
LANG="ru_RU.utf8";
        no surprises
LANG="ru_UA.utf8";
        no surprises
LANG="rw_RW.utf8";
        no surprises
LANG="sa_IN.utf8";
        no surprises
LANG="sat_IN.utf8";
        no surprises
LANG="sc_IT.utf8";
        no surprises
LANG="sd_IN.utf8";
        no surprises
LANG="sd_IN.utf8@devanagari";
        no surprises
LANG="se_NO.utf8";
        no surprises
LANG="shs_CA.utf8";
        no surprises
LANG="sid_ET.utf8";
        no surprises
LANG="si_LK.utf8";
        no surprises
LANG="sk_SK.utf8";
        no surprises
LANG="sl_SI.utf8";
        no surprises
LANG="so_DJ.utf8";
        no surprises
LANG="so_ET.utf8";
        no surprises
LANG="so_KE.utf8";
        no surprises
LANG="so_SO.utf8";
        no surprises
LANG="sq_AL.utf8";
        no surprises
LANG="sq_MK.utf8";
        no surprises
LANG="sr_ME.utf8";
        no surprises
LANG="sr_RS.utf8";
        no surprises
LANG="sr_RS.utf8@latin";
        no surprises
LANG="ss_ZA.utf8";
        no surprises
LANG="st_ZA.utf8";
        no surprises
LANG="sv_FI.utf8";
        no surprises
LANG="sv_SE.utf8";
        no surprises
LANG="sw_KE.utf8";
        no surprises
LANG="sw_TZ.utf8";
        no surprises
LANG="szl_PL.utf8";
        no surprises
LANG="ta_IN.utf8";
        no surprises
LANG="ta_LK.utf8";
        no surprises
LANG="te_IN.utf8";
        no surprises
LANG="tg_TJ.utf8";
        no surprises
LANG="the_NP.utf8";
        no surprises
LANG="th_TH.utf8";
        no surprises
LANG="ti_ER.utf8";
        no surprises
LANG="ti_ET.utf8";
        no surprises
LANG="tig_ER.utf8";
        no surprises
LANG="tk_TM.utf8";
        no surprises
LANG="tl_PH.utf8";
        no surprises
LANG="tn_ZA.utf8";
        no surprises
LANG="tr_CY.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="tr_TR.utf8";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="ts_ZA.utf8";
        no surprises
LANG="tt_RU.utf8";
        no surprises
LANG="tt_RU.utf8@iqtelif";
   64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHiJKLMNOPQRSTUVWXYZ{|}~
      v @abcdefghIjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
      ?  ........*.................      ''''''''*'''''''''''''''''
LANG="tu_IN.utf8";
        no surprises
LANG="ug_CN.utf8";
        no surprises
LANG="uk_UA.utf8";
        no surprises
LANG="unm_US.utf8";
        no surprises
LANG="ur_IN.utf8";
        no surprises
LANG="ur_PK.utf8";
        no surprises
LANG="uz_UZ.utf8";
        no surprises
LANG="uz_UZ.utf8@cyrillic";
        no surprises
LANG="ve_ZA.utf8";
        no surprises
LANG="vi_VN.utf8";
        no surprises
LANG="wa_BE.utf8";
        no surprises
LANG="wae_CH.utf8";
        no surprises
LANG="wal_ET.utf8";
        no surprises
LANG="wo_SN.utf8";
        no surprises
LANG="xh_ZA.utf8";
        no surprises
LANG="yi_US.utf8";
        no surprises
LANG="yo_NG.utf8";
        no surprises
LANG="yue_HK.utf8";
        no surprises
LANG="zh_CN.utf8";
        no surprises
LANG="zh_HK.utf8";
        no surprises
LANG="zh_SG.utf8";
        no surprises
LANG="zh_TW.utf8";
        no surprises
LANG="zu_ZA.utf8";
        no surprises
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <string.h>
#include <ctype.h>
#include <iconv.h>
#include <langinfo.h>

/* Express shock if a character maps from the ASCII/ISO646 range into
 * the high-bit range, or vice versa, and then explicitly verify that
 * all ASCII values map as expected if it is in the lowest 128 char plane.
 * This does not work for testing on an EBCDIC architecture: the ranges
 * for alpha and 'well known posix characters' differ, and upper/lower
 * case are offset by 64 there rather than 32.
 */
int surprise (int ch, int uc, int lc)
{
    int lcsurprise = (ch > 127 && lc < 128) || (ch < 128 && lc > 127);
    int ucsurprise = (ch > 127 && uc < 128) || (ch < 128 && uc > 127);
    if (ch < 128) {
        if (ch >= 'A' && ch <= 'Z') {
            if (lc != ch + 32) lcsurprise = 1;
            if (uc != ch) ucsurprise = 1;
        } else if (ch >= 'a' && ch <= 'z') {
            if (lc != ch) lcsurprise = 1;
            if (uc != ch - 32) ucsurprise = 1;
        } else {
            if (lc != ch) lcsurprise = 1;
            if (uc != ch) ucsurprise = 1;
        }
    }      
    return lcsurprise | ucsurprise;
}     


int main (int argc, char *argv[])
{
  int verbose = 0;
  int colwidth = 64;
  int row, col, view, prelen;
  char buf[colwidth + 16], *bufch, *bufptr;
  char pbuf[colwidth * 4 + 16], *pbufptr;
  size_t bufsz, pbufsz;
  iconv_t convctx;
  char *locale;
  ++argv, --argc;

  if (*argv && (strcmp(*argv, "-v") == 0)) {
    verbose = 1;
    ++argv, --argc;
  }

  do {
    int viewed = 0;
    if (!*argv)
      locale = setlocale(LC_ALL, "");
    else
      locale = setlocale(LC_ALL, *argv++);
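    /* setlocale() returns NULL for an unavailable locale; passing NULL to
     * %s is undefined, so print the "(null)" marker explicitly. */
    printf("LANG=\"%s\";\n", locale ? locale : "(null)");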

    convctx = iconv_open("UTF-8", nl_langinfo(CODESET));
    /* iconv_open() reports failure as (iconv_t)-1, not NULL */
    if (convctx == (iconv_t)-1) {
      printf("Failed to initialize \"%s\" to \"UTF-8\" iconv context\n",
             nl_langinfo(CODESET));
      continue;
    }

    for (row = 0; row < (256 / colwidth); ++row) {
      for (col = 0, view = 0; col < colwidth; ++col) {
        unsigned char ch = row * colwidth + col;
        unsigned char lc = tolower(ch), uc = toupper(ch);
        if (verbose && (lc != ch || uc != ch)) {
          view = 1; break;
        }
        if (surprise(ch, uc, lc)) {
          view = 1; break;
        }
      }
      if (!view)
        continue;

      bufch = buf + sprintf(buf, "%5d = ", row * colwidth);
      prelen = bufch - buf - 3;
      for (col = 0; col < colwidth; ++col) {
        unsigned char ch = row * colwidth + col;
        *(bufch++) = isprint(ch) ? ch : ' ';
      }
      bufptr = buf; bufsz = bufch - buf; pbufptr = pbuf; pbufsz = sizeof(pbuf);
      iconv(convctx, &bufptr, &bufsz, &pbufptr, &pbufsz);
      printf("%.*s\n", (int)(pbufptr - pbuf), pbuf);

      bufch = buf + sprintf(buf, "%.*s ^ ", prelen, "        ");
      for (col = 0; col < colwidth; ++col) {
        unsigned char ch = row * colwidth + col;
        unsigned char uc = toupper(ch);
        *(bufch++) = isprint(uc) ? uc : ' ';
      }
      bufptr = buf; bufsz = bufch - buf; pbufptr = pbuf; pbufsz = sizeof(pbuf);
      iconv(convctx, &bufptr, &bufsz, &pbufptr, &pbufsz);
      printf("%.*s\n", (int)(pbufptr - pbuf), pbuf);

      bufch = buf + sprintf(buf, "%.*s v ", prelen, "        ");
      for (col = 0; col < colwidth; ++col) {
        unsigned char ch = row * colwidth + col;
        unsigned char lc = tolower(ch);
        *(bufch++) = isprint(lc) ? lc : ' ';
      }
      bufptr = buf; bufsz = bufch - buf; pbufptr = pbuf; pbufsz = sizeof(pbuf);
      iconv(convctx, &bufptr, &bufsz, &pbufptr, &pbufsz);
      printf("%.*s\n", (int)(pbufptr - pbuf), pbuf);

      bufch = buf + sprintf(buf, "%.*s ? ", prelen, "        ");
      for (col = 0; col < colwidth; ++col) {
        unsigned char ch = row * colwidth + col;
        unsigned char lc = tolower(ch);
        unsigned char uc = toupper(ch);
        *(bufch++) = surprise(ch, uc, lc) ? '*'
                                          : (uc != ch) ? '\''
                                                       : (lc != ch) ? '.' : ' ';
      }
      bufptr = buf; bufsz = bufch - buf; pbufptr = pbuf; pbufsz = sizeof(pbuf);
      iconv(convctx, &bufptr, &bufsz, &pbufptr, &pbufsz);
      printf("%.*s\n", (int)(pbufptr - pbuf), pbuf);
      viewed = 1;
    }
    if (!viewed)
        printf("        no surprises\n");
    iconv_close(convctx);
  } while (*argv);
  return 0;
}
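
The practical consequence of the flagged results for protocol tokens is that
a case-insensitive comparison should not go through the locale at all.  A
minimal sketch of a locale-independent comparison follows; ascii_lower() and
ascii_casecmp() are hypothetical names for illustration, not the apr_token_*
code under discussion, and the arithmetic assumes an ASCII-based platform
(an EBCDIC build would need a lookup table instead).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fold only A-Z to a-z, by arithmetic, so the
 * result never depends on setlocale().  Assumes an ASCII-based platform. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical helper: compare two NUL-terminated strings, ignoring case
 * for ASCII letters only.  Bytes above 127 are compared as-is.
 * Returns <0, 0 or >0 in the manner of strcmp(). */
int ascii_casecmp(const char *a, const char *b)
{
    const unsigned char *pa, *pb;

    assert(a != NULL && b != NULL);
    pa = (const unsigned char *)a;
    pb = (const unsigned char *)b;

    for (;; ++pa, ++pb) {
        int ca = ascii_lower(*pa);
        int cb = ascii_lower(*pb);

        if (ca != cb)
            return ca - cb;
        if (ca == 0)
            return 0;
    }
}
```

With this, "TITLE" and "title" compare equal regardless of LANG, which
a tolower()-based comparison would not guarantee under the flagged locales.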
