Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-30 Thread Yann Ylavic
Sorry for the late reply, I was afk for a while...

Regarding the name, I'm fine with ap[r]_cstr[n]casecmp(),
ap[r]_casecmpcstr[n]() or ap[r]_cstr_*() (if we need a set of
functions in this area)..

I think we all agree that the new function(s) would help protocol
"validation" be agnostic wrt the locale, though httpd (as any *nix
program) runs in the "C" locale by default (hence str[n]casecmp()
behaves as expected), and this can't change unless some
(third-party) module plays with setlocale(), as Bill said.

So the new function(s) would address two concerns:
1. doing the right thing at the protocol level if/when modules need
custom locales,
2. have an efficient "C"-string caseless comparison function on all
platforms (see test results below).

For 1. I agree we should not hurry and take the time to review the
kind of changes I proposed in [1].

For 2. I think we can start using the new function(s) whenever we are
dealing with "C"-strings and this is a fast path (eg. Jean-Frederic's
report about ap_proxy_port_of_scheme(), which should be addressed both
in httpd and APR IMHO).


Regarding performance, attached are the tests (and results) I ran on
different systems (Linux+glibc+gcc only!, i.e.
Debian6+glibc-2.11+gcc-4.4, Debian8+glibc-2.19+gcc-4.9 and
CentOS7+glibc-2.17+gcc-4.8) for the different implementations that
were discussed so far (including standard strncasecmp,
svn_cstring_casecmp, and Mikhail's mi_strcasecmp).



a. Our implementation(s) are faster than str[n]casecmp() for string
lengths < 4 or 8 (depending on sizeof(long), i.e. 32bit vs 64bit
system), which matters not only for such short strings but also when
the compared strings differ in these first bytes (our implementation
fails faster here too),

b. The latest str[n]casecmp() (and/or gcc) are far faster (x3) than any
of our proposals in the "C" (or "UTF-8") locale for longer strings; too
bad there is no strcasecmp[_loc]() taking the locale as argument (à la
stdc++)...
*However*, whenever mappings are in place, e.g. the famous
mt_MT.ISO-8859, str[n]casecmp() takes the same time as our
implementation (comparing the same number of caseless-equal
characters),

c. Our best implementation, which is performing well in all cases (ie.
no "pathological" behaviour with some cases) is Jim's "ap_casestrcmp"
(the current one).
Actually the ones performing a bit better are those called
"ap_casestrcmp_1" and "ap_casestrcmp_2" in the test, the former being
the same as Jim's but with "++ps1; ++ps2;" done at the end of the
loop, and the latter being my proposed version using an index instead
of char pointers (no gain compared to "ap_casestrcmp_1", not worth the
change...).
So I'd be for using Jim's with the simple "++ps1; ++ps2;" change.
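
For readers without the attachment, here is a minimal sketch of the loop
shape discussed above, with the "++ps1; ++ps2;" step at the end of the
loop body. It is not Jim's code: the sketch_* names are hypothetical, and
it folds case with a plain range check (assuming an ASCII-compatible
charset) where the implementation under discussion uses a lookup table.

```c
#include <stddef.h>

/* Hypothetical helper: fold 'A'..'Z' to 'a'..'z' and leave every other
 * octet untouched. Assumes an ASCII-compatible execution charset. */
static int sketch_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Locale-independent caseless comparison: returns <0, 0 or >0. */
int sketch_cstr_casecmp(const char *s1, const char *s2)
{
    const unsigned char *ps1 = (const unsigned char *)s1;
    const unsigned char *ps2 = (const unsigned char *)s2;

    for (;;) {
        int c1 = sketch_fold(*ps1);
        int c2 = sketch_fold(*ps2);
        if (c1 != c2)
            return c1 - c2;
        if (c1 == '\0')
            return 0;
        /* Both pointers advance at the end of the loop body. */
        ++ps1;
        ++ps2;
    }
}

/* Same comparison, limited to at most n characters. */
int sketch_cstr_casecmpn(const char *s1, const char *s2, size_t n)
{
    const unsigned char *ps1 = (const unsigned char *)s1;
    const unsigned char *ps2 = (const unsigned char *)s2;

    while (n-- > 0) {
        int c1 = sketch_fold(*ps1);
        int c2 = sketch_fold(*ps2);
        if (c1 != c2)
            return c1 - c2;
        if (c1 == '\0')
            return 0;
        ++ps1;
        ++ps2;
    }
    return 0;
}
```

Since neither function consults the locale, a call such as
sketch_cstr_casecmp("Host", "HOST") gives the same answer whatever a
module did with setlocale().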




The attached test results are the ones run on CentOS7 (because this is
the system of a real/performant machine I can access, and running the
tests on my Debian laptop makes it hot enough to be unfair :)
Since I'm not very used to CentOS, I could not get the
"mt_MT.ISO-8859" locale to work/be applied, either because I'm doing
things wrong, or somehow the locale has been updated to avoid this
mapping (though I was able to make it work with the latest Debians, where
strcasecmp() performs differently depending on the locale...).

So for completeness, I'm pasting the results on a debian jessie for
locales "mt_MT.ISO-8859" and "C" here, since it matters there:

$ LC_ALL=mt_MT.iso88593 ./ap_casecmpstr-O2 'a' 1
'CyCyCyCyCyCyCyCyCyCoOoOoOoOoOaAaAaAaAaAa'
'cYcYcYcYcYcYcYcYcYcOoOoOoOoOoAaAaAaAaAaa' 0
./ap_casecmpstr-O2 'a' 1
"CyCyCyCyCyCyCyCyCyCoOoOoOoOoOaAaAaAaAaAa"
"cYcYcYcYcYcYcYcYcYcOoOoOoOoOoAaAaAaAaAaa" 0: locale "mt_MT.iso88593"
- ap_casecmpstr   : time=06.160937456, res=0
- ap_casecmpstr_1 : time=06.256894742, res=0
- ap_casecmpstr_2 : time=06.136213804, res=0
- ap_casecmpstr_4 : time=06.787756289, res=0
- ap_casecmpstr_3 : time=06.110559311, res=0
- ap_casecmpstr_7 : time=06.844624092, res=0
- ap_casecmpstr_5 : time=06.820174763, res=0
- ap_casecmpstr_6 : time=10.488436936, res=0
- svn_cstring_casecmp : time=07.329213881, res=0
- mi_strcasecmp   : time=10.165367784, res=0
- strcasecmp_ext  : time=06.274211596, res=0
- strcasecmp  : time=06.126361486, res=0
- strcmp  : time=00.590613344, res=-32 != str[n]casecmp()'s result!

$ LC_ALL=mt_MT.iso88593 ./ap_casecmpstr-O2 'a' 1
$'\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9\xa9'
'' 0
./ap_casecmpstr-O2 'a' 1 "<...40 unprintable chars here...>"
"" 0: locale "mt_MT.iso88593"
- ap_casecmpstr   : time=00.479203198, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_1 : time=00.526867671, res=64 != str[n]casecmp()'s result!
- ap_casecmpstr_2 : time=00.525341010, res=64 != str[n]casecmp()'s 

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-30 Thread William A Rowe Jr
I've hijacked Yann's thoughts and replied on a dev@apr thread.  There is
merit in httpd's deliberations but the issue is sufficiently larger than
'just us ourselves'.


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-26 Thread Jim Jagielski
ascii? ascii? ascii?

:-)

> On Nov 25, 2015, at 4:52 PM, Christophe JAILLET 
>  wrote:
> 
> Hi,
> 
> just in case, GNOME has a set of functions g_ascii_...
> (see 
> https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
> 
>> 
>> I'm also waiting for feedback about the naming convention, I'd like to get
>> this into APR yesterday and start building on it, but it's hard to name our
>> generic-posix tolower/toupper until we agree on the naming scheme :)
>> 
>> 
> 



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-26 Thread Jim Jagielski
Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
pretty much in line with my thoughts on how httpd would
use ours...

> On Nov 25, 2015, at 5:10 PM, Bert Huijben <b...@qqmail.nl> wrote:
> 
> We have a set of similar comparison functions in Subversion. I’m pretty sure 
> we already had these in the time we still had ebcdic support on trunk.
> (We removed that support years ago, but the code should still live on a 
> branch)
>  
> Bert
>  
> From: William A Rowe Jr [mailto:wr...@rowe-clan.net] 
> Sent: woensdag 25 november 2015 22:55
> To: httpd <dev@httpd.apache.org>
> Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
>  
> On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET 
> <christophe.jail...@wanadoo.fr> wrote:
>> Hi,
>> 
>> just in case, GNOME has a set of functions g_ascii_...
>> (see 
>> https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)
>  
> Interesting, does anyone know offhand whether these perform the expected
> or the stated behavior under EBCDIC environments? 



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-26 Thread William A Rowe Jr
Sounds right... Actually a fusion between svn_cstring_* and several
existing ap_ and apr_ functions would be useful.

SVN folk, any objection to APR appropriating these API's?  20/20 hindsight,
is apr_cstring_ or shorter apr_cstr_ the way to go here?  You all had to
use the thing so I trust your preferences.  Either expresses locale C in my
mind, so they work for me.
On Nov 26, 2015 07:38, "Jim Jagielski" <j...@jagunet.com> wrote:

> Yeah, SVN's 'svn_cstring_casecmp' and how it's used is
> pretty much inline with my thoughts on how httpd would
> use ours...
>
> > On Nov 25, 2015, at 5:10 PM, Bert Huijben <b...@qqmail.nl> wrote:
> >
> > We have a set of similar comparison functions in Subversion. I’m pretty
> sure we already had these in the time we still had ebcdic support on trunk.
> > (We removed that support years ago, but the code should still live on a
> branch)
> >
> > Bert
> >
> > From: William A Rowe Jr [mailto:wr...@rowe-clan.net]
> > Sent: woensdag 25 november 2015 22:55
> > To: httpd <dev@httpd.apache.org>
> > Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)
> >
> > On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <
> christophe.jail...@wanadoo.fr> wrote:
> >> Hi,
> >>
> >> just in case, GNOME has a set of functions g_ascii_...
> >> (see
> https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp
> )
> >
> > Interesting, does anyone know offhand whether these perform the expected
> > or the stated behavior under EBCDIC environments?
>
>


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Mikhail T.
On 25.11.2015 18:21, Bert Huijben wrote:
> That Turkish ‘I’ problem is the only case I know of where the
> collation actually changes behavior within the usual western alphabet
> of ASCII characters.
Argh, yes, I see now, what the problem would be... Thank you,

-mi



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Nov 25, 2015 4:19 PM, "Mikhail T."  wrote:
>
>
>>
>> So, the concern is, some hypothetical header, such as X-ASSIGN-TO may,
after going through the locale-aware strtolower() unexpectedly become
x-aßign-to?
>
> I just tested the above on both FreeBSD and Linux, and the results are
encouraging:
>>
>> % echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
>> strasse
>
> Thus, I contend, using C-library will not cause invalid results, and the
only reason to have Apache's own implementation is performance, but not
correctness.

Well almost but wrong...

The pure char-based ß processing produced no case change in my reviews of
tolower/toupper in de_DE codeset. If you were to examine string comparison
the collation order changes substantially.

That said, I'm working up a comprehensive audit and other codeset/language
combinations absolutely do.  Code and results forthcoming shortly.

As long as everyone keeps their fingers off the setlocale()/trigger, it's
all fine.


RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Bert Huijben
The example was the other way around. Changing SS to ß is not a valid 
transform, but the other way is. There are also transforms on the combined AE 
characters, etc.

 

That Turkish ‘I’ problem is the only case I know of where the collation 
actually changes behavior within the usual western alphabet of ASCII characters.

 

Bert

 

 

From: Mikhail T. [mailto:mi+t...@aldan.algebra.com] 
Sent: woensdag 25 november 2015 23:19
To: dev@httpd.apache.org
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

On 25.11.2015 14:10, Mikhail T. wrote:

Two variables, LC_CTYPE and LC_COLLATE control this text processing behavior.  
The above is the correct lower case transliteration for Turkish.  In German, 
the upper case correspondence of sharp-S ß is 'SS', but multi-char translation 
is not provided by the simple tolower/toupper functions.

So, the concern is, some hypothetical header, such as X-ASSIGN-TO may, after 
going through the locale-aware strtolower() unexpectedly become x-aßign-to?

I just tested the above on both FreeBSD and Linux, and the results are 
encouraging:

% echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
strasse

Thus, I contend, using C-library will not cause invalid results, and the only 
reason to have Apache's own implementation is performance, but not correctness.

-mi



RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Bert Huijben
See http://www.siao2.com/2004/12/03/274288.aspx

And http://www.siao2.com/2013/04/04/10407543.aspx

For some background and related bugs in several products.

 

I hope this blog will stay alive. (The author passed away recently)

 

Bert

 

From: Bert Huijben [mailto:b...@qqmail.nl] 
Sent: donderdag 26 november 2015 00:22
To: dev@httpd.apache.org
Subject: RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

The example was the other way around. Changing SS to ß is not a valid 
transform, but the other way is. There are also transforms on the combined AE 
characters, etc.

 

That Turkish ‘I’ problem is the only case I know of where the collation 
actually changes behavior within the usual western alphabet of ASCII characters.

 

Bert

 

 

From: Mikhail T. [mailto:mi+t...@aldan.algebra.com] 
Sent: woensdag 25 november 2015 23:19
To: dev@httpd.apache.org
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

On 25.11.2015 14:10, Mikhail T. wrote:

Two variables, LC_CTYPE and LC_COLLATE control this text processing behavior.  
The above is the correct lower case transliteration for Turkish.  In German, 
the upper case correspondence of sharp-S ß is 'SS', but multi-char translation 
is not provided by the simple tolower/toupper functions.

So, the concern is, some hypothetical header, such as X-ASSIGN-TO may, after 
going through the locale-aware strtolower() unexpectedly become x-aßign-to?

I just tested the above on both FreeBSD and Linux, and the results are 
encouraging:

% echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
strasse

Thus, I contend, using C-library will not cause invalid results, and the only 
reason to have Apache's own implementation is performance, but not correctness.

-mi



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 6:45 PM, William A Rowe Jr 
wrote:

> On Nov 25, 2015 4:19 PM, "Mikhail T."  wrote:
> >
> > Thus, I contend, using C-library will not cause invalid results, and the
> only reason to have Apache's own implementation is performance, but not
> correctness.
>
> Well almost but wrong...
>
> The pure char-based ß processing produced no case change in my reviews of
> tolower/toupper in de_DE codeset. If you were to examine string comparison
> the collation order changes substantially.
>
> And more to the point, if tolower()/toupper() could handle not only mbcs
but multicharacter transliteration, your results would have varied.  1:1
character translations have their intrinsic limits.

> That said, I'm working up a comprehensive audit and other codeset/language
> combinations absolutely do.  Code and results forthcoming shortly.
>
As promised, here's a quick review based on the sbcs and utf8 code pages in
the very limited single-byte scope on my machine.

I did not touch the following mbcs because they require 'shift-state' to
toggle into and out of specific characters and that implies a lot of
calculated fuzzing that I didn't have time for this week.  (Since mod_ftp
explicit tls is still broken, I had no time for any of this, either ;-)  I
also didn't get to evaluating the wide chars yet that fall into the
traditional posix/c ascii range, which I still mean to do, and haven't yet
repeated this exercise on win32 or os/x, only on a somewhat multinational
configuration of fedora 22.

The source code is pretty rudimentary.  I used iconv to shove all of the
resulting text evaluation into utf-8 for the console/file output, it really
plays no part in the locality equation.  It can be adapted for testing
similar on an EBCDIC box with a bit of clever coding I never got to.

Untested: ja_JP.eucjp ja_JP.ujis japanese.euc ko_KR.euckr korean.euc
zh_CN.gb18030 zh_CN.gb2312 zh_CN.gbk zh_HK.big5hkscs zh_SG.gb2312 zh_SG.gbk
zh_TW.big5 zh_TW.euctw

Tested and exceptional results noted (source code attached);

LANG="aa_DJ.iso88591";
no surprises
LANG="af_ZA.iso88591";
no surprises
LANG="an_ES.iso885915";
no surprises
LANG="ar_AE.iso88596";
no surprises
LANG="ar_BH.iso88596";
no surprises
LANG="ar_DZ.iso88596";
no surprises
LANG="ar_EG.iso88596";
no surprises
LANG="ar_IQ.iso88596";
no surprises
LANG="ar_JO.iso88596";
no surprises
LANG="ar_KW.iso88596";
no surprises
LANG="ar_LB.iso88596";
no surprises
LANG="ar_LY.iso88596";
no surprises
LANG="ar_MA.iso88596";
no surprises
LANG="ar_OM.iso88596";
no surprises
LANG="ar_QA.iso88596";
no surprises
LANG="ar_SA.iso88596";
no surprises
LANG="ar_SD.iso88596";
no surprises
LANG="ar_SY.iso88596";
no surprises
LANG="ar_TN.iso88596";
no surprises
LANG="ar_YE.iso88596";
no surprises
LANG="ast_ES.iso885915";
no surprises
LANG="be_BY.cp1251";
no surprises
LANG="bg_BG.cp1251";
no surprises
LANG="br_FR.iso88591";
no surprises
LANG="br_FR.iso885915@euro";
no surprises
LANG="bs_BA.iso88592";
no surprises
LANG="ca_AD.iso885915";
no surprises
LANG="ca_ES.iso88591";
no surprises
LANG="ca_ES.iso885915@euro";
no surprises
LANG="ca_FR.iso885915";
no surprises
LANG="ca_IT.iso885915";
no surprises
LANG="cs_CZ.iso88592";
no surprises
LANG="cy_GB.iso885914";
no surprises
LANG="da_DK.iso88591";
no surprises
LANG="da_DK.iso885915";
no surprises
LANG="de_AT.iso88591";
no surprises
LANG="de_AT.iso885915@euro";
no surprises
LANG="de_BE.iso88591";
no surprises
LANG="de_BE.iso885915@euro";
no surprises
LANG="de_CH.iso88591";
no surprises
LANG="de_DE.iso88591";
no surprises
LANG="de_DE.iso885915@euro";
no surprises
LANG="de_LU.iso88591";
no surprises
LANG="de_LU.iso885915@euro";
no surprises
LANG="el_CY.iso88597";
no surprises
LANG="el_GR.iso88597";
no surprises
LANG="en_AU.iso88591";
no surprises
LANG="en_BW.iso88591";
no surprises
LANG="en_CA.iso88591";
no surprises
LANG="en_DK.iso88591";
no surprises
LANG="en_GB.iso88591";
no surprises
LANG="en_GB.iso885915";
no surprises
LANG="en_HK.iso88591";
no surprises
LANG="en_IE.iso88591";
no surprises
LANG="en_IE.iso885915@euro";
no surprises
LANG="en_NZ.iso88591";
no surprises
LANG="en_PH.iso88591";
no surprises
LANG="en_SG.iso88591";
no surprises
LANG="en_US.iso88591";
no surprises
LANG="en_US.iso885915";
no surprises
LANG="en_ZA.iso88591";
no surprises
LANG="en_ZW.iso88591";
no surprises
LANG="es_AR.iso88591";
no surprises
LANG="es_BO.iso88591";
no surprises
LANG="es_CL.iso88591";
no surprises

Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 9:44 PM, William A Rowe Jr 
wrote:

> LANG="ku_TR.iso88599";
>64 = @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>   ^ @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`ABCDEFGHİJKLMNOPQRSTUVWXYZ{|}~
>   v @abcdefghıjklmnopqrstuvwxyz[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
>   ?  *.  *'
>   192 = ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
>   ^ ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ×ØÙÚÛÜİŞßÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏĞÑÒÓÔÕÖ÷ØÙÚÛÜIŞÿ
>   v àáâãäåæçèéêëìíîïğñòóôõö×øùúûüişßàáâãäåæçèéêëìíîïğñòóôõö÷øùúûüışÿ
>   ? ... .*. ''' '*'
>

The translation here is pretty simple.  We display the ^ toupper() and the
v tolower() value of every character.  For the summary line '?', in normal
or -v verbose mode, ' ' suggests no translations at all, '.' means this ch
has a lower case translation, ' means the ch has an upper case translation,
but I strip most of these lines out while searching for the exceptional
cases...

'*' is the surprising case, the high bit character translation falls into
the ancient 0-127 code plane, or a ch 0-127 falls into the high bit plane,
or anything within the traditional 0-127 code plane translates into an
unexpected position.
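
The '*' rule described above can be sketched as a small check. This is not
the attached audit program: the sketch_* names are hypothetical, and the
expected 0-127 mappings assume an ASCII-compatible charset, so it would
need adapting for an EBCDIC box.

```c
#include <ctype.h>

/* What the "C"/POSIX locale does in the 0-127 range (ASCII assumed). */
static int sketch_expect_upper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

static int sketch_expect_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Returns 1 if character code c (0..255) would be flagged '*' under the
 * current LC_CTYPE: a 0-127 code that maps anywhere other than its
 * expected position, or a high-bit code that maps into 0-127. */
int sketch_is_surprise(int c)
{
    int up = toupper(c);
    int lo = tolower(c);

    if (c < 128)
        return up != sketch_expect_upper(c) || lo != sketch_expect_lower(c);
    return up < 128 || lo < 128;
}

/* Counts the flagged codes over the whole single-byte range. */
int sketch_count_surprises(void)
{
    int c, n = 0;

    for (c = 0; c < 256; ++c)
        n += sketch_is_surprise(c);
    return n;
}
```

Without any setlocale() call the process is in the "C" locale and the
count should be zero. To reproduce a table like the ones above, one would
call setlocale(LC_CTYPE, ...) with an installed single-byte locale before
counting; which locale names are available depends on the system.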

LANG="mt_MT.iso88593";
  128 =  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°ħ²³´µĥ·¸ışğĵ½ ż
  ^  Ħ˘£¤ Ĥ§¨İŞĞĴ­ Ż°Ħ²³´µĤ·¸IŞĞĴ½ Ż
  v  ħ˘£¤ ĥ§¨işğĵ­ ż°ħ²³´µĥ·¸ışğĵ½ ż
  ?  ..  *...  . ''  *'''  '

The last example above seems to indicate an isprint() validation error
or utf-8 mis-assignment in iconv, somewhere in the last 16 characters
of this code table, apparently between Ĵ­ and Ż :)


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 1:12 PM, Jim Jagielski  wrote:

>
> > On Nov 25, 2015, at 12:42 PM, William A Rowe Jr 
> wrote:
> >
> > On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski  wrote:
> > What is the current status? Is this on hold?
> >
> > It is looking for a good name.  I'm happy with apr_token_strcasecmp
> > to best indicate its use-case and provenance.  Does that work for
> > everyone?
>
> Still not super excited by the use of 'token' since it
> implies it should only be used for HTTP tokens and not
> in other cases where we use it to do ascii string comparisons
> (for example, when we check env-var settings or maybe directives)...
> yeah, they could also be lumped as 'tokens' I guess...
>
> ap_casecmpastr[n] for Case-insensitive CoMParison of Ascii STRing
>

APR has a naming pattern for various functional groups - this won't be the
last one that is impacted by POSIX-ing what should already be posix :)

Because this is (a) str[n]casecmp I'm pretty strongly against name mangling
for the sake of name mangling, our consumers are C programmers, after all.
Well, most of them anyways... and they should be familiar enough names
for the Lua and PHP folks too.

And this isn't ASCII actually, we established that we want EBCDIC build of
APR + HTTPD to have the same thing.  Not ASCII, but POSIX locale.  We
will be careful about the description on that count.

Still -0.5 on introducing an ap_function, in light of the current mess in
httpd.h.  I'm only 10% of the way through reviewing @deprecated on that
single header.


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Mikhail T.
On 25.11.2015 12:42, William A Rowe Jr wrote:
> If the script switches setlocale to turkish, for example, our
> forced-lowercase content-type conversion 
> will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the
> specs intended.
I'm sorry, could you elaborate on this? Would not strtolower(3) convert
"IMAGE/GIF" to "image/gif" in /all/ locales -- including "C"? At least,
in all single-byte charsets -- such as the Turkish ISO 8859-9? Yes, the
function will act differently on the strings containing octets above 127,
but those
would occur neither in content-types nor in header-names...
> Adding unambiguous token handling functions would be good for the few
> case-insensitive string comparison, string folding, and search
> functions.  It allows the spec-consumer to trust their string processing.
Up until now, I thought, the thread was about coming up with a short-cut
-- an optimization for processing tokens, like request-headers, which
are known to be in US-ASCII anyway and where using locale-aware
functions is simply wasteful -- but not incorrect.

You seem to imply, the locale-aware functions might be doing the wrong
thing some times -- and this confuses me...

Yours,

-mi



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Mikhail T.
On 25.11.2015 13:16, William A Rowe Jr wrote:
>
> Two variables, LC_CTYPE and LC_COLLATE control this text processing
> behavior.  The above is the correct lower case transliteration for
> Turkish.  In German, the upper case correspondence of sharp-S ß is
> 'SS', but multi-char translation is not provided by the simple
> tolower/toupper functions.
>
So, the concern is, some hypothetical header, such as X-ASSIGN-TO may,
after going through the locale-aware strtolower() unexpectedly become
x-aßign-to?

-mi



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Jim Jagielski

> On Nov 25, 2015, at 12:42 PM, William A Rowe Jr  wrote:
> 
> On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski  wrote:
> What is the current status? Is this on hold?
> 
> It is looking for a good name.  I'm happy with apr_token_strcasecmp
> to best indicate its use-case and provenance.  Does that work for 
> everyone?

Still not super excited by the use of 'token' since it
implies it should only be used for HTTP tokens and not
in other cases where we use it to do ascii string comparisons
(for example, when we check env-var settings or maybe directives)...
yeah, they could also be lumped as 'tokens' I guess...

ap_casecmpastr[n] for Case-insensitive CoMParison of Ascii STRing



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Jim Jagielski
My point is that we use it to compare, for example,
"FoobARski!" with "foOBArsKi!", not "Ébana?" with "ébana?" or "ebana?"

In that way I mean "ascii"

Heck, we may as well say that we really aren't comparing
"strings" at all, just arrays of 8bit characters :)

Anyway, that was my final post about the name... at this
point I'd just like to see the actual improvement get completely
folded in and used so we (and our users) can start enjoying the
benefit.

> On Nov 25, 2015, at 2:31 PM, William A Rowe Jr  wrote:
> 
> On Wed, Nov 25, 2015 at 1:12 PM, Jim Jagielski  wrote:
> 
> > On Nov 25, 2015, at 12:42 PM, William A Rowe Jr  wrote:
> >
> > On Wed, Nov 25, 2015 at 10:17 AM, Jim Jagielski  wrote:
> > What is the current status? Is this on hold?
> >
> > It is looking for a good name.  I'm happy with apr_token_strcasecmp
> > to best indicate its use-case and provenance.  Does that work for
> > everyone?
> 
> Still not super excited by the use of 'token' since it
> implies it should only be used for HTTP tokens and not
> in other cases where we use it to do ascii string comparisons
> (for example, when we check env-var settings or maybe directives)...
> yeah, they could also be lumped as 'tokens' I guess...
> 
> ap_casecmpastr[n] for Case-insensitive CoMParison of Ascii STRing
> 
> APR has a naming pattern for various functional groups - this won't be the 
> last
> one that is impacted by POSIX-ing what should already be posix :)
> 
> Because this is (a) str[n]casecmp I'm pretty strongly against name mangling
> for the sake of name mangling, our consumers are C programmers, after all.
> Well, most of them anyways... and they should be familiar enough names
> for the Lua and PHP folks too.
> 
> And this isn't ASCII actually, we established that we want EBCDIC build of
> APR + HTTPD to have the same thing.  Not ASCII, but POSIX locale.  We
> will be careful about the description on that count.
> 
> Still -0.5 on introducing an ap_function, in light of the current mess in 
> httpd.h.
> I'm only 10% of the way through reviewing @deprecated on that single header.
> 
> 



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Nov 25, 2015 12:00, "Mikhail T."  wrote:
>
> On 25.11.2015 12:42, William A Rowe Jr wrote:
>>
>> If the script switches setlocale to turkish, for example, our
forced-lowercase content-type conversion
>> will cause "IMAGE/GIF" to become "ımage/gıf", clearly not what the specs
intended.
>
> I'm sorry, could you elaborate on this? Would not strtolower(3) convert
"IMAGE/GIF" to "image/gif" in all locales -- including "C"? At least, in
all single-byte charsets -- such as the Turkish ISO 8859-9? Yes, the
function will act differently on the strings containing octets above 127,
but those would occur neither in content-types nor in header-names...

Two variables, LC_CTYPE and LC_COLLATE control this text processing
behavior.  The above is the correct lower case transliteration for
Turkish.  In German, the upper case correspondence of sharp-S ß is 'SS',
but multi-char translation is not provided by the simple tolower/toupper
functions.

Consider this is a function of language, and not of 'charset' per se.  The
same charset behaves differently based on the locale's language.
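
To make the content-type example concrete, here is a minimal sketch of a
lower-casing routine that never consults LC_CTYPE. The name
sketch_cstr_tolower is hypothetical (the naming is still being debated in
this thread), and the range check assumes an ASCII-compatible charset.

```c
/* Lower-case a NUL-terminated string in place, touching only 'A'..'Z'.
 * Octets outside that range, including everything above 127, are left
 * exactly as they are, whatever the current locale says. */
void sketch_cstr_tolower(char *s)
{
    for (; *s != '\0'; ++s) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)(*s + ('a' - 'A'));
    }
}
```

With this, "IMAGE/GIF" always becomes "image/gif", with a dotted 'i', even
if a module has switched the process to a Turkish locale.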

>> Adding unambiguous token handling functions would be good for the few
case-insensitive string comparison, string folding, and search functions.
It allows the spec-consumer to trust their string processing.
>
> Up until now, I thought, the thread was about coming up with a short-cut
-- an optimization for processing tokens, like request-headers, which are
known to be in US-ASCII anyway and where using locale-aware functions is
simply wasteful -- but not incorrect.

Partially so, that was the motivation behind the proposal.  Apparently OS/X
in particular has a slow implementation of strcasecmp even running under
the Posix locale.

> You seem to imply, the locale-aware functions might be doing the wrong
thing some times -- and this confuses me...

Until the APR consumer, including an instance of httpd, actually calls
setlocale(), everything should be behaving as expected.  If your in-process
code under httpd calls setlocale() to customize its behavior based on the
HTTP consumer's locale, that is when things may go badly under the hood in
both httpd and in APR.

But yes, I flagged this to the security team almost immediately and then
had to research what could introduce such a vulnerability of accepting
unexpected input and treating it as valid ASCII.  I was less concerned with
treating valid ASCII as opaque text which would be rejected.


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 1:50 PM, Jim Jagielski  wrote:

> My point is that we use it to compare, for example,
> "FoobARski!" with "foOBArsKi!", not "Ébana?" with "ébana?" or "ebana?"
>
> In that way I mean "ascii"
>

But that isn't precisely what you wrote.  It happens to be ASCII here because
we are corresponding in some offshoot of ISO646 (and yours might be different
than mine, but gmail resolved it).  We have an EBCDIC modality for APR and
for httpd, and in that case, it is exactly what you meant (only A-Z, a-z) but
not what you wrote :)


> Heck, we may as well say that we really aren't comparing
> "strings" at all, just arrays of 8bit characters :)
>

True, almost.  That's why I reiterated that the 8-bit values are largely
opaque to the "C" locale.  As long as A-Z, a-z behave as 'we' expect
with protocol conformance, we aren't so worried about the rest, and
that applies equally in ASCII or EBCDIC or Baudôt.

> Anyway, that was my final post about the name... at this
> point I'd just like to see the actual improvement get completely
> folded in and used so we (and our users) can start enjoying the
> benefit.
>

In terms of your perceived optimization, I'd hate to have the result that
OS/X hobbled users gain a faster strcmp implementation while others realize a
slower implementation, and I'm thinking of non-locale aware BSD.  That's
often the case with "optimizations" that don't consider what the clib
maintainers are able to accomplish with specialized knowledge of the
target architecture (especially MMX operations across arrays of characters).

And we long ago decided that APR really isn't cut out for those sorts of
optimizations.

I'm also waiting for feedback about the naming convention, I'd like to get
this into APR yesterday and start building on it, but it's hard to name our
generic-posix tolower/toupper until we agree on the naming scheme :)


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Christophe JAILLET

Le 25/11/2015 22:02, Jim Jagielski a écrit :

In general, strcmp() is not implemented via strcmp.c
(although if you do a source code search for strcmp, that's
what you'll get). Most of the time it's implemented in
assembly (strcmp.s) or simply leverages memcmp() where
you aren't doing a byte by byte comparison but are doing
a native memory word (32 or 64bit) comparison. This
makes them super fast.

Once we need to worry about case insensitivity, then
we see a whole gamut of implementations; some use
a mapped array as I did; some go char by char and call
tolower() on each one; some do other things such as
testing if isupper() before calling tolower() if needed.
The word-based optimizations seem less viable, as seen
in test results that I ran and Yann also verified (afaict)
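
Two of the approaches listed above, sketched side by side for comparison.
These are illustrations with hypothetical sketch_* names, not the code
that was benchmarked; the map assumes an ASCII-compatible charset and must
be initialised once before use.

```c
#include <ctype.h>

/* Approach 1: a mapped array. Each octet indexes a 256-entry table that
 * folds 'A'..'Z' to 'a'..'z' and maps every other value to itself. */
static unsigned char sketch_fold_map[256];

void sketch_fold_map_init(void)
{
    int c;

    for (c = 0; c < 256; ++c)
        sketch_fold_map[c] = (unsigned char)c;
    for (c = 'A'; c <= 'Z'; ++c)
        sketch_fold_map[c] = (unsigned char)(c + ('a' - 'A'));
}

int sketch_casecmp_mapped(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    for (;;) {
        int c1 = sketch_fold_map[*p1];
        int c2 = sketch_fold_map[*p2];
        if (c1 != c2)
            return c1 - c2;
        if (c1 == '\0')
            return 0;
        ++p1;
        ++p2;
    }
}

/* Approach 2: char by char through tolower(). The result follows the
 * current LC_CTYPE, so it only matches approach 1 in the "C" locale. */
int sketch_casecmp_tolower(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    for (;;) {
        int c1 = tolower(*p1);
        int c2 = tolower(*p2);
        if (c1 != c2)
            return c1 - c2;
        if (c1 == '\0')
            return 0;
        ++p1;
        ++p2;
    }
}
```

The table costs 256 bytes and one memory load per character; the
tolower() variant goes through the C library for every character and
inherits whatever mappings the active locale defines.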

In my tests, my impl was faster on OSX and CentOS5 and 6.
It's a very common function we use and with my test results
it seemed to make sense to provide our own impl, esp if
we decided that what we were really concerned about was
comparing for equality, and so would be able to avoid
the !strcasecmp logic leaping.

If we decide that all this is moot, that's fine.
That's what these types of investigations and discussions
are for.



Personally, my testing shows that faster/slower is not that self-evident.
On my machine, it depends on the length of the string.
With shorter strings (less than ~10 chars) Yann's proposal seems to be 
the best with the test program. What happens if the const char table is 
not in L1 cache? Do we still have the same speedup?

When strings are longer, std strncasecmp always wins.

Short strings are our use case, so, I would say, why not use this 
implementation, after all?



My personal reservations would be:
   - it adds complexity to the code (one more function that looks
really similar to existing ones)
   - the speed increase is 'only' 15%, if I remember correctly the
latest numbers given by Yann
   - the speed increase is potentially platform/compiler/C library
dependent.
   - it does not remove (IMO) the need for the 'switch' to get even
faster to the right test
   - many of the tests against ASCII strings are hidden in apr
functions (apr_table_get...)
Do we have an idea of the overall time spent in these str[n]casecmp
functions when processing a request?  15% of that time should be, IMO,
quite low.
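
The 'switch' mentioned in the list above can be sketched as a dispatch on
the first character, done before any caseless comparison runs. This is
only an illustration: sketch_port_of_scheme() is a hypothetical name, not
the httpd or APR function, the three schemes and their default ports are
just examples, and strcasecmp() from POSIX <strings.h> is assumed to be
available.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp(), POSIX */

/* Hypothetical lookup: returns a default port, or 0 if unknown.
 * The switch narrows the candidates to the few names sharing a first
 * letter, so most inputs need at most one or two full comparisons. */
int sketch_port_of_scheme(const char *scheme)
{
    if (scheme == NULL) {
        return 0;
    }
    switch (scheme[0]) {
    case 'f': case 'F':
        if (strcasecmp(scheme, "ftp") == 0) {
            return 21;
        }
        break;
    case 'h': case 'H':
        if (strcasecmp(scheme, "http") == 0) {
            return 80;
        }
        if (strcasecmp(scheme, "https") == 0) {
            return 443;
        }
        break;
    default:
        break;
    }
    return 0;
}
```

Whatever comparison function ends up being used inside each case, the
dispatch itself stays, which is the point of that list item.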

Is it worth the added complexity? For me, the answer is: not sure.

CJ


RE: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Bert Huijben
We have a set of similar comparison functions in Subversion. I’m pretty sure we
already had these at the time we still had EBCDIC support on trunk.

(We removed that support years ago, but the code should still live on a branch)

Bert

From: William A Rowe Jr [mailto:wr...@rowe-clan.net] 
Sent: Wednesday, 25 November 2015 22:55
To: httpd <dev@httpd.apache.org>
Subject: Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

 

On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET
<christophe.jail...@wanadoo.fr> wrote:

Hi,

just in case, GNOME has a set of functions g_ascii_...
(see
https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)

Interesting, does anyone know offhand whether these perform the expected
or the stated behavior under EBCDIC environments?


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Mikhail T.
On 25.11.2015 14:10, Mikhail T. wrote:
>>
>> Two variables, LC_CTYPE and LC_COLLATE control this text processing
>> behavior.  The above is the correct lower case transliteration for
>> Turkish.  In German, the upper case correspondence of sharp-S ß is
>> 'SS', but multi-char translation is not provided by the simple
>> tolower/toupper functions.
>>
> So, the concern is, some hypothetical header, such as X-ASSIGN-TO may,
> after going through the locale-aware strtolower() unexpectedly become
> x-aßign-to?
I just tested the above on both FreeBSD and Linux, and the results are
encouraging:

% echo STRASSE | env LANG=de_DE.ISO8859 tr '[[:upper:]]' '[[:lower:]]'
strasse

Thus, I contend, using C-library will not cause invalid results, and the
only reason to have Apache's own implementation is performance, but not
correctness.
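
For comparison, a fold that never consults the locale can be sketched as
below. It is only an illustration of the alternative being discussed:
sketch_ascii_strtolower() is a hypothetical name, not an existing APR or
httpd function, and an ASCII execution character set is assumed (the
range test would not be correct on EBCDIC).

```c
/* Hypothetical helper: lowercases A-Z in place and leaves every other
 * byte untouched, whatever setlocale() has been set to. */
void sketch_ascii_strtolower(char *s)
{
    for (; *s != '\0'; s++) {
        if (*s >= 'A' && *s <= 'Z') {
            *s = (char)(*s + ('a' - 'A'));
        }
    }
}
```

Applied to the hypothetical header X-ASSIGN-TO this yields x-assign-to,
and a byte such as 0xDF (sharp-S in ISO 8859-1) passes through unchanged.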

-mi



Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 2:06 PM, Jacob Champion wrote:

> My two cents: I agree that another "name mangled" abbreviation is not
> particularly helpful, but I also agree with Jim's concern: "apr_token" made
> me immediately wonder what made this exclusive to HTTP tokens.
> Unfortunately I don't have much of an alternative suggestion. I have seen
> other frameworks refer to an "invariant" or "independent" locale/culture
> before; maybe that helps jog someone's creativity?
>
> Feel free to ignore my rambling. ;) My default naming strategy is to throw
> out random ideas.
>

I was thinking "_lcc_" for locale-C, short and to the point, but meaningful?

"_lcposix_" is also descriptive but long, and "_posix_" is much too
overloaded with multiple meanings and implications (locales *are* a
posix API :)


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Jim Jagielski
In a library that has:

apr_pstrdup()
apr_pstrndup()
apr_pstrmemdup()

and apr_pstrmemdup() and apr_pstrndup() are functionally
the same, as well as:

apr_strnatcasecmp()
apr_strnatcmp()

neither of which use an 'n' variable to determine string
size, yet is called 'strn...' whereas the dups use that
'n' in 'strndup' to signify that we have a size parameter
BUT its functionally equiv function apr_pstrmemdup() is
called what it is instead of apr_pstrnmemdup()...

... I think we are WAY overthinking naming here.


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Jacob Champion
On Nov 25, 2015 1:10 PM, "Jim Jagielski"  wrote:
> ... I think we are WAY overthinking naming here.

I overthink naming constantly, so there's an excellent chance that you're
absolutely correct! That said... your list only ended up convincing me that
APR needs better naming conventions. ;-D

(I do really appreciate this discussion, though. I promise I'm not trying
to stir the pot.)

--Jacob


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Christophe JAILLET

Hi,

just in case, GNOME has a set of functions g_ascii_...
(see 
https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp)




I'm also waiting for feedback about the naming convention, I'd like to get
this into APR yesterday and start building on it, but it's hard to name our
generic-posix tolower/toupper until we agree on the naming scheme :)


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Jacob Champion
My two cents: I agree that another "name mangled" abbreviation is not
particularly helpful, but I also agree with Jim's concern: "apr_token" made
me immediately wonder what made this exclusive to HTTP tokens.
Unfortunately I don't have much of an alternative suggestion. I have seen
other frameworks refer to an "invariant" or "independent" locale/culture
before; maybe that helps jog someone's creativity?

Feel free to ignore my rambling. ;) My default naming strategy is to throw
out random ideas.

--Jacob
(on mobile, apologies for strange formatting)


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 3:52 PM, Christophe JAILLET <
christophe.jail...@wanadoo.fr> wrote:

> Hi,
>
> just in case, GNOME has a set of functions g_ascii_...
> (see
> https://developer.gnome.org/glib/2.28/glib-String-Utility-Functions.html#g-ascii-strcasecmp
> )


Interesting, does anyone know offhand whether these perform the expected
or the stated behavior under EBCDIC environments?


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread Jim Jagielski
In general, strcmp() is not implemented via strcmp.c
(although if you do a source code search for strcmp, that's
what you'll get). Most of the time it's implemented in
assembly (strcmp.s) or simply leverages memcmp() where
you aren't doing a byte by byte comparison but are doing
a native memory word (32 or 64bit) comparison. This
makes them super fast.

Once we need to worry about case insensitivity, then
we see a whole gamut of implementations; some use
a mapped array as I did; some go char by char and call
tolower() on each one; some do other things such as
testing if isupper() before calling tolower() if needed.
The word-based optimizations seem less viable, as seen
in test results that I ran and Yann also verified (afaict)

In my tests, my impl was faster on OSX and CentOS5 and 6.
It's a very common function we use and with my test results
it seemed to make sense to provide our own impl, esp if
we decided that what we were really concerned about was
comparing for equality, and so would be able to avoid
the !strcasecmp logic leaping.

If we decide that all this was moot, that's fine.
That's what these types of investigations and discussions
are for.
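
The 'mapped array' variant described above can be sketched roughly as
follows. This is a minimal illustration, not the implementation that was
benchmarked in this thread: the names sketch_casecmp() and sketch_caseeq()
are hypothetical, an ASCII execution character set is assumed, and the
table is filled lazily (not thread-safe) only to keep the example short;
a real version would more likely use a static const table.

```c
/* Lookup table: identity for every byte except A-Z, which map to a-z. */
static unsigned char lower_map[256];
static int lower_map_ready = 0;

static void lower_map_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        lower_map[i] = (unsigned char)i;
    }
    for (i = 'A'; i <= 'Z'; i++) {
        lower_map[i] = (unsigned char)(i + ('a' - 'A'));
    }
    lower_map_ready = 1;
}

/* Hypothetical locale-independent caseless compare: returns <0, 0 or >0
 * like strcasecmp(), but only ever folds the ASCII letters. */
int sketch_casecmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    if (!lower_map_ready) {
        lower_map_init();
    }
    for (;;) {
        int c1 = lower_map[*p1];
        int c2 = lower_map[*p2];
        if (c1 != c2 || c1 == 0) {
            return c1 - c2;
        }
        p1++;
        p2++;
    }
}

/* Equality-only wrapper, so callers can avoid the !strcasecmp() idiom. */
int sketch_caseeq(const char *s1, const char *s2)
{
    return sketch_casecmp(s1, s2) == 0;
}
```

A caller would then write if (sketch_caseeq(name, "Content-Length"))
rather than if (!strcasecmp(name, "Content-Length")).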


Re: apr_token_* conclusions (was: Better casecmpstr[n]?)

2015-11-25 Thread William A Rowe Jr
On Wed, Nov 25, 2015 at 3:10 PM, Jim Jagielski  wrote:

> In a library that has:
>
> apr_pstrdup()
> apr_pstrndup()
> apr_pstrmemdup()
>

which are all semantically and mechanically different...


> and apr_pstrmemdup() and apr_pstrndup() are functionally
> the same,


Are you arguing to remove pstrmemdup?  That's a discussion to have
before APR 2.0, certainly, but it isn't functionally the same; bytes within
the copied pstrmemdup may be null, and it has a trailing null appended,
quite different than a memdup.
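
The difference between the two can be sketched with malloc-based
stand-ins. These are not the APR functions themselves (those allocate from
an apr_pool_t); sketch_strndup() and sketch_strmemdup() are hypothetical
names that only mirror the behaviour described here.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for the pstrndup behaviour: copies at most n bytes, but
 * stops early at the first NUL found within those n bytes. */
char *sketch_strndup(const char *s, size_t n)
{
    const char *end = memchr(s, '\0', n);
    char *res;

    if (end != NULL) {
        n = (size_t)(end - s);
    }
    res = malloc(n + 1);
    if (res == NULL) {
        return NULL;
    }
    memcpy(res, s, n);
    res[n] = '\0';
    return res;
}

/* Stand-in for the pstrmemdup behaviour: copies exactly n bytes, even
 * if some of them are NUL, and appends one trailing NUL. */
char *sketch_strmemdup(const char *s, size_t n)
{
    char *res = malloc(n + 1);

    if (res == NULL) {
        return NULL;
    }
    memcpy(res, s, n);
    res[n] = '\0';
    return res;
}
```

For a buffer with an embedded NUL, the first returns a shorter string
while the second keeps all n bytes plus the terminator. A plain memdup
would copy the n bytes without adding a terminator at all.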


> as well as:
>
> apr_strnatcasecmp()
> apr_strnatcmp()
>
> neither of which use an 'n' variable to determine string
> size,


So there isn't a strnnat[case]cmp() function, are you offering a patch?


> yet is called 'strn...'


Indeed, possible to trip over with grep, for sure, but what is an 'atcmp'?
Seems clear enough to me, but are you proposing a rename?  APR 2.0
is the right time for that...


> whereas the dups use that
> 'n' in 'strndup' to signify that we have a size parameter
>

Indeed... follows the general stdc pattern.


> BUT its functionally equiv function apr_pstrmemdup() is
> called what it is instead of apr_pstrnmemdup()...
>
> ... I think we are WAY overthinking naming here.
>

People may be overthinking, and stumbling to come up with the
most concise and accurate name.  Renaming suggestions and
deprecation of the old names are welcome.  These are good
discussions to have, we made many improvements between
APR 0.9.x and APR 1.0.0 for exactly these reasons.

I agree we can call your proposal apr_str[n]casecmp because it
is a str[n]casecmp implementation - however, that doesn't tell the
user that it is an "unusual" but equivalent function that breaks from
posix in that it deliberately chooses not to use the locale and is
primarily for wire protocols.  Thus the _token_ suggestion, but
I am open to other uniquifiers.  I'm not keen on coming up with
a new mishmash of str case len cmp equality blah that will be
harder for reviewers to decipher when reviewing commits.

I know you are in a hurry to just do something, but usually the stuff
we just hurry through many of us regret later, as piles of httpd.h cruft
can attest.  How many headers do you know that contain explicit
sighs?  APR has attempted to be more deliberate in its naming
conventions, by consensus.

You've certainly voiced your ire many times at APR's unwillingness
to just modify an API within a major.minor revision, expressed very
little confidence that waiting for an APR release is ever a good idea,
and might even be perceived as hostile toward the entire APR
approach - which has never offered the shoot-from-the-hip approach
that earlier httpd releases enjoyed.  But these decisions were put
down in reaction to frequent breakage for developers prior to httpd 2,
and are in place precisely because APR wants other developers
beyond the world of httpd to have trust and confidence in the API
they are coding to.  Hopefully httpd module authors can enjoy the
same level of confidence.