HiTim,

On Thu, Sep 17, 2020 at 07:27:55PM +0200, Tim Düsterhus wrote:
> Hi List,
> 
> I'd like to discuss something that's bothering me for quite some time
> now: The naming, namespacing and discoverability of fetches and converters.
> 
> Frankly, it's a small mess right now. I assume this is due to historic
> growth an features being added over time.

I don't think anyone will disagree with this!

> Let me give a few examples:
> 
> I want to encoded something using base64? I use the `base64` converter.
> Decoding? **Obviously** that's `b64dec` ;)

That's a perfect example of something initially created to do one
thing and taking too generic a name ("base64"). Adding the decoder
years later proved to be difficult. I even think that by then we
didn't yet easily support arguments for converters, so it wasn't
trivial to indicate in base64() what direction to use, plus it
wouldn't have helped much anyway, given that what you write usually
means something in a context and might not be the correct form anyway.

> Hex encoding? Easily done with `hex`. Decoding back to a string? Not
> possible. But you can to `hex2i` to turn a hexadecimal string into an
> integer.

This one was recently discussed as well, I even though we had "unhex"
or something like this. Maybe it appeared in a private discussion with
someone, I don't remember.

> We have `url_dec`. But there isn't an `url_enc` or maybe I am just not
> able to find it.

I'm pretty sure it appeared in a private discussion recently as well,
in which I explained that it does have the exact same issue as url_dec
in that it needs to work differently depending on the part that's being
encoded and as such has to be very flexible.

> Then there's quite a few redundant converters. We have `sha1`, then
> `sha2` and most recently we have `digest`.

>From what I'm reading in the commit message (the doc not being clear
on this point), they don't seem to overlap as "digest" seems to return
an HMAC. In addition, "sha1" doesn't require openssl so it comes for
free. But anyway I get your point and generally agree.

> Then within the description of the converters the variable naming and
> scoping is explained over and over again for each converter that takes
> variable names.

Did you also notice that copy-pasting is quite a common practice in the
doc :-/

> For fetches it's not as bad, many have proper prefixes.

Initially there were very few converters, they were usable and needed
only for stick tables, so these were ipmask(), lower(), upper() and
that was about all. These slowly expanded without any general direction
being observed, making the naming quite not trivial. For fetches it's
different because you have to indicate where you want to retrieve the
information, and actually for some of them, depending on the mode
haproxy was working, very early it was possible to have some confusion
about that, so they needed to be a bit more precise.

> But there's also
> a difference in naming. Sometimes the namespace is separated by a dot
> (`req.hdr`). Sometime it's an underscore (`ssl_fc`).

Yep. The dot was added later to try to group all those working within
the same domain (e.g. the request or the response). We also (re)discovered
that the automatic binding making them directly available from Lua is less
transparent with the dot as that one has to be turned to an underscore
there. But it's less common and the dot still appears much more readable
to me.

> Long story short: Is it feasible to update the canonical names of
> fetches and converters, preserving backwards compatibility with old
> versions and adding the missing "duals" for encoders / decoders or do we
> have to live with the current situation.

We can *try*. Trust me, it is *extremely hard* to correctly name things.
A significant part of the development job is to find correct names for
variables, functions, struct members, and worse, config options! You want
non-confusing names that accurately represent what you're designating with
a very limited risk that it later degenerates into something broader and
confusing again.

> -------
> 
> Let me give a proposal of what makes sense to me:
> 
> Make 's.' contain string related converters.
> 
> s.field() instead of field()
> s.word() instead of word()
> s.json() instead of json()

Maybe, why not for these ones. But then json() alone cannot stay
this way and needs to be called json-enc() or something like this
since it only encodes and doesn't decode.

> Make 'http.' contain HTTP related converters.
> 
> http.language() instead of language()

This already becomes difficult. http is very wide and touches all
areas. For example you'll use http.date() to encode an HTTP date
while plenty of users will search for it in "date.something" because
it's just one way to represent a date, or "time.something" because a
date is first and foremost something related to time.

> Make 'math.' contain mathematics related converters.
> 
> math.add() instead of add()
> math.sub() instead of sub()

And here I'd say that it will turn a number of expressions to
something very hard to read. We purposely tried to keep such
operators quite compact to keep expressions readable. I'd argue
that ",mul(3),add(21),div(32)" is quite more readable than
",math.mul(3),math.add(21),math.div(32)". I'm not fundamentally
opposed, but I'm pretty sure that because of this nobody will ever
use the long form, and we'll have added more noise in the doc
instead. It reminds me of a Cisco training 20 years ago where the
trainer told us never to use "write mem" and to instead use the more
versatile "copy running-config startup-config", but nobody ever uses
this, it's error-prone and much more painful to use, so even 20 years
later "write mem" remains the standard way to save.

> Make 'enc.' contain encoding related converters.
> 
> enc.base64() instead of base64()
> enc.unbase64() instead of b64dec()
> enc.hex() instead of hex()
> enc.unhex() being the new reverse of hex()

Similarly, "encoding" is something broad, it's actually all about what
converters do, they convert from a representation to another one, so
defining where it stops isn't trivial, but could be attempted. Some will
consider that aes_gcm_{enc,dec} is about encoding/decoding while others
will consider it in crypto and others in security.

> Make 'sec.' contain security related converters.
> 
> sec.digest() instead of digest()
> deprecate sha1, sha2 in favor of digest()

Actually I'd rather have "hash.something" for everything related to
hashing, as most of the time it has nothing to do with security, and
making people think that "djb2" or "crc32" are secure is a problem for
me, and personnaly I don't see a reason why these ones wouldn't appear
next to "sha1" or "digest".

> sec.secure_memcmp() instead of secure_memcmp()

Maybe "sec.memcmp()" or s.secure_memcmp() since it's about strings ?

> You get the concept.

Absolutely, it's just as as you can see above there isn't one single
way to classify them, and that it many of them will be considered harder
to use and will be ignored.

> The biggest benefit to namespacing for me is that related converters
> appear below each other within configuration.txt. The most commonly
> required converters could still remain in the default namespace (or
> perhaps rather aliased into the default namespace). I don't believe that
> 's.' for the string stuff is too bad, though.

Now I'm wondering if we shouldn't consider a variant of this proposal.
Apparently the main issue is with finding them, so they should be better
classified. We could imagine having them sorted by use case in the doc,
even with redundancy, and for each of them, suggesting a long and a short
name (this one optional). We could then consider that certain name first
and foremost belong to one category (often related to what file it's
implemented in), and inherits the long name from this category, but also
has a short name. The keyword wouldn't be decribed in details there, it
would just be an index, allowing the same one to appear at multiple
places. You could then have:

 Security-oriented converters
   long name          | short name  | purpose
   -------------------+-------------+---------------------------------------
   crypto.aes_gcm_enc | aes_gcm_enc | encode binary input using AES_GCM
   crypto.aes_gcm_dec | aes_gcm_dec | decode binary input using AES_GCM
   crypto.sha2        | sha2        | hashes binary input using SHA2

...
 Encoding functions
   long name          | short name  | purpose
   -------------------+-------------+---------------------------------------
   crypto.aes_gcm_enc | aes_gcm_enc | encode binary input using AES_GCM
   crypto.aes_gcm_dec | aes_gcm_dec | decode binary input using AES_GCM
   s.tolower          | lower       | lower case input string
   s.toupper          | upper       | lower case input string
   s.b64enc           | base64      | encode using base64
   s.b64dec           | b64dec      | decode using base64
   http.url_enc       | url_enc     | URL-encode binary input
...
 Hashing function
   long name          | short name  | purpose
   -------------------+-------------+---------------------------------------
   hash.crc32         | crc32       | hashes binary input using crc32 algo
   hash.djb2          | djb2        | hashes binary input using djb2 algo
   hash.wt6           | wt6         | hashes binary input using wt6 algo
   hash.sha1          | sha1        | hashes binary input using SHA1
   crypto.sha2        | sha2        | hashes binary input using SHA2
   crypto.digest      | digest      | hashes binary input using configurable 
algos
...

So there will definitely be some redundancy, which translates the reality,
and the keywords will only be indexed using their long names. As such they
will be much easier to find depending on what a user is trying to do, it
could easily be said at one place that all the crypto.xxx stuff is only
available when openssl is enabled, and overall we could simplify the search
there.

What do you think ?

Willy

Reply via email to