Re: [dev] Some misc tools that could help others

2022-09-21 Thread Hadrien Lacour
On Wed Sep 21, 2022 at 8:21 PM CEST, NRK wrote:
> On Wed, Sep 21, 2022 at 04:24:28PM +, Hadrien Lacour wrote:
> > The other "interesting" one is genhtab, an alternative to gperf that is much
> > simpler to use and produces smaller binaries for big tables (but slower, cf
> > genhtab_bench) using a simple chained hashing table (with some tricks due to
> > being built AOT) using fnv1a; I'll probably try with XXH3 soon, to see if 
> > the
> > speed issue can be mitigated without too much binary bloating.
>
>   const uint32_t bkt =  hash & (ht.len - 1);
>
> Due to mul being the last step in fnv1a, the entropy is stored in the
> high bits, so you want to be masking the high bits and not the low ones.
Huh, makes sense, I was somehow counting on it having a good distribution.

> And for statically generated hash-tables (as well as in general) I tend
> to prefer open-addressing rather than chaining. For probing I prefer
> linear probe with robin-hood insertion[0] for statically generated
> tables.
>
Never liked the conceptual complexity of open addressing (and how you can lock
on a single bucket, in read-write MT scenarios), but it could be better, indeed.
Another thing I like about chained hashing, is how you can just change the
bucket type according to your read/write performance needs (vector, skiplist,
avl tree, etc...).

> As for why: linear probing nicely exploits spatial locality[1] and
> robin-hood can reduce the max probe size by alleviating clustering[2].
> This leads to both faster worst-case lookups as well as less memory
> usage.
In this specific case, there's not much cache improvement since the bucket
contents are stored sequentially, without pointers except for the keys.

> For run-time tables where you don't know the inputs, double-hashing
> might be a better approach[3]. But for statically generated one, I
> wouldn't use double-hashing since you can easily change the hash
> function until it's producing good results (e.g for fnv1a changing the
> starting state or the multiplier, or both[4]).
>
> Also one more trick that can reduce memory usage is to eliminate pointer
> overhead and use an array of arrays, i.e using `char [][n]` instead of
> `char *[]`. This isn't _guaranteed_ to reduce mem usage though, so
> you'll have to do some calculation to figure it out, i.e:
>
>   /* approximate mem usage for `char *[]` */
>   for (sum = 0, i = 0; i < LEN(input); ++i)
>   sum += (strlen(input[i]) + 1) + sizeof (char *);
>   /* vs char[][max_input_strlen + 1] */
>   sum = LEN(input) * (max_input_strlen + 1);
>
That's a very special case, but worth thinking about, indeed.

To be honest, performance wasn't the focus, since I can't imagine beating
gperf -lC.



Re: [dev] Some misc tools that could help others

2022-09-21 Thread NRK
On Wed, Sep 21, 2022 at 04:24:28PM +, Hadrien Lacour wrote:
> The other "interesting" one is genhtab, an alternative to gperf that is much
> simpler to use and produces smaller binaries for big tables (but slower, cf
> genhtab_bench) using a simple chained hashing table (with some tricks due to
> being built AOT) using fnv1a; I'll probably try with XXH3 soon, to see if the
> speed issue can be mitigated without too much binary bloating.

const uint32_t bkt =  hash & (ht.len - 1);

Due to mul being the last step in fnv1a, the entropy is stored in the
high bits, so you want to be masking the high bits and not the low ones.

Alternatively you can change the hash function to mix the high bits with
the low ones just before returning:

hash ^= hash >> 33; /* for the 64bit version. */

But I'd much rather just mask the high bits instead.

And for statically generated hash-tables (as well as in general) I tend
to prefer open-addressing rather than chaining. For probing I prefer
linear probe with robin-hood insertion[0] for statically generated
tables.

As for why: linear probing nicely exploits spatial locality[1] and
robin-hood can reduce the max probe size by alleviating clustering[2].
This leads to both faster worst-case lookups as well as less memory
usage.

For run-time tables where you don't know the inputs, double-hashing
might be a better approach[3]. But for statically generated one, I
wouldn't use double-hashing since you can easily change the hash
function until it's producing good results (e.g for fnv1a changing the
starting state or the multiplier, or both[4]).

Also one more trick that can reduce memory usage is to eliminate pointer
overhead and use an array of arrays, i.e using `char [][n]` instead of
`char *[]`. This isn't _guaranteed_ to reduce mem usage though, so
you'll have to do some calculation to figure it out, i.e:

/* approximate mem usage for `char *[]` */
for (sum = 0, i = 0; i < LEN(input); ++i)
sum += (strlen(input[i]) + 1) + sizeof (char *);
/* vs char[][max_input_strlen + 1] */
sum = LEN(input) * (max_input_strlen + 1);

Off the top of my head, these would be some of the major tricks I use
when building a statically generated hash-tables. And yeah, I prefer
building it on a case by case basis instead of using some pre-made
generator :)

[0]: https://programming.guide/robin-hood-hashing.html
[1]: https://en.wikipedia.org/wiki/Locality_of_reference
[2]: https://en.wikipedia.org/wiki/Primary_clustering
[3]: https://nullprogram.com/blog/2022/08/08/
[4]: 
https://github.com/jarun/nnn/blob/de3ad1b14618392753051959b2e7ac52f70252a1/src/icons-hash.c#L124-L125

- NRK



[dev] Some misc tools that could help others

2022-09-21 Thread Hadrien Lacour
Hello,
I've decided to post about some miscellaneous tools (C, sh, AWK) I made to ease
my day to day computing/programming; some of them finally being in a shape I
won't be embarrassed too much by.

┌┐
│ https://git.sr.ht/~q3cpma/misc-tools   │
││
│ * genhtab Generate static C99 hash tables  │
│ * htmldecode  HTML decoding to UTF-8   │
│ * htmlencode  HTML encoding from UTF-8 │
│ * mbcut   Multibyte aware string trimming  │
│ * natsort Natural sorting for UTF-8│
│ * urldecode   URL decoding │
│ * urlencode   URL encoding │
│ * wcswidthwcswidth(3) wrapper  │
││
│ Note: for simplicity, Unicode handling is limited to code points, treating │
│ combining characters and emoji as a sequence of code points instead of a   │
│ complete grapheme. │
└┘
Most useful one would be natsort, due to the constant need for it and lack of
alternatives baring scripting languages packages like Python (natsort), Perl
(Sort::Naturally) or Tcl (lsort -dictionary, built-in).

The other "interesting" one is genhtab, an alternative to gperf that is much
simpler to use and produces smaller binaries for big tables (but slower, cf
genhtab_bench) using a simple chained hashing table (with some tricks due to
being built AOT) using fnv1a; I'll probably try with XXH3 soon, to see if the
speed issue can be mitigated without too much binary bloating.

┌──┐
│ https://git.sr.ht/~q3cpma/scripts│
│  │
│ Collection of sh, AWK scripts with an exhaustive description of dependencies │
│ and POSIX compliance; a few bash scripts too, when arrays are needed.│
└──┘
I'm sure everyone here has his script collection following his hacker life, so
here's mine and here's the stuff that might interest people:
* archive_*.sh
Wrappers for common operations on archives.

* bwrap.bash, bwrap_auto.bash
Only Bash scripts of the repo, a much saner alternative to Firejail.

* drop_priv.sh
Nugget I found somewhere to avoid sudo/doas dependencies, but still limit
privileged areas in scripts.

* map.sh, filter.sh
Anyone who programs a bit knows what these are.

* hotplug_[u]mount.sh
Since I don't want and udev automounting crap, these are what I use for
the usual USB stick/drive.

* mass_rename.sh, rename.sh (should find better names)
Command and file component based renaming. Simple, but the dry run 
feature
is really useful.

* tabulate.sh
Exactly what it says on the tin, a configurable AWK based table viewer.

* web_man.sh (see .web_man.conf too)
A man that fetches from the web to get the same page for different OSes.
Used mainly to check the portability of certain tools/options.

* util.sh
My enormous collection of sh functions. I suggest you check these:
https://git.sr.ht/~q3cpma/scripts/tree/master/item/util.sh#L29
Pure sh fallback for readlink -f

https://git.sr.ht/~q3cpma/scripts/tree/master/item/util.sh#L406
Portable head -n#

https://git.sr.ht/~q3cpma/scripts/tree/master/item/util.sh#L413
Help text formatting use for all my script help 
messages.

https://git.sr.ht/~q3cpma/scripts/tree/master/item/util.sh#L479
Portable "rand"

https://git.sr.ht/~q3cpma/scripts/tree/master/item/util.sh#L591
Like atexit(3), allows the stacking trap '...' EXIT

https://git.sr.ht/~q3cpma/scripts/tree/master/item/util.sh#L600
Portable and simple mktemp [-d] fallback, which is just 
a C
program built built and aliased if needed

And this one from another repo: 
https://git.sr.ht/~q3cpma/posix-build/tree/master/item/build_util.sh#L1300
Add a C preprocessor directive of the form `#embed_as_string "path" var`
which creates a `static const char var[]` containing the content of 
path as
a proper C string literal

If that file isn't fully ASCII, it is then considered