[OE-core] Findings from using Hash Equivalence server

Tobias Hagelborn Thu, 21 Sep 2023 03:16:59 -0700

Findings from using Hash Equivalence server at scale:
=====================================================


We have now used the OE Hash Equivalence server at scale in the C/I chain at
our company. This has given some insight into what works and what can be
improved when running this service in full production.

Some stats from our run:
------------------------

### Hash Equivalence server:
* ~30M reqs/day
* ~500K new entries/day
* 2-3K reqs/s
* Data growth ~10GiB/day (with dbg info)

### C/I Builds:
* ~20K builds/day
* 15-20K tasks/build
* 140K sstate misses/day.

The good:
---------
* It works! It finds reusable sstate tasks (in all of our builds)
* It is very valuable in recipes that invalidate
  almost any other package. (Examples: glib, openssl)
* Hashserve scales to the need even if only running one instance
  - Tested for 12K req/s which was sufficient for our needs
* Robust enough (if no external cleanup takes place)
  - _No_ non-recoverable crashes during 300M requests served

The client side (BitBake):
--------------------------
We have added a sanity check that disables the use of the remote HE server and
switches to a local one if a connection to the remote HE server cannot be
established. This is done from the "ConfigParsed" event. The reason is to
avoid builds hanging when the remote HE server is not responding.
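A minimal sketch of such a reachability check, assuming the remote server is
given as plain "host:port" and that the local server is selected by setting
BB_HASHSERVE to "auto". The function name pick_hashserve and the timeout are
illustrative only; this is not our actual implementation:

```python
import socket

def pick_hashserve(remote, timeout=2.0, fallback="auto"):
    """Return `remote` ("host:port") if a TCP connect succeeds, else `fallback`.

    Hypothetical helper: only plain host:port addresses are handled here.
    """
    host, _, port = remote.rpartition(":")
    if not host:
        # No "host:port" form (for example already "auto"): use the fallback.
        return fallback
    try:
        # A short timeout keeps the parse step from hanging on a dead server.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return remote
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure or a non-numeric port.
        return fallback
```

In a handler for the ConfigParsed event, the result could be written back with
something like d.setVar("BB_HASHSERVE", pick_hashserve(d.getVar("BB_HASHSERVE"))).
Note that a successful TCP connect only shows that the port accepts
connections, not that the server answers requests in time.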

Areas of improvement:
---------------------

### Data retention:
There is no built-in data retention. We solve this with a recurring external
cleanup script. It works, but it also exposed locking in the server, resulting
in lock contention with the external cleanup and timeouts for the clients.
This may partly be down to the nature of SQLite and a single-file database.
OE Hashserve is not built for cleaning up data, and the data growth is high,
so this has to be handled.
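One way to shorten the time an external cleanup holds the SQLite write lock is
to delete in small batches and commit after each one. The sketch below
illustrates that idea only; the table name, the timestamp column and the batch
size are hypothetical and do not reflect the actual hashserv schema or our
script:

```python
import sqlite3
import time

def prune_old_rows(conn, table, ts_column, cutoff, batch=1000, pause=0.05):
    """Delete rows older than `cutoff` in batches; return the number deleted.

    `table` and `ts_column` are hypothetical names supplied by the caller.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {ts_column} < ? LIMIT ?)",
            (cutoff, batch),
        )
        # Commit per batch so the write lock is released between batches.
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch:
            return total
        # Give other writers (the server) a chance to take the lock.
        time.sleep(pause)
```

Smaller batches mean shorter lock hold times at the cost of a longer overall
cleanup run.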

### Protocol:
As our first tested option for deployment was Kubernetes, the absence of the
de-facto standard HTTP(S) protocol required some workarounds. Routing,
authentication and monitoring support get lost on the way. I would suggest
that we look into using HTTP(S) + JSON and some Basic Auth as the next basis
of the protocol.

### Security:
Supply-chain attacks are nowadays something to be aware of. This service uses
an unencrypted and unauthenticated protocol and is thus open to
man-in-the-middle attacks, or any type of fake data and manipulation. The
protocol changes suggested above could mitigate some of this. Additionally,
one thought is for the client to provide a signature of the hash together with
the hash, so that it can be verified by the client using its secret upon
retrieval.
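The signature idea could look roughly like the following, assuming a secret
shared between the trusted build clients and HMAC-SHA256 as the signature
scheme. Both are assumptions made for illustration, as are the function names;
nothing like this exists in the current protocol:

```python
import hashlib
import hmac

def sign_unihash(secret: bytes, unihash: str) -> str:
    """Return a hex HMAC-SHA256 signature to store alongside the hash."""
    return hmac.new(secret, unihash.encode("ascii"), hashlib.sha256).hexdigest()

def verify_unihash(secret: bytes, unihash: str, signature: str) -> bool:
    """Check on retrieval that the hash was stored by a holder of `secret`."""
    expected = sign_unihash(secret, unihash)
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected, signature)
```

A client would send the signature with each reported hash and discard any
retrieved entry for which verify_unihash() returns False.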

### External database connection
An option for an external DB (for example PostgreSQL) would improve the
possibility for concurrent cleanup and vacuum while running.
In the stateless world of cloud/containers/pods, an option for an external DB
would be favorable.
This would probably require some external package for DB interaction, so it
would differ a little from the standard Python-only nature of BitBake in
general.

### Nice to have
- Minor changes that should be more easily added as patches

* Hash (LRU) cache. From our stats, there seems to be a 1:20 ratio of writes
  to reads from the DB, so a cache might save some resources.
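As a rough illustration of the cache idea, a read path could be wrapped with
Python's functools.lru_cache. The wrapper name, the lookup signature and the
cache size below are hypothetical and not part of hashserv:

```python
from functools import lru_cache

def make_cached_lookup(db_lookup, maxsize=100_000):
    """Wrap a (method, taskhash) -> unihash lookup in an in-memory LRU cache.

    `db_lookup` stands in for whatever function queries the database.
    """
    @lru_cache(maxsize=maxsize)
    def cached(method, taskhash):
        return db_lookup(method, taskhash)
    return cached
```

This naive version also caches misses and never invalidates, so a real
implementation would have to clear or update entries whenever a new
equivalence is written (cached.cache_clear() being the bluntest option).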

Conclusions
===========
Hashserve is FOSS and if we want improvements, we have to contribute.
I will investigate to what extent we can chip in on some of these parts.
However, especially for the protocol and maybe also for external Python
dependencies via PyPI, it would be nice to know whether such changes are
acceptable and wanted before starting out.
