On Thu, 2023-09-21 at 12:13 +0200, Tobias Hagelborn wrote:
> Findings from using Hash Equivalence server at scale:
> =====================================================
> 
> We have now used the OE Hash Equivalence server at scale in the C/I chain at
> our company. This has given some insight into what works and what can be
> improved when running this service in full production.
> 
> Some stats from our run:
> ------------------------
> 
> ### Hash Equivalence server:
> * ~30M reqs/day
> * ~500K new entries/day
> * 2-3K reqs/s
> * Data growth ~10GiB/day (with dbg info)
> 
> ### C/I Builds:
> * ~20K builds/day
> * 15-20K tasks/build
> * 140K sstate misses/day.

These are really interesting numbers, thanks for sharing!

> The good:
> ---------
> * It works! It finds reusable sstate tasks (in all of our builds)
> * It is very valuable in recipes that invalidate
>   almost any other package. (Examples: glib, openssl)
> * Hashserve scales to the need even if only running one instance
>   - Tested for 12K req/s which was sufficient for our needs
> * Robust enough (if no external cleanup takes place)
>   - _No_ non-recoverable crashes during 300M requests served
> 
> The client side (BitBake):
> --------------------------
> We have added a sanity check that disables the use of a remote HE server and
> switches to a local one if the remote HE server cannot be connected. This is
> done from the "ConfigParsed" event. The reason for this is to avoid builds
> hanging in case the remote HE server is not responding.

I'd have to see what that patch looked like but I get nervous about
builds changing from the user's configuration without the user making
that change. If it shows a suitable warning it might be ok.

ConfigParsed is a horrible place to do such a thing too. I know why
you've done that but it would probably be better in the hashserve
client code itself from an upstream perspective if you plan to submit
it.
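
To illustrate the kind of check being discussed, here is a minimal sketch,
not the patch mentioned above and not existing hashserve client code. The
function name pick_hashserve and the two second timeout are assumptions made
for the example. It probes the remote server with a bounded TCP connect and,
if that fails, logs a visible warning and returns "auto", the BB_HASHSERVE
value that makes bitbake start a local server.

```python
import logging
import socket

logger = logging.getLogger("hashserve-fallback-sketch")


def pick_hashserve(remote, timeout=2.0):
    """Return `remote` ("host:port") if it accepts a TCP connection.

    Otherwise log a warning and return "auto" so a local server is used.
    Hypothetical helper for illustration only.
    """
    host, _, port = remote.rpartition(":")
    try:
        # Bounded connect so an unresponsive server cannot hang the build.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return remote
    except (OSError, ValueError) as exc:
        # Make the change from the user's configuration visible.
        logger.warning(
            "Hash equivalence server %s unreachable (%s); "
            "falling back to a local server", remote, exc)
        return "auto"
```

A TCP connect only shows that the port is open, not that the server answers
queries, so a real implementation would probably want a protocol-level check
as well.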

> Areas of improvement:
> ---------------------
> 
> ### Data retention:
> There is no built-in data retention. We solve this with a recurring external
> cleanup script. It works, but it also exposed locking in the server, resulting
> in an interlock with the external cleanup and timeouts for the clients.
> This may partly be the nature of SQLite and a single-file database.
> OE Hashserve is not built for cleaning up data and the data growth is high, so
> it has to be handled.
> 
> ### Protocol:
> As our first tested option for deployment was Kubernetes, the absence of the
> de facto standard HTTP(S) protocol required some workarounds.
> Routing, authentication and monitoring support get lost on the way. I would
> suggest that we look into using HTTP(S) + JSON and some Basic Auth as the
> next basis of the protocol.

We did a lot of work to make answers "fast" from the server. To make
12k req/s possible, we couldn't use a high-level protocol like HTTP, let
alone HTTPS or signing. I would therefore caution that if you change it
like this, it will no longer perform anywhere near as fast as you
require and there would be knock-on effects from that.

> ### Security:
> Supply-chain attacks are nowadays something to be aware of. This service uses
> a non-encrypted and non-authenticated protocol and is thus open to
> man-in-the-middle attacks, or any type of fake data and manipulation. The
> protocol changes suggested above could mitigate some of this. Additionally,
> one thought is for the client to provide a signature of the hash together
> with the hash so that it can be verified by the client using its secret upon
> retrieval.

We've made the assumption that the write access ports would be under
some kind of network control. For public services like the one the
project shares, there is a risk of man-in-the-middle attacks, but it
would only really be exploitable if you can change the sstate being
accessed too, and if you can do that, you have already breached
security.

I suspect adding signing is going to affect speed a lot too
unfortunately. We probably do need to think about what to do about all
this but it isn't straightforward.
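
For reference, the client-side signing idea from the quoted text could look
roughly like the sketch below. It is an assumption-laden illustration, not
part of the hashserve protocol: the function names, the choice of
HMAC-SHA256 and the "taskhash:unihash" message layout are all made up for
the example. The client computes a keyed digest over the pair it reports
and checks it again on retrieval with the same secret.

```python
import hashlib
import hmac


def sign_unihash(secret: bytes, taskhash: str, unihash: str) -> str:
    """Return a hex HMAC-SHA256 over the (taskhash, unihash) pair.

    Hypothetical helper; the message layout is an assumption.
    """
    msg = f"{taskhash}:{unihash}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify_unihash(secret: bytes, taskhash: str, unihash: str,
                   signature: str) -> bool:
    """Check a signature returned alongside a unihash."""
    expected = sign_unihash(secret, taskhash, unihash)
    # Constant-time comparison avoids leaking how many characters match.
    return hmac.compare_digest(expected, signature)
```

This adds one keyed hash per report and per lookup on the client, plus a
longer record on the wire and in the database, so the cost would need
measuring against the request rates above.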

> ### External database connection
> An option for an external DB (for example PostgreSQL) would improve the
> possibility for concurrent cleanup and vacuum while running.
> In the stateless world of cloud/containers/pods, an option for external DB
> would be favorable.
> This would probably make use of some external package for DB interaction, so
> it would differ a little from the standard Python-only nature of BitBake in
> general.
> 
> ### Nice to have
> - Minor changes that should be more easily added as patches
> 
> * Hash (LRU) cache. From our stats, there seems to be a 1:20 ratio of writes
>   to reads from the DB, so a cache might save some resources.
> 
> Conclusions
> ===========
> Hashserve is FOSS and if we want improvements, we have to contribute.
> I will investigate to what extent we can chip in on some of these parts.
> However, especially for the protocol and maybe also for external Python
> dependencies via PyPI, it would be nice to know whether these are acceptable
> and wished-for changes before starting out.

I know Joshua has had plans for a different version of hashserve using
more scalable technology which we can't include directly in bitbake due
to dependency issues. We had planned to keep the protocol simple and
minimal so other server implementations were possible.

We haven't really thought about supporting multiple client protocols.
It would be possible if there are compelling reasons, but it obviously
would increase our complexity a lot and it would be hard to ensure
everything stays working with suitable testing.

Cheers,

Richard

