JosiahWI opened a new pull request, #13170:
URL: https://github.com/apache/trafficserver/pull/13170
The PR title is fairly self-explanatory, but the design choices here deserve
explicit mention.
* The head_p type is no longer a union with a {pointer, version} field
This has been done to eliminate type punning, which was done all over the
implementation, and is UB. The pointer and version are now always set through
the macros `FREELIST_POINTER`, `FREELIST_VERSION`, and
`SET_FREELIST_POINTER_VERSION`, which use a separate {pointer, version} struct
type and `memcpy` on platforms where this is appropriate (see preprocessor defs
for the list).
* The head_p type has been changed into a type alias of the data type
This is necessary so that the head of the list will be an atomic integer
type instead of an atomic class type to be sure it can use 128bit atomic
hardware instructions on platforms that support them.
* Freelist alignment is now adjusted to be satisfy `head_p` alignment
requirements
This is actually a bug in master. There is no issue created for it. It
happens to work to allocate `head_p` objects that are underaligned on our
supported platforms, but it is UB and could very realistically fail... to drive
this point home, the included unit tests segfault with the original
implementation for this exact reason (they pass if the alignment is adjusted).
* The freelist and atomiclist pop operations (called freelist_new for the
freelist) are now locked to provide mutual exclusion
This one is annoying and might need to be reverted from this PR and
considered separately. This particular change is necessary to fix #11640 -
there is a minor data race in master in that the second pointer from the list
head can be overwritten by an allocator's placement new before it is read
without synchronization in `freelist_new` by another thread, which is going to
subsequently find out the list head is stale and retry. Thus, the garbage
pointer is not dereferenced, but this is still UB.
I have been thinking about approaches here. One approach is to add an atomic
flag to the list head that is set by any thread popping from the list. A thread
that has successfully popped can spin on that flag to wait for the completion
of any other threads still reading the memory it is about to return. My
intuition is that this will be better, but I don't know without benchmarking,
and it's a lot more complex than the lock.
Fixes #5398
Fixes #11640
## Previous Work
See #7382. This PR is only a step in the direction of #7382; it retains a
lot of the old code structure along with most of its design flaws. If this
change is accepted, it should thereafter be possible to apply other design
improvements from #7382, such as the fleshed out versioned pointer type, with
greater confidence.
## A Few Comments About Assertions
This PR adds a hoard of assertions that check alignment requirements. Most
of them are debug assertions - the alignment check on the pointer passed to
`freelist_push` is a release assert for now, because it would almost certainly
indicate a major issue if it triggered. According to the comment from
@bryancall, this assertion was in fact failing before (it was previously a
debug assert, and he commented it out). I am hopeful that that issue is now
resolved.
I have commented out the assertions in atomiclist, because the alignment
requirements for atomiclist are established elsewhere in ATS - and established
incorrectly. The head_p objects are misaligned all over the place. That is bad.
Unfortunately, it's not trivial to fix, because the head_p objects have to be
aligned at an offset from the base pointer, which *also* has to be aligned, but
for a different object type (e.g. `Event`).
## Performance Implications
I ran the following benchmarks in WSL running on an AMD Ryzen 5 5500U
processor.
### Freelist - Single Threaded Release Performance - Before
```
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
Single threaded alloc 100 587 2.348
ms
36.2374 ns 35.5824 ns 37.6799
ns
4.69932 ns 2.48233 ns 9.03906
ns
Single threaded free 100 1412 2.2592
ms
15.2373 ns 14.9109 ns 16.1355
ns
2.52519 ns 1.00718 ns 5.32402
ns
```
### Freelist - Single Threaded Release Performance - After
Notice that the allocation takes a performance hit because of the added lock.
```
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
Single threaded alloc 100 600 2.34
ms
40.3439 ns 38.0766 ns 46.6134
ns
17.5734 ns 7.42366 ns 35.7456
ns
Single threaded free 100 346 2.3528
ms
12.0705 ns 11.37 ns 15.4749
ns
6.78033 ns 0.198969 ns 16.1671
ns
```
### Freelist - Single Threaded Release Performance - After Without Lock
This represents the potential performance of allocation without the mutex
guarding the allocation routine. This case represents similar behavior to the
original code, which did not have the lock, either.
```
benchmark name samples iterations est run time
mean low mean high mean
std dev low std dev high std dev
-------------------------------------------------------------------------------
Single threaded alloc 100 672 2.352
ms
30.9758 ns 29.5555 ns 36.3515
ns
12.3705 ns 2.25337 ns 28.5114
ns
Single threaded free 100 489 2.3472
ms
13.0952 ns 11.6199 ns 16.8676
ns
10.5817 ns 2.0172 ns 19.303
ns
```
### Multithreaded Performance (Contention)
I don't know the best way to test this. I would appreciate input from
@bryancall or anyone who knows how to benchmark overall ATS performance. I'm
concerned that the added lock will have unacceptable performance impacts, but I
don't know how I should confirm this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]