Hi Chapel Developers -
I'd like to introduce the Cache for Remote Data, so that we can proceed with
code review and with contributing it as a disabled-by-default component of the
upcoming release. In particular, I'd like to know who might be willing to review it...
Description below.
Thanks,
-michael
== Introduction ==
It's easy to write Chapel programs that create many small messages. For example:
var A: [1..100] int;
// ...
on Locales[1] {
  for a in A {
    // use a
  }
}
That bit of code might, for example, be using the vector in a distributed
matrix-vector multiply. Or maybe it's just some poorly written code. Either
way, it results in 100 8-byte GETs from Locale 1 to Locale 0.
One way around this problem would be to create a local copy of A, enabling the
Chapel compiler's ability to optimize whole-array assignment (see the sketch
after the list below). While this approach does reduce the number of messages,
and is probably the right idea in a library context, it has a few drawbacks:
- the obvious solution might use too much memory (imagine that A is large and
distributed; we can't just copy the whole thing).
- a tiled/blocked solution adds complexity.
- if many such local copies are used throughout a program, we might end
up with too much memory used for buffering, especially on large systems.
- as a communication optimization that would need to be located within the
code using A, it diminishes the Chapel promise of separating algorithm
from data distribution.
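Here is a minimal sketch of that local-copy workaround, using the 100-element
example from above (the loop body is still just a placeholder); the whole-array
assignment is a candidate for a single bulk GET instead of many small ones:

var A: [1..100] int;
// ...
on Locales[1] {
  var localA: [1..100] int;
  localA = A;          // whole-array assignment: ideally one aggregated GET
  for a in localA {
    // use a - every read is now local
  }
}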
== Caching Remote Data ==
The basic idea here is to add a new runtime component that caches remote data.
This cache allows for:
- data re-use without adding communication
- automatic prefetching triggered by sequential access
- automatic write-behind for puts
- user- (or compiler-) requested prefetching similar to __builtin_prefetch
- bounded communication buffers when the cache is used
Philosophically, the cache needs to exist on each locale, but in the
implementation (as we'll discuss further below) there is one cache per pthread.
What about memory consistency? The natural and key question one should ask
about any cache is how it ensures that the program still runs in a sane manner.
More specifically, you might want to know the answers to these three questions:
1) Do non-parallel programs still work as they would without the cache?
2) How can tasks on two locales ever share data?
3) Does the cache introduce a lot of memory consistency traffic?
The key idea in the design of this cache is to avoid all memory consistency
traffic. Instead, all data in the cache is considered invalid once the memory
consistency model requires us to read new values. But more on that in a second;
let's discuss the answers to the first two questions.
For non-parallel programs (or program regions), reads and writes happen through
the cache. Naturally, a written value that is then read needs to return the
value written (and not some other value). Then, at some point, the written
value needs to end up in a PUT to the destination locale. The main wrinkle here
for Chapel is that 'on' statements have specific meaning within the cache. When
we begin an 'on' statement, we must invalidate everything in the cache (since a
sequential program could have written values there in another locale before the
'on' statement began). When we complete an 'on' statement, we must flush any
pending PUTs in the cache, since the locale we return to might rely on the
values being written there.
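As a small example of the sequential case these rules protect (x is just an
illustrative variable):

var x: int;           // allocated on Locale 0
on Locales[1] {
  x = 1;              // this PUT may sit in Locale 1's cache for a while
}                     // completing the 'on' flushes pending PUTs to Locale 0
x += 1;               // the original task must see x == 1 here
on Locales[1] {       // beginning an 'on' invalidates Locale 1's cache, so
  writeln(x);         // this GET re-fetches x and prints 2
}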
How do tasks share data? The short answer is - they synchronize. Suppose one
task wants to write a value to memory, then ask the second task to read the
newly written value. As with processor-cache based memory consistency, the two
tasks need to do the following in order to orchestrate this:
x = 0 at program start.

task A, producer:
  x = 43
  // write behind x?
  release barrier
  notify task B

task B, consumer:
  // prefetch x?
  wait for notification
  acquire barrier
  read x and see x == 43
** Note -- We will use the words 'barrier,' 'memory barrier,' and 'fence'
interchangeably. Since we're talking about a cache and memory consistency, it
should be clear that we are NOT talking about the distributed computing barrier
in which tasks wait for all tasks to arrive at the same point.
What do 'acquire' and 'release' mean? These terms are from the C++11 and C11
memory model. 'acquire' means that any reads used after the barrier cannot be
started before the barrier. The problem that the acquire barrier is addressing
is that we might have read ahead some data some time ago (either with
prefetching or just cache reuse) that is no longer usable. In the example, we
have highlighted this situation with the 'prefetch x?' comment in task B. In
the fictional, ideal world, all of our GETs are prefetched into the cache to
cover network latency. But suppose that the prefetch happened before task A
completed its write to x. Then task B cannot use the value for x in the cache -
if it did, it would read a stale value. The acquire barrier indicates that
the old value for x (or any other memory location) is not acceptable. For this
cache in particular, an acquire barrier means that any value in the cache is no
longer usable - in other words, an acquire barrier is equivalent to evicting
everything from the cache.
The release barrier is handling the opposite situation. 'release' means that
any writes started before the barrier must complete before the barrier. In the
example, the comment after x = 43 in task A says 'write behind x?'. It would be
nice if we could take our time writing x, so that we can combine it with other
writes or just cover network latency. The problem is - if we are able to put
off completing the writing of x until after task B is notified, it could read
the old value. So, the 'release' barrier indicates that any writes performed by
the program need to be totally completed.
Note that 'acquire' and 'release' memory barriers are necessary for
multithreaded programming *on a single locale*. So the memory semantics that
programs rely on *already* are sufficient to enable the cache of remote data.
Note also that many operations in Chapel come with acquire and/or release
barriers. At the moment, sync variables and atomic operations have full barrier
semantics (meaning that they imply an acquire and a release barrier). So in a
real Chapel program, the required barriers would probably be hidden in whatever
technique was used to 'notify' or 'wait'. Atomic operations on atomic-type
variables can specify the required memory semantics (e.g., you could do
myatomic.fetchAdd(1, memory_order_acquire) ).
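To make the earlier producer/consumer example concrete, here is a sketch in
Chapel (the variable names are invented for illustration). Since default atomic
operations imply full barriers, the release and acquire are hidden inside the
write and the waitFor:

var x: int;                // lives on Locale 0
var ready: atomic bool;

begin on Locales[1] {      // task A, the producer
  x = 43;                  // this PUT may be buffered (write-behind)
  ready.write(true);       // full barrier: includes the release, so the PUT
}                          // must complete before task B can see 'ready'

// task B, the consumer (here, the original task)
ready.waitFor(true);       // full barrier: includes the acquire
writeln(x);                // guaranteed to read 43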
== Implementation Notes ==
The cache itself is a 2Q cache (because this kind of cache is reported to have
better efficiency than a plain LRU). Besides being in a queue of one sort or
another, entries are also stored in a 'pointer tree' which is a two-level
'hashtable' where the hash function just selects different portions of the
remote address. The pointer tree uses separate chaining (i.e., each hash table
element is actually a linked list of the entries that hash to that bucket).
The cache consists of 'cache entries', one per 'cache page'. A 'cache page' is
1024 bytes in the current implementation. The pointer tree and the 2Q queues
consist of cache entries, each of which may point to a 1024-byte cache page.
However, a GET is always rounded up to an entire 'cache line'. A cache line is
currently 64 bytes. Each cache entry tracks which cache lines are valid (i.e.,
for which cache lines in the cache page have we done a GET?) and, for pages
that have been written to by a PUT - aka 'dirty pages' - which bytes in the
page have been written.
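To make that bookkeeping concrete, here is a hypothetical sketch of what one
cache entry tracks, written as a Chapel record for readability even though the
actual runtime code is C (all names are invented; the sizes are the ones quoted
above):

param cachePageSize = 1024;                 // bytes per cache page
param cacheLineSize = 64;                   // bytes per cache line
param linesPerPage = cachePageSize / cacheLineSize;  // 16 lines per page

record cacheEntry {
  var remoteLocale: int;                    // which locale owns the page
  var remoteBaseAddr: uint(64);             // remote address of the page
  var validLines: [0..#linesPerPage] bool;  // lines we have already GET'd
  var dirtyBytes: [0..#cachePageSize] bool; // bytes written by PUTs (kept in a
                                            // separately allocated 'dirty
                                            // entry' in the real design)
  var data: [0..#cachePageSize] uint(8);    // the cached page contents
}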
There is a tradeoff in the cache line size and in the cache page size:
- smaller cache lines might mean lower-latency GETs
- larger cache lines might mean fewer GETs and more cache hits
- smaller cache pages mean less wasted space (imagine putting 1 byte per page)
- larger cache pages mean larger aggregated operations
(since read ahead/write behind will only ever get a single cache page
in one operation).
We chose 64 bytes for the cache line size because, on our InfiniBand network,
64 bytes is the largest request size that shows no significant increase in
latency over an 8-byte request. We chose 1024 bytes for the cache page size
because it is the smallest request size that achieves close to peak bandwidth
on our network.
When processing a GET, we first check to see if the requested cache page is
in the pointer tree. If not, we find an unused cache page and immediately start
a nonblocking get into the appropriate portion of that page. While the get is
ongoing, we adjust the pointer tree and the 2Q queues in order to add a new
cache entry. If we did find the requested cache page in the pointer tree, we
check to see if the requested data is in the page's dirty bits or valid lines.
If so, we copy out the cached data. If not, we wait for any pending operations
that would overlap with the get we are about to start and then we start a
nonblocking get and mark the appropriate lines as valid. The implementation
finishes up a GET operation by waiting for the nonblocking get we started
(unless we are doing a prefetch, in which case it is not necessary to wait),
then by copying the newly gotten data out of the cache and into the request
buffer.
We would like GETs to adjacent memory to trigger readahead, but that is left
for future work. At the moment, we have only the benefit that GETs to adjacent
memory request a whole cache line (64 bytes) at a time instead of (typically)
8 bytes at a time - for the 100-element example above, that is roughly 13
64-byte GETs rather than 100 8-byte GETs.
When processing a PUT, we similarly check for the requested cache page in the
pointer tree and use an unused page if not. We find an unused 'dirty entry' to
track the dirty bits of the cache page if the cache entry does not already have
one. If we are working with an existing cache entry, we wait for any operation
that would overlap with the region of the cache page we are about to overwrite.
We copy the data into the appropriate part of the cache page and mark the
appropriate bits in the dirty entry.
On a release barrier or when we have too many 'dirty entries', we go through
dirty pages and create and start PUTs for each contiguous section with the dirty
bits set. In this manner, PUTs to adjacent memory locations are aggregated.
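For example, a loop like the following variant of the introductory example
issues 100 small PUTs at the language level, but the write-behind machinery
can merge the contiguous dirty bytes into a handful of page-sized PUTs:

var A: [1..100] int;       // lives on Locale 0
on Locales[1] {
  for i in 1..100 do
    A[i] = i;              // each 8-byte PUT just marks dirty bytes in the cache
}                          // completing the 'on' flushes the aggregated PUTs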
Note that it took significant effort to implement this cache efficiently
enough. The implementation we are presenting here is the 5th design we tried.
We would like to illustrate why it was a challenge to keep the overhead low.
For an Infiniband network, the latency for a small operation is on the order of
2 microseconds. That might sound like a lot, but it's fairly easy to take 2
microseconds in software (these are performance estimates):
- 1 malloc call
- 20 processor synchronizations
- 20 cache lines loaded from RAM
- 2000 instructions executed.
These numbers influenced our design:
- we start a nonblocking GET as soon as we know the page is not in the cache
- we pre-allocate all memory used by the cache
- we create 1 cache per pthread (to avoid processor synchronization)
- we did our best to minimize the amount of memory read/written when
processing a put or a get.
Here's how pending requests and acquire barriers are handled. We assign each
operation a sequence number using a thread-local counter. We record minimum and
maximum sequence numbers in cache entries. Each task has in task-local storage
the sequence number of the last acquire barrier. If a task performs a GET but
finds a cache entry with a minimum sequence number before its last acquire
barrier, it must invalidate that cache line and do a new GET.
Lastly, since the implementation uses thread-local storage for the cache, it
requires that tasks not move between threads. Tasks could move between threads
if we had a way to notify the cache that they were about to do so (in which
case the cache would issue a release barrier in the old thread and an acquire
barrier in the new thread). Another alternative would be to mark tasks with
ongoing operations in the cache as not movable to other threads. A third option
would be to only move tasks between threads in situations that create full
barriers anyway; notably a full barrier occurs on task start and sync variable
use.
== Testing ==
Our implementation passes all normally passing GASNet/UDP tests.
== Performance ==
We compared performance of PTRANS, LULESH, MiniMD, and SSCA2 benchmarks with
and without the cache using GASNet and the ibv conduit with 4 nodes.
PTRANS: about 2x faster with the cache
LULESH: about 3x faster with the cache (for sedov15oct.lmesh)
MiniMD: about 2x faster with the cache
SSCA2: kernel 1 is about 33% slower with the cache
SSCA2: kernel 2 is about 4x faster with the cache
SSCA2: kernel 3 is about 2% faster with the cache
SSCA2: kernel 4 is about 30% faster with the cache