What you are asking for is called an "out-of-core" matrix-multiplication algorithm. If you search for that term, you'll find several papers on the topic. The key, as with in-memory matrix operations, is to maximize memory locality.
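To make the locality point concrete, here is a minimal in-memory sketch of the blocked (tiled) multiplication that out-of-core algorithms build on. The function name `blocked_matmul` and the tile size `bs` are just illustrative; a real out-of-core version would load each tile from disk instead of indexing an in-memory array:

```julia
# Blocked matrix multiplication: work on tiles small enough to stay in fast
# memory. Out-of-core algorithms use the same structure, but each tile would
# be read from disk rather than sliced from an in-memory array.
function blocked_matmul(A::Matrix{Float64}, B::Matrix{Float64}; bs::Int = 64)
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(m, n)
    for jj in 1:bs:n, kk in 1:bs:k, ii in 1:bs:m        # loop over tiles
        # within each tile, keep i innermost: column-major friendly for C and A
        for j in jj:min(jj + bs - 1, n), p in kk:min(kk + bs - 1, k)
            @inbounds for i in ii:min(ii + bs - 1, m)
                C[i, j] += A[i, p] * B[p, j]
            end
        end
    end
    return C
end
```

The tile size would be chosen so that three tiles (one each of A, B, and C) fit in whatever level of the memory hierarchy you are optimizing for.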
However, there doesn't seem to be much free code available for this problem. My suspicion is that it is not so popular because:

1. If your problem doesn't fit in memory, these days the first choice is usually a distributed-memory cluster rather than hitting the disk.
2. The cubic scaling of matrix-matrix products means that you can't increase the size much anyway. 1000x more processing power only buys you a 10x larger matrix dimension, since (10n)³ = 1000n³ (assuming dense matrices).

PS. HDF5 lets you specify how an array is "chunked" for storage, and lets you read back subsets of an array at a time; I believe all of this functionality is exposed in HDF5.jl. Possibly it would be more efficient to memory-map the array and let the OS deal with paging it in and out of memory. Note that Julia matrices are stored in column-major format (contiguous columns stored one after another), so you'll want to access the matrix column by column in your matrix-vector products.
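For the chunked-storage route, reading column slices of a chunked dataset looks roughly like this with HDF5.jl (a sketch, assuming the HDF5.jl package is installed; the dataset name `"A"`, the function name `chunked_matvec`, and the block size `bs` are all illustrative):

```julia
using HDF5

# Matrix-vector product that reads one block of columns at a time from a
# chunked HDF5 dataset, so only a slice of the matrix is ever in memory.
function chunked_matvec(path::AbstractString, x::Vector{Float64}; bs::Int = 256)
    h5open(path, "r") do f
        d = f["A"]                        # dataset name "A" is an assumption
        m, n = size(d)
        length(x) == n || throw(DimensionMismatch("x must have length n"))
        y = zeros(m)
        for j0 in 1:bs:n
            j1 = min(j0 + bs - 1, n)
            block = d[:, j0:j1]           # reads only these columns from disk
            y .+= block * x[j0:j1]
        end
        return y
    end
end
```

Writing the dataset with column-block chunks, so that each read touches whole chunks, would look something like `h5open(path, "w") do f; f["A", chunk = (m, bs)] = A; end`.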

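The memory-mapping alternative can be sketched with Julia's `Mmap` standard library. Here `mmap_matvec` is an illustrative name, and the matrix is assumed to be stored as raw `Float64` values in Julia's native column-major order:

```julia
using Mmap

# Memory-map a column-major binary file as a Matrix and compute a
# matrix-vector product column by column, letting the OS handle paging.
# Iterating over columns touches the file sequentially.
function mmap_matvec(path::AbstractString, m::Int, n::Int, x::Vector{Float64})
    length(x) == n || throw(DimensionMismatch("x must have length n"))
    open(path, "r") do io
        A = Mmap.mmap(io, Matrix{Float64}, (m, n))   # no copy; backed by the file
        y = zeros(m)
        for j in 1:n                                  # column-by-column access
            xj = x[j]
            @inbounds for i in 1:m
                y[i] += A[i, j] * xj
            end
        end
        return y
    end
end
```

`write(io, A)` on a `Matrix{Float64}` emits exactly this layout, so a file produced that way can be mapped back directly.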