BTW, have you considered a synergetic implementation, which would work as follows? Arrange an intermediate buffer followed by a non-accessible page [commonly done with an anonymous mmap of two pages followed by mprotect(PROT_NONE) on the second page]. Upon *_init we call software SHA*_Init. Then all short inputs go directly through software SHA*_Update, while everything larger than a certain value, say 256 bytes, is treated as follows. The input stream is first "purged/aligned" by running a single pass of SHA*_Update till SHA*_CTX->data is full. Then the available 64-byte chunks are copied to the *bottom* of the first page mentioned above. Then we set up a SEGV signal handler, let the hardware suffer a page fault, and collect the intermediate hash values. The procedure is repeated if more than pagesize was available at a time. SHA*_CTX->Nl,Nh are adjusted accordingly and the remaining bytes [if any] are fed again to software SHA*_Update. Upon *_final we just call *software* SHA*_Final. A.
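
To make the buffer arrangement concrete, here is a rough sketch of the staging part only, assuming a POSIX system that provides MAP_ANONYMOUS. The names guard_buf, guard_buf_init, guard_buf_stage and guard_buf_free are hypothetical, and the sketch assumes "bottom" means that the copied chunks end flush against the inaccessible page. The SEGV handler, the hardware invocation and the Nl/Nh bookkeeping are deliberately left out.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: one writable page followed by a PROT_NONE page. */
typedef struct {
    unsigned char *base;     /* start of the writable page */
    size_t pagesize;         /* size of each of the two pages */
} guard_buf;

/* Map two anonymous pages and make the second one inaccessible. */
static int guard_buf_init(guard_buf *gb)
{
    long ps = sysconf(_SC_PAGESIZE);
    void *p;

    if (ps <= 0)
        return -1;
    p = mmap(NULL, 2 * (size_t)ps, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (mprotect((unsigned char *)p + ps, (size_t)ps, PROT_NONE) != 0) {
        munmap(p, 2 * (size_t)ps);
        return -1;
    }
    gb->base = p;
    gb->pagesize = (size_t)ps;
    return 0;
}

/*
 * Copy as many whole 64-byte blocks as fit into one page, placed so
 * that the last copied byte sits right before the inaccessible page.
 * Returns the start of the staged data; *staged receives its length.
 * Any tail shorter than 64 bytes is left for the software path.
 */
static unsigned char *guard_buf_stage(guard_buf *gb, const unsigned char *in,
                                      size_t len, size_t *staged)
{
    size_t n = len & ~(size_t)63;    /* whole 64-byte blocks only */
    unsigned char *dst;

    if (n > gb->pagesize)
        n = gb->pagesize & ~(size_t)63;
    assert(n <= gb->pagesize);
    dst = gb->base + gb->pagesize - n;
    memcpy(dst, in, n);
    *staged = n;
    return dst;
}

/* Release both pages. */
static void guard_buf_free(guard_buf *gb)
{
    munmap(gb->base, 2 * gb->pagesize);
    gb->base = NULL;
}
```

With a 300-byte input, guard_buf_stage would stage 256 bytes ending exactly at the page boundary, and the remaining 44 bytes would go back through software SHA*_Update.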
