Re: PadLock engine SHA1 support
Andy Polyakov wrote:
> Hi,
>
>>> BTW, have you considered synergetic implementation, which would work as follows. Arrange an intermediate buffer followed by non-accessible page [commonly would be done with anonymous mmap of two pages followed by mprotect(PROT_NONE) for the second page]. Upon *_init we call software SHA*_Init. Then all short inputs go directly through software SHA*_Update, while everything that is larger than certain value, say 256 bytes, is treated as follows. Input stream is first purged/aligned by running single pass of SHA*_Update till SHA*_CTX->data is full. Then available 64-byte chunks are copied to the *bottom* of first page mentioned above. Then we set up SEGV signal handler, let hardware suffer from page fault and collect the intermediate hash values. The procedure is repeated if more than pagesize was available at a time. SHA*_CTX->Nl,Nh are adjusted accordingly and remaining bytes [if any] are fed again to software SHA*_Update. Upon *_final we just call *software* SHA*_Final.
>
>> Are you sure it flushes the intermediate results on exception? Well we can try ;-)

Yep it works. Proof of concept at
http://www.logix.cz/michal/devel/padlock/phe_sum.c

It isn't optimized at all, does finalizing in HW so it can be compiled without OpenSSL and only works for files 512MB. But it actually works, which is a good start ;-)

Thanks for the idea Andy!

Michal
__
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager [EMAIL PROTECTED]
Re: PadLock engine SHA1 support
Hi,

>> BTW, have you considered synergetic implementation, which would work as follows. Arrange an intermediate buffer followed by non-accessible page [commonly would be done with anonymous mmap of two pages followed by mprotect(PROT_NONE) for the second page]. Upon *_init we call software SHA*_Init. Then all short inputs go directly through software SHA*_Update, while everything that is larger than certain value, say 256 bytes, is treated as follows. Input stream is first purged/aligned by running single pass of SHA*_Update till SHA*_CTX->data is full. Then available 64-byte chunks are copied to the *bottom* of first page mentioned above. Then we set up SEGV signal handler, let hardware suffer from page fault and collect the intermediate hash values. The procedure is repeated if more than pagesize was available at a time. SHA*_CTX->Nl,Nh are adjusted accordingly and remaining bytes [if any] are fed again to software SHA*_Update. Upon *_final we just call *software* SHA*_Final.
>
> Man that's a wicked idea ;-) Though I'm not sure how xsha would survive restarting after its segfault.

Well, the idea is rather to *not* restart it, but collect intermediate results and terminate it. Then these results are fed to either software or back to hardware as if it's a whole lot of new data, but with init values from previous step. The keyword is also to *never* let hardware do the final padding and final block calculation [which is why it always looks like a whole lot of data to hardware]. That's because hardware never knows correct Nl,Nh values used for final padding, only software does.

> Are you sure it flushes the intermediate results on exception? Well we can try ;-)

Manual says it does. Well, it doesn't say it flushes on SEGV in particular, but at low level processors don't normally distinguish SEGV, page fault or other exception. They just go like "oh! it's *an* exception, I flush, go kernel, call handler." Manual essentially says "I flush upon *an* exception."

> Would such an approach work on all architectures (anonymous and protected pages, sighandlers, ...)?

I don't know, but we can always make it conditionally available on explicitly tested architectures:-) You also have to realize that it also takes extra effort to make such implementation thread-safe. There are basically two options.

1. Allocate pages on per-thread basis [which would require unified API to per-thread storage, something we don't have].
2. Serialize access to hardware [which we have unified API for].

As hardware is faster than network, the second one is a perfectly viable option.

> In the meantime could we go with the old fashioned patches that I sent some time ago? I'll realign them with current CVS head (or 0.9.8 branch).

There were unanswered questions like support for SHA-224, test suite with public record that it passes, EVP_MD_FLAG_ONESHOT... But I don't have time to look into it right now, we have to do it in May or something...

A.
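The guard-page arrangement discussed in this message can be sketched in plain C. This is only an illustration under stated assumptions: it targets Linux/POSIX, the names guarded_run and on_segv are made up for the example, and an ordinary byte-reading loop stands in for the hashing instruction, so nothing here touches PadLock or shows how the real hardware reports its intermediate state. It only shows data placed at the bottom of the first page, a fault on the protected second page, and control being recovered without restarting the faulting code.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;         /* resume point after the fault */
static volatile size_t consumed;     /* bytes read before the fault */
static volatile unsigned char sink;  /* keeps the reads from being optimized out */

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(fault_jmp, 1);        /* abandon the faulting loop, do not restart it */
}

/* Copy len bytes (len <= page size) to the bottom of the first page, then
 * read forward until the protected second page raises SIGSEGV.
 * Returns the number of bytes consumed, or -1 on setup failure. */
long guarded_run(const unsigned char *data, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    struct sigaction sa, old;
    unsigned char *base, *start;

    assert(data != NULL);
    if (ps <= 0 || len > (size_t)ps)
        return -1;
    base = mmap(NULL, 2 * (size_t)ps, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    if (mprotect(base + ps, (size_t)ps, PROT_NONE) != 0) {
        munmap(base, 2 * (size_t)ps);
        return -1;
    }
    start = base + ps - len;         /* data ends exactly at the page boundary */
    memcpy(start, data, len);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(base, 2 * (size_t)ps);
        return -1;
    }

    consumed = 0;
    if (sigsetjmp(fault_jmp, 1) == 0) {
        for (;;) {                   /* stand-in for the hardware reading its input */
            sink = start[consumed];  /* faults when start + consumed reaches page two */
            consumed++;
        }
    }
    /* Reached only through siglongjmp() from the handler. */
    sigaction(SIGSEGV, &old, NULL);
    munmap(base, 2 * (size_t)ps);
    return (long)consumed;
}
```

A process-wide SIGSEGV handler and static jump buffer are exactly why the thread-safety point above matters: a real implementation would need either per-thread pages and state or a lock around the whole sequence.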
Re: PadLock engine SHA1 support
Hi Andy,

I'm sorry for such a late reply ;-) I didn't have the hardware available during the past few months and only got it up and running again recently.

> BTW, have you considered synergetic implementation, which would work as follows. Arrange an intermediate buffer followed by non-accessible page [commonly would be done with anonymous mmap of two pages followed by mprotect(PROT_NONE) for the second page]. Upon *_init we call software SHA*_Init. Then all short inputs go directly through software SHA*_Update, while everything that is larger than certain value, say 256 bytes, is treated as follows. Input stream is first purged/aligned by running single pass of SHA*_Update till SHA*_CTX->data is full. Then available 64-byte chunks are copied to the *bottom* of first page mentioned above. Then we set up SEGV signal handler, let hardware suffer from page fault and collect the intermediate hash values. The procedure is repeated if more than pagesize was available at a time. SHA*_CTX->Nl,Nh are adjusted accordingly and remaining bytes [if any] are fed again to software SHA*_Update. Upon *_final we just call *software* SHA*_Final.
>
> A.

Man that's a wicked idea ;-) Though I'm not sure how xsha would survive restarting after its segfault. Are you sure it flushes the intermediate results on exception? Well we can try ;-)

Would such an approach work on all architectures (anonymous and protected pages, sighandlers, ...)?

In the meantime could we go with the old fashioned patches that I sent some time ago? I'll realign them with current CVS head (or 0.9.8 branch).

Michal
Re: PadLock engine SHA1 support
> Should I add/fix something?

BTW, have you considered synergetic implementation, which would work as follows. Arrange an intermediate buffer followed by non-accessible page [commonly would be done with anonymous mmap of two pages followed by mprotect(PROT_NONE) for the second page]. Upon *_init we call software SHA*_Init. Then all short inputs go directly through software SHA*_Update, while everything that is larger than certain value, say 256 bytes, is treated as follows. Input stream is first purged/aligned by running single pass of SHA*_Update till SHA*_CTX->data is full. Then available 64-byte chunks are copied to the *bottom* of first page mentioned above. Then we set up SEGV signal handler, let hardware suffer from page fault and collect the intermediate hash values. The procedure is repeated if more than pagesize was available at a time. SHA*_CTX->Nl,Nh are adjusted accordingly and remaining bytes [if any] are fed again to software SHA*_Update. Upon *_final we just call *software* SHA*_Final.

A.
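The step "Nl,Nh are adjusted accordingly" amounts to adding the number of bits hashed outside of SHA*_Update to the 64-bit message length that the software context keeps as two 32-bit halves. A minimal sketch, assuming a stand-alone struct (bit_count and bit_count_add are hypothetical names, not OpenSSL API; only the Nl/Nh field names come from the thread):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the two length fields of a SHA context:
 * Nl holds the low 32 bits of the message length in bits, Nh the high 32. */
struct bit_count {
    uint32_t Nl, Nh;
};

/* Account for nbytes that were hashed by something other than SHA*_Update. */
void bit_count_add(struct bit_count *c, size_t nbytes)
{
    uint64_t bits;

    assert(c != NULL);
    bits = ((uint64_t)c->Nh << 32) | c->Nl;
    bits += (uint64_t)nbytes << 3;      /* bytes to bits */
    c->Nl = (uint32_t)bits;             /* low half, carry handled by the 64-bit add */
    c->Nh = (uint32_t)(bits >> 32);
}
```

With the counters kept correct this way, the software final step can still append the right length during padding, whatever hashed the bulk of the data.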
Re: PadLock engine SHA1 support
Andy Polyakov wrote:
>> Could be. But should it be run automatically during make? I guess no...
>
> No, but I'd like to *see* some test program and I'd like to hear explicit statement that the implementation passes this test. As you might recall we have tested AES by encrypting with software and decrypting with engine and then other way around. We need something even this time:-)
>
> A.

FWIW I'm testing it with OpenVPN having OpenSSL-0.9.8+padlock on one end and OpenSSL-0.9.7 without padlock on the other. And of course I'm regularly running those openssl dgst tests with and without engine on some input files. I don't often send untested patches ;-)

Michal Ludvig
--
* Personal homepage: http://www.logix.cz/michal
Re: PadLock engine SHA1 support
>>> Could be. But should it be run automatically during make? I guess no...
>>
>> No, but I'd like to *see* some test program and I'd like to hear explicit statement that the implementation passes this test. As you might recall we have tested AES by encrypting with software and decrypting with engine and then other way around. We need something even this time:-)
>
> FWIW I'm testing it with OpenVPN having OpenSSL-0.9.8+padlock on one end and OpenSSL-0.9.7 without padlock on the other. And of course I'm regularly running those openssl dgst tests with and without engine on some input files. I don't often send untested patches ;-)

I'm not questioning whether or not the patch is tested! I simply would like to see a record, preferably a public one, of a *simple* verification procedure, which *anybody* [with appropriate hardware] could execute at any time and in no time, without having to set up a VPN or similar. That's all.

A.
Re: PadLock engine SHA1 support
Andy,

Ping :-)

Did you have time to look at this patch? Should I add/fix something?

Thanks!

Michal Ludvig
--
* Personal homepage: http://www.logix.cz/michal
Re: PadLock engine SHA1 support
> Ping :-)

(-: Pong

> Did you have time to look at this patch?

No, unfortunately. Are you in hurry? If yes, what's the hurry?

> Should I add/fix something?

Windows support:-) SHA-224 [which differs from SHA-256 only by initial constants and truncated output]. Test programs [extra -e argument perhaps].

BTW, what's the deal with padlock engine in ./config -shared configuration? It doesn't seem to be there...

A.
Re: PadLock engine SHA1 support
Andy Polyakov wrote:
>> Did you have time to look at this patch?
>
> No, unfortunately. Are you in hurry? If yes, what's the hurry?

No I'm not. I just wanted to move forward...

>> Should I add/fix something?
>
> Windows support:-)

Uh, eh, ... after all I don't have a machine to test it on.

> SHA-224 [which differs from SHA-256 only by initial constants and truncated output].

I see. I'll look at it and send you the patch.

> Test programs [extra -e argument perhaps].

... to enable engines? Could be. But should it be run automatically during make? I guess no...

> BTW, what's the deal with padlock engine in ./config -shared configuration? It doesn't seem to be there...
>
> A.

I don't know. I don't know much about the support infrastructure for engines. Any pointers where to look?

Michal Ludvig
--
* Personal homepage: http://www.logix.cz/michal
Re: PadLock engine SHA1 support
>> Test programs [extra -e argument perhaps].
>
> ... to enable engines?

Yes. On the other hand I suppose one can write a script, which would simply call 'openssl dgst -sha[1|256] -engine padlock' with a set of known input vectors...

> Could be. But should it be run automatically during make? I guess no...

No, but I'd like to *see* some test program and I'd like to hear explicit statement that the implementation passes this test. As you might recall we have tested AES by encrypting with software and decrypting with engine and then other way around. We need something even this time:-)

A.
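One possible shape for such a verification, in the spirit of the AES software-versus-engine check mentioned above, is a small harness that feeds the same inputs to two digest implementations and counts disagreements. The sketch below is an assumption-laden illustration: digest_fn, cross_check, toy_digest and toy_digest_buggy are hypothetical names, and the toy functions are NOT SHA-1. They only stand in for "software" and "engine" paths so the harness can be exercised without OpenSSL or PadLock hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIGEST_LEN 20   /* SHA-1 output size in bytes */

/* One-shot digest: hash len bytes of in into out[DIGEST_LEN]. */
typedef void (*digest_fn)(const unsigned char *in, size_t len,
                          unsigned char out[DIGEST_LEN]);

/* Run both implementations on inputs of every length from 0 to max_len
 * (capped at 1024) and return how many lengths produced different output. */
size_t cross_check(digest_fn ref, digest_fn dut, size_t max_len)
{
    static unsigned char input[1024];
    unsigned char a[DIGEST_LEN], b[DIGEST_LEN];
    size_t len, i, mismatches = 0;

    assert(ref != NULL && dut != NULL);
    if (max_len > sizeof input)
        max_len = sizeof input;
    for (i = 0; i < sizeof input; i++)
        input[i] = (unsigned char)(i * 31 + 7);   /* deterministic pattern */

    for (len = 0; len <= max_len; len++) {        /* crosses 55/56/64-byte edges */
        ref(input, len, a);
        dut(input, len, b);
        if (memcmp(a, b, DIGEST_LEN) != 0)
            mismatches++;
    }
    return mismatches;
}

/* Toy stand-in for a reference digest (not a real hash function). */
void toy_digest(const unsigned char *in, size_t len,
                unsigned char out[DIGEST_LEN])
{
    size_t i;

    memset(out, 0, DIGEST_LEN);
    for (i = 0; i < len; i++)
        out[i % DIGEST_LEN] ^= (unsigned char)(in[i] + i);
}

/* Same, but deliberately wrong from 64 bytes upward, modelling a bug
 * that only shows once a full block goes through the accelerated path. */
void toy_digest_buggy(const unsigned char *in, size_t len,
                      unsigned char out[DIGEST_LEN])
{
    toy_digest(in, len, out);
    if (len >= 64)
        out[0] ^= 1;
}
```

Sweeping every length rather than a few fixed vectors is a design choice: padding and buffering bugs tend to sit at block boundaries, and a published known-answer table would complement this rather than replace it.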
Re: PadLock engine SHA1 support
Andy Polyakov wrote:
>>> What happens when you issue the instruction without rep prefix?
>>
>> That's invalid instruction I believe.
>
> Dare to actually try?

Just tried => Invalid instruction ;-)

>>>> Instead it's necessary to accumulate all data from update()s in some buffer and hash them only in final().
>>>
>>> Note that there is EVP_MD_FLAG_ONESHOT, which can/should be used to avoid fallback to software at least for such cases.
>>
>> I have found this flag but didn't realise how to use it.
>
> If flag is set, just hash directly in update procedure and do nothing [but byte swapping?] in final. Instead of doing nothing but copying in update procedure and do hashing in final.

What if you need to do several updates before finally hashing it? E.g. like in HMAC? You need to store the data somewhere before actually hashing them with padlock...

>> And IIRC it's only used in one engine. After all I decided it's useless and wrote the software fallback path for SHA.
>
> Note that I didn't suggest to scrap software fallback [yet?], just to *complement* with a way to hash larger data chunk if it's readily available in one stroke.

How do you know that there won't be more data to come to update()? Maybe I'm missing something w.r.t. this ONESHOT option...?

You may want to generalise the software fallback path somewhere into openssl core, but the question is if it's worth the overhead now, when only one engine needs it.

> BTW, as for copying. As more than likely sensitive data gets copied into intermediate buffer, it's more than appropriate to zero it prior free. I only see memset on padlock intermediate state.
>
> A.

Yeah, right. Attached is an incremental diff addressing this issue.
Michal Ludvig
--
* Personal homepage: http://www.logix.cz/michal

Index: openssl-0.9.8/crypto/engine/eng_padlock.c
===================================================================
--- openssl-0.9.8.orig/crypto/engine/eng_padlock.c
+++ openssl-0.9.8/crypto/engine/eng_padlock.c
@@ -1153,6 +1153,7 @@ padlock_sha_bypass(struct padlock_digest
 	if (ddata->buf_start && ddata->used > 0) {
 		SHA1_Update(ddata->fallback_ctx, ddata->buf_start, ddata->used);
 		if (ddata->buf_alloc) {
+			memset(ddata->buf_start, 0, ddata->used);
 			free(ddata->buf_alloc);
 			ddata->buf_alloc = 0;
 		}
@@ -1266,6 +1267,7 @@ padlock_sha_final(EVP_MD_CTX *ctx, unsig
 	/* Pass the input buffer to PadLock microcode... */
 	padlock_do_sha1(ddata->buf_start, md, ddata->used);
+	memset(ddata->buf_start, 0, ddata->used);
 	free(ddata->buf_alloc);
 	ddata->buf_start = 0;
 	ddata->buf_alloc = 0;
@@ -1298,8 +1300,10 @@ padlock_sha_cleanup(EVP_MD_CTX *ctx)
 {
 	struct padlock_digest_data *ddata = DIGEST_DATA(ctx);
-	if (ddata->buf_alloc)
+	if (ddata->buf_alloc) {
+		memset(ddata->buf_start, 0, ddata->used);
 		free(ddata->buf_alloc);
+	}
 	memset(ddata, 0, sizeof(struct padlock_digest_data));
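A caveat about the "zero it prior to free" pattern in this diff: a plain memset() immediately followed by free() is a dead store as far as the compiler is concerned, and an optimizer is allowed to remove it. OpenSSL provides OPENSSL_cleanse() for this purpose. The sketch below shows the general idea with hypothetical helper names (wipe, wipe_and_free) and is not the code the patch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Overwrite len bytes at p with zeros through a volatile pointer, so the
 * stores count as observable and cannot be dropped before a free(). */
void wipe(void *p, size_t len)
{
    volatile unsigned char *vp = p;

    assert(p != NULL || len == 0);
    while (len--)
        *vp++ = 0;
}

/* Zero and release a heap buffer that may have held message data. */
void wipe_and_free(void *p, size_t len)
{
    if (p == NULL)
        return;
    wipe(p, len);
    free(p);
}
```

In the patch context, the three memset()/free() pairs would collapse into one such helper call each, which also keeps the wipe length and the pointer being freed next to each other.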
Re: PadLock engine SHA1 support
> The intermediate status (and finally the result) is stored in the 128Bytes memory array in padlock_do_sha1(). I.e. it's context switch safe.

>> What happens when you issue the instruction without rep prefix?
>
> That's invalid instruction I believe.

Dare to actually try?

>>> Instead it's necessary to accumulate all data from update()s in some buffer and hash them only in final().
>>
>> Note that there is EVP_MD_FLAG_ONESHOT, which can/should be used to avoid fallback to software at least for such cases.
>
> I have found this flag but didn't realise how to use it.

If flag is set, just hash directly in update procedure and do nothing [but byte swapping?] in final. Instead of doing nothing but copying in update procedure and do hashing in final.

> And IIRC it's only used in one engine. After all I decided it's useless and wrote the software fallback path for SHA.

Note that I didn't suggest to scrap software fallback [yet?], just to *complement* with a way to hash larger data chunk if it's readily available in one stroke.

BTW, as for copying. As more than likely sensitive data gets copied into intermediate buffer, it's more than appropriate to zero it prior free. I only see memset on padlock intermediate state.

A.
Re: PadLock engine SHA1 support
> the attached patch adds SHA1 support for VIA PadLock engine.

Did VIA publish documentation for new instructions on their web-site? If not and you have it, can you send a copy to me?

> There are several design decisions that I may need to explain: The xsha1 instruction always finalizes the MD computation,

That kind of sucks...

> i.e. it is not possible to call the hardware in sha1_update() with the provided input buffer.

But the instruction with rep prefix is interruptible, i.e. can be exposed to context switch, right? That would mean that all the intermediate status has to be kept somewhere, either in visible registers or off-loaded to memory. What happens when you issue the instruction without rep prefix?

> Instead it's necessary to accumulate all data from update()s in some buffer and hash them only in final().

Note that there is EVP_MD_FLAG_ONESHOT, which can/should be used to avoid fallback to software at least for such cases.

> In padlock_init() I allocate a buffer of a given size (8k as well) whose first 16B-aligned address goes to buf_start. Having the input data aligned allows PadLock to crunch them faster.

Is 16B-alignment for input a requirement even for SHA? "Even" refers to AES...

A.
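The "accumulate in update(), hash in final()" design described in this message can be sketched independently of OpenSSL. Everything named here is hypothetical (struct accum, accum_update, accum_final, oneshot_fn, sum_bytes); the one-shot callback stands in for a hash that always finalizes, and sum_bytes is a toy, not a digest. The 8192-byte starting capacity echoes the 8k buffer mentioned in the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One-shot hash callback standing in for a finalize-only hardware path. */
typedef void (*oneshot_fn)(const unsigned char *in, size_t len,
                           unsigned char *md);

struct accum {
    unsigned char *buf;   /* bytes collected so far */
    size_t used, cap;
};

/* update(): only copy the data, growing the buffer as needed.
 * Returns 1 on success, 0 if memory could not be allocated. */
int accum_update(struct accum *a, const void *data, size_t len)
{
    assert(a != NULL);
    if (len == 0)
        return 1;
    if (len > a->cap - a->used) {
        size_t ncap = a->cap ? a->cap : 8192;
        unsigned char *nbuf;

        while (ncap - a->used < len)
            ncap *= 2;
        nbuf = realloc(a->buf, ncap);   /* note: the old block is not wiped */
        if (nbuf == NULL)
            return 0;
        a->buf = nbuf;
        a->cap = ncap;
    }
    memcpy(a->buf + a->used, data, len);
    a->used += len;
    return 1;
}

/* final(): hash everything in one call, then clear and release the buffer. */
void accum_final(struct accum *a, oneshot_fn hash, unsigned char *md)
{
    hash(a->buf, a->used, md);
    if (a->buf != NULL) {
        memset(a->buf, 0, a->used);     /* real code needs a wipe that cannot be optimized away */
        free(a->buf);
    }
    a->buf = NULL;
    a->used = a->cap = 0;
}

/* Toy one-shot "hash" for demonstration only: md[0] = byte sum mod 256. */
void sum_bytes(const unsigned char *in, size_t len, unsigned char *md)
{
    size_t i;

    md[0] = 0;
    for (i = 0; i < len; i++)
        md[0] = (unsigned char)(md[0] + in[i]);
}
```

The cost of this design is visible in the sketch: memory use grows with the total message size, which is one reason a software fallback for large inputs comes up in the discussion.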
Re: PadLock engine SHA1 support
Andy Polyakov wrote:
>> The xsha1 instruction always finalizes the MD computation,
>
> That kind of sucks...

Hopefully the next version of the CPU will have a new hashing instruction that will finalize only on request. I was already in touch with the CPU architects, explained to them what problems the current design brings to us and they agreed to improve it.

>> i.e. it is not possible to call the hardware in sha1_update() with the provided input buffer.
>
> But the instruction with rep prefix is interruptible, i.e. can be exposed to context switch, right? That would mean that all the intermediate status has to be kept somewhere, either in visible registers or off-loaded to memory.

The intermediate status (and finally the result) is stored in the 128Bytes memory array in padlock_do_sha1(). I.e. it's context switch safe.

> What happens when you issue the instruction without rep prefix?

That's invalid instruction I believe.

>> Instead it's necessary to accumulate all data from update()s in some buffer and hash them only in final().
>
> Note that there is EVP_MD_FLAG_ONESHOT, which can/should be used to avoid fallback to software at least for such cases.

I have found this flag but didn't realise how to use it. And IIRC it's only used in one engine. After all I decided it's useless and wrote the software fallback path for SHA.

>> In padlock_init() I allocate a buffer of a given size (8k as well) whose first 16B-aligned address goes to buf_start. Having the input data aligned allows PadLock to crunch them faster.
>
> Is 16B-alignment for input a requirement even for SHA? "Even" refers to AES...
>
> A.

No, alignment is required only for the output buffer (128 Bytes in padlock_do_sha1()), but having the input aligned improves performance a lot. Because we're copying the data anyway, we can copy it to an aligned address.

BTW in VIA Esther the buffers for AES can be unaligned in some cases as well. I'll come up with a patch.
Michal Ludvig
--
* Personal homepage: http://www.logix.cz/michal
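The buffer layout described above (allocate, then point buf_start at the first 16-byte-aligned address inside the allocation) might look roughly like this. The struct and function names are hypothetical; the two fields mirror the roles that buf_alloc and buf_start play in the patch, but this is not the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_ALIGN 16   /* alignment wanted for the input data */

struct aligned_buf {
    void *alloc;            /* pointer returned by malloc(), the one to free() */
    unsigned char *start;   /* first 16-byte-aligned address inside alloc */
};

/* Allocate room for size usable bytes starting at an aligned address.
 * Returns 1 on success, 0 on allocation failure. */
int aligned_buf_init(struct aligned_buf *b, size_t size)
{
    uintptr_t addr;
    size_t skip;

    assert(b != NULL);
    b->alloc = malloc(size + BUF_ALIGN - 1);   /* slack for the worst-case skip */
    if (b->alloc == NULL) {
        b->start = NULL;
        return 0;
    }
    addr = (uintptr_t)b->alloc;
    skip = (BUF_ALIGN - addr % BUF_ALIGN) % BUF_ALIGN;   /* 0..15 bytes */
    b->start = (unsigned char *)b->alloc + skip;
    return 1;
}
```

Keeping both pointers is the point of the two-field layout: data goes to start, while only alloc may be handed back to free().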