Begin forwarded message:

> From: Brian Warner <[email protected]>
> Date: June 3, 2009 5:22:23 AM MDT
> To: [email protected]
> Subject: [tahoe-dev] pycryptopp vs ARM
> Reply-To: [email protected]
>
>
> I did some digging into the mysterious pycryptopp corruption failures on
> Zandr's ARM-based linkstation. The executive summary: crypto++-5.5.2 is
> broken, 5.6.0 probably fixes it, the problem won't appear on x86 or most
> other processors, and there's a workaround but it's too ugly to
> contemplate.
>
>
> == The problem with crypto++-5.5.2 and ARM ==
>
> The main workhorse of AES CTR_Mode is strciphr.cpp's
> AdditiveCipherTemplate<S>::ProcessData. This is responsible for each call
> to AES.ProcessData(), which means it is effectively generating a stream
> cipher that looks like:
>
>  16 bytes of AES.encode(00000) ("00000" means 128 bits of zeros)
>  16 bytes of AES.encode(00001) (127 bits of zeros and a single "one" bit)
>  16 bytes of AES.encode(00002)
>  etc..
>
> and XORing that stream cipher with the input data to produce the output
> data. Each call to ProcessData() can involve an arbitrary amount of data,
> but always works with adjacent spans of the stream, so it is responsible
> for keeping track of how many bytes have been processed already: an
> offset from the beginning of the stream.
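>
> To make that shape concrete, here's a minimal sketch of the counter/XOR
> structure just described (my own illustration, not crypto++ code;
> encrypt_block() is a placeholder standing in for AES under a fixed key):
>
>   #include <cstdint>
>   #include <cstddef>
>   #include <cstring>
>
>   // placeholder: a real implementation runs AES on the counter block
>   static void encrypt_block(const uint8_t counter[16], uint8_t keystream[16]) {
>       std::memcpy(keystream, counter, 16);
>   }
>
>   // 128-bit increment with carry: 00000 -> 00001 -> 00002 -> ...
>   static void increment_counter(uint8_t counter[16]) {
>       for (int i = 15; i >= 0; --i)
>           if (++counter[i] != 0) break;
>   }
>
>   // XOR the keystream into 'in' to produce 'out', 16 bytes per counter block
>   void ctr_xor(uint8_t counter[16], const uint8_t *in, uint8_t *out, size_t len) {
>       uint8_t keystream[16];
>       for (size_t off = 0; off < len; off += 16) {
>           encrypt_block(counter, keystream);
>           increment_counter(counter);
>           size_t n = (len - off < 16) ? (len - off) : 16;
>           for (size_t i = 0; i < n; ++i)
>               out[off + i] = in[off + i] ^ keystream[i];
>       }
>   }
>
> (That version always starts at a block boundary; the bookkeeping below is
> what lets crypto++ resume in the middle of a block.)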
>
> It remembers this offset in two pieces. The high-order piece is the
> number of 16-byte blocks which have been encoded, stored in
> "m_counterArray" as a 128-bit/16-byte number (remember that we're using
> AES here, so the key size can be either 128 or 256 bits, but the block
> size is always 128 bits/16 bytes).
>
> The low-order piece is stored backwards in "m_leftOver": upon entry to
> ProcessData, if we're sitting at offset=0,16,32, then m_leftOver=0, but
> if we're sitting at offset=1,17,33 then m_leftOver=15, and if we're at
> offset=15,31,47 then m_leftOver=1.
>
> The AES object has a buffer named "m_buffer" which is used to hold the
> leftover stream cipher blocks between calls to ProcessData. It always
> holds exactly 16 bytes, and always updates these bytes as a unit.
> m_buffer[0] holds the stream byte that corresponds to offset=0,16,32.
> m_buffer[1] holds the byte that is used for offset=1,17,33.
>
> m_leftOver then tells you which of the right-most bytes in m_buffer[]
> are waiting to be used. If m_leftOver=3, then m_buffer[13..15] are
> waiting to be XORed to provide offset=13..15, or offset=29..31.
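>
> In other words (my own restatement of the bookkeeping, not crypto++
> source), m_leftOver is just (16 - offset%16) % 16:
>
>   #include <cstdio>
>
>   int main() {
>       // offset=9 gives m_leftOver=7, which is the case in the failure below
>       const unsigned offsets[] = {0, 1, 9, 15, 16, 17, 31, 47};
>       for (unsigned offset : offsets) {
>           unsigned leftOver = (16 - offset % 16) % 16;  // unused bytes in m_buffer
>           std::printf("offset=%2u -> m_leftOver=%2u\n", offset, leftOver);
>       }
>       return 0;
>   }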
>
> ProcessData() has three phases:
>  1: use up any leftover stream data sitting in m_buffer from last time
>  2: process as many full 16-byte blocks as possible
>  3: process a trailing partial block, if any, putting the leftover
>     stream in m_buffer
>
> Phase 1 happens if m_leftOver>0 and works by just XORing the right
> offset in m_buffer with the input data, and incrementing all the
> pointers. It does this one byte at a time: not as efficient as you might
> like, but it's never being used for more than 15 bytes per operation.
>
> Phase 2 has two forms: some operations can handle multiple blocks at
> once very efficiently (remember that ProcessData() is pretty generic and
> is used by lots of different codepaths), so it sets up the underlying
> operation (i.e. AES256) to encrypt-and-XOR a couple of blocks and says
> Go. The operation code that is passed in to OperateKeystream() includes
> a note that says whether the input and output string pointers are
> aligned, but many operations (including
> modes.cpp:CTR_ModePolicy::OperateKeystream) ignore this note.
>
> If the first form couldn't be used, it falls back to the second form,
> which has a while(length>=bufferByteSize) loop and explicitly writes the
> keystream into m_buffer, XORs it with the input string into the output
> string, and increments all the pointers. This XOR loop will run one byte
> at a time if its arguments aren't aligned.
>
> Phase 3 (which only happens if the remaining length is nonzero) writes
> the keystream into m_buffer, XORs just the partial block (one byte at a
> time), then sets m_leftOver so that the next time through, Phase 1 will
> use the previously generated keystream. A given keystream block is only
> generated once.
>
> Note that Phase 1 and Phase 3 (and the second form of Phase 2, but not
> the first) all use byte-at-a-time loops. Also note that Phase 2 is used
> for the bulk of the data, so it wants to be as fast as possible.
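>
> Paraphrased in code, the structure is roughly this (a sketch of the shape
> described above; the names echo crypto++ but this is not the real
> strciphr.cpp, and GenerateKeystreamBlock() is a stub):
>
>   #include <cstdint>
>   #include <cstddef>
>
>   struct CtrSketch {
>       uint8_t m_buffer[16] = {};
>       unsigned m_leftOver = 0;
>
>       void GenerateKeystreamBlock(uint8_t buf[16]) {
>           for (int i = 0; i < 16; ++i) buf[i] = 0;  // placeholder for AES(counter++)
>       }
>
>       void ProcessData(uint8_t *out, const uint8_t *in, size_t length) {
>           // Phase 1: drain leftover keystream bytes, one byte at a time
>           while (m_leftOver > 0 && length > 0) {
>               *out++ = *in++ ^ m_buffer[16 - m_leftOver];
>               --m_leftOver; --length;
>           }
>           // Phase 2: whole blocks. The real code hands these to
>           // OperateKeystream() when it can; note that out/in are no longer
>           // word-aligned here if Phase 1 consumed a prefix whose length
>           // wasn't a multiple of 4.
>           while (length >= 16) {
>               GenerateKeystreamBlock(m_buffer);
>               for (int i = 0; i < 16; ++i) out[i] = in[i] ^ m_buffer[i];
>               out += 16; in += 16; length -= 16;
>           }
>           // Phase 3: trailing partial block; keep unused keystream around
>           if (length > 0) {
>               GenerateKeystreamBlock(m_buffer);
>               for (size_t i = 0; i < length; ++i) out[i] = in[i] ^ m_buffer[i];
>               m_leftOver = 16 - (unsigned)length;
>           }
>       }
>   };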
>
> == The Gun On The Mantle In Act One ==
>
> All full-size CPUs like big fat memory reads, for efficiency: they've
> got 32- or 64-bit memory buses, and read whole cache lines at a time.
> They prefer aligned memory reads: x=*((int32_t*)0x0), because unaligned
> reads like x=*((int32_t*)0x1) or x=*((int32_t*)0x2) actually require the
> processor to perform two reads (the first of 0x0-0x3, the second of
> 0x4-0x7) and then shuffle bytes around to get them into the right place.
> Memory writes behave similarly.
>
> (microcontrollers, at least those with an 8-bit bus, don't care about
> alignment. In fact, many of them don't even have 16- or 32-bit
> operations, so the issue never comes up).
>
> Some processors will begrudgingly perform unaligned reads for you, but
> they'll be slow and grouchy about it. Some will throw a hardware
> exception, which might be caught by the kernel and emulated (meaning
> it'll work, but now the kernel is grouchy too, and things are really
> really slow), or might be delivered as a SIGBUS to your program (which
> will probably kill it).
>
> The ARM processor has an exciting quirk. In certain configurations
> (which depend upon what the kernel wants to do), it uses a third mode,
> in which unaligned reads cause the specific byte to be loaded correctly
> but the remaining bytes get shifted around. This effectively corrupts
> most of the loaded word.
> http://www.aleph1.co.uk/chapter-10-arm-structured-alignment-faq has a
> good writeup on the issue: their basic example is (remember these are
> little-endian):
>
>         Address: 0  1  2  3  4  5  6  7
>         Value  : 10 21 66 23 ab 5e 9c 1d
>
>   Using *(unsigned long*)2 would give:
>
>         on x86: 0x5eab2366
>         on ARM: 0x21102366
>
> Similar things happen if you try to do an unaligned write: instead of
> modifying bytes [2..5], you'll modify bytes [0..3] with some
> surprisingly rotated version of what you were writing. Byte [2] might
> get the right thing, but byte [0] will be clobbered.
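>
> If I understand the quirk correctly, the load fetches the *aligned* word
> containing the address and then rotates it right by 8 times the byte
> offset. Reproducing the numbers from the example above in plain,
> well-defined C++ (no actual unaligned access happens here):
>
>   #include <cstdint>
>   #include <cstdio>
>
>   static uint32_t ror32(uint32_t v, unsigned bits) {
>       return bits ? (v >> bits) | (v << (32 - bits)) : v;
>   }
>
>   int main() {
>       const uint8_t mem[8] = {0x10, 0x21, 0x66, 0x23, 0xab, 0x5e, 0x9c, 0x1d};
>       const unsigned addr = 2;  // the unaligned address from the example
>
>       // what x86 gives: a straight little-endian read of bytes 2..5
>       uint32_t x86 = mem[addr] | mem[addr+1] << 8 | mem[addr+2] << 16
>                    | (uint32_t)mem[addr+3] << 24;
>
>       // what the quirky ARM mode gives: the aligned word (bytes 0..3),
>       // rotated right by 8 * (addr & 3) bits
>       const unsigned base = addr & ~3u;
>       uint32_t aligned = mem[base] | mem[base+1] << 8 | mem[base+2] << 16
>                        | (uint32_t)mem[base+3] << 24;
>       uint32_t arm = ror32(aligned, 8 * (addr & 3));
>
>       std::printf("x86: 0x%08x  ARM: 0x%08x\n", x86, arm);
>       // prints: x86: 0x5eab2366  ARM: 0x21102366
>       return 0;
>   }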
>
> == The Gun On The Mantle In Act Two ==
>
> The function in Phase 2 which generates the stream data bottoms out in
> rijndael.cpp's Rijndael::Enc::ProcessAndXorBlock, which takes three
> (byte *) pointers (plus the internal key, already prepared and turned
> into subkeys): the input block (i.e. the counter), the XOR block (i.e.
> the plaintext), and the output block (i.e. the ciphertext). Most of the
> code in rijndael.cpp is x86 assembly code, but when that's disabled
> (hint: use M-x cpp-highlight-buffer to make emacs colorize or hide code
> inside specific conditionals, like !defined(CRYPTOPP_X86_ASM_AVAILABLE))
> the part that's left is the regular C AES implementation. It shuffles a
> bunch of data around internally and then does something like this:
>
>       word32 *const obw = (word32 *)outBlock;
>       const word32 *const xbw = (const word32 *)xorBlock;
>
>       obw[0] = tbw[0] ^ xbw[0] ^ rk[0];
>
> That means it's taking the xorBlock plaintext pointer (a byte*),
> pretending it's really a 4-byte (word32*), and then reading from it one
> word at a time. If xorBlock's address is not actually a multiple of 4,
> then this will be an unaligned read. The write is doing the same thing,
> casting outBlock to a word32* and then writing words into it, so you can
> get an unaligned write from this. This doesn't occur on x86 because the
> assembly version gets used instead of the C version (I assume that
> either the assembly handles misalignment explicitly, or that the x86
> behaves correctly in the face of unaligned accesses).
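>
> To restate the trap with hypothetical addresses (nothing pycryptopp
> actually prints): the output buffer itself is word-aligned, but once
> Phase 1 has consumed a leftover whose length isn't a multiple of 4, the
> pointer that reaches ProcessAndXorBlock isn't aligned any more, and the
> word32* cast turns every store into an unaligned store:
>
>   #include <cstdint>
>   #include <cstdio>
>
>   int main() {
>       alignas(4) uint8_t output[64] = {};  // like a fresh Python string: 4-byte aligned
>       uint8_t *outBlock = output + 7;      // after Phase 1 has written 7 leftover bytes
>
>       uintptr_t addr = reinterpret_cast<uintptr_t>(outBlock);
>       std::printf("outBlock is %s (addr %% 4 == %u)\n",
>                   (addr % 4 == 0) ? "aligned" : "NOT aligned",
>                   (unsigned)(addr % 4));
>       // crypto++-5.5.2 then does:
>       //     word32 *const obw = (word32 *)outBlock;
>       //     obw[0] = ...;
>       // and on the quirky ARM configuration that store lands on the
>       // containing aligned word, clobbering bytes that were already written.
>       return 0;
>   }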
>
> == The Gun Is Fired In Act Three ==
>
> The sequence of operations that triggers a pycryptopp failure on ARM is
> when an AES encryption instance is used to process two chunks of data in
> which the first chunk is less than 16 bytes long and the two chunks
> combined are at least 32 bytes long. The Python code looks something
> like this:
>
>  expected = AES(key).process(plaintext)
>  e2 = AES(key)
>  got = e2.process(plaintext[:9]) + e2.process(plaintext[9:9+23])
>  assert expected == got # fails
>
> In particular, on ARM we see something like:
>  09,23: 0551d7974ff1d9e29c 9d883746f146a6,ffc9ce5ecaa9abd890da470e46000000 <--GOT
>  09,23: 0551d7974ff1d9e29c 9d883746080343,f15abafbc9ca5ac6a9a7eeaada790a42 <--EXPECTED
>
> (where the space is the break between the 9-byte call and the 23-byte
> call, and the comma is where the 16-byte blocksize lands).
>
> We also see failures with other length combinations: 9+24, 10+22,
> 10+23, 10+24, 15+17, etc.
>
> Let's look specifically at 9+23. The first call to process() will skip
> Phase 1 (since this is the first time we've used this object), will skip
> Phase 2 (since we're not processing a full block), and only run Phase 3.
> Phase 3 will generate the stream data for offset=0..15 and store it in
> m_buffer, then XOR the first 9 bytes of that to produce the ciphertext,
> and store it in the output pointer. Note that pycryptopp/Python
> allocates a new string for each call to ProcessData(), and those strings
> are always 4-byte aligned. Phase 3 sets m_leftOver to 7, since we've
> only used 9 bytes out of the 16 in m_buffer. This works fine. The 9
> bytes of ciphertext (0551d7974ff1d9e29c) returned by process() are
> correct. The output buffer so far:
>
>  09,23: 0551d7974ff1d9e29c
>
> The second call to process() wants to process 23 bytes of plaintext. It
> allocates a Python string for the result; let's pretend it gets address
> 0x1000 (it will generally be 4-byte aligned, so it could just as easily
> be at 0x1004 or 0x1008 or 0x100c). Phase 1 will see that m_leftOver=7,
> so it byte-wise XORs the remaining stream data from m_buffer into the
> output and gets 9d883746080343. So far, so good. Phase 1 finishes by
> incrementing the output pointer, in this case 0x1000+7=0x1007, and
> decrementing the remaining length to be processed, 23-7=16. The output
> buffer now has:
>
>  09,23: 0551d7974ff1d9e29c 9d883746080343
>
> Phase 2 wants to work with whole blocks, and since length>=16, it gets
> to run. It calls OperateKeystream() and tells it to encrypt-and-XOR data
> using a ciphertext target pointer (outBlock) of... 0x1007. Here is the
> problem. That 32-bit cast and obw[0]= dereference causes an unaligned
> access, and on ARM you wind up with writes to [0x1004-0x1007] instead of
> [0x1007-0x100a]. This clobbers the bytes which were already written,
> giving us:
>
>  09,23: 0551d7974ff1d9e29c 9d883746080343
>                                           xxxxxxxx  <- expected write
>                                    f146a6,ff   <- actual write
>  09,23: 0551d7974ff1d9e29c 9d883746f146a6,ff   <- resulting data
>
> The rest of Phase 2 does more damage. I think the misalignment of the
> plaintext pointer (xorBlock) is affecting things too, which is why the
> corrupted ciphertext looks both shifted and bit-flipped from the
> expected value.
>
> I instrumented the ProcessData() code in cryptopp-5.5.2 to confirm that
> the last byte of the output buffer was correct (0x43) at the end of
> Phase 1 and then clobbered (0xa6) at the end of Phase 2.
>
> This corruption doesn't happen if Phase 1 didn't run, since then the
> pointers are still equal to the original (4-byte aligned) Python string
> buffer. That means that if your previous call to ProcessData() left the
> stream offset on a 16-byte boundary, then you're fine.
>
> It also doesn't happen unless Phase 2 runs, which requires that there be
> at least 16 bytes left to process (so it can do a full block). And it
> only happens on ARM, where unaligned accesses give you this weird
> corruption behavior (on other chips you might get a trap or a
> hard-to-spot performance problem).
>
> Calling ProcessData() only once per AES instance won't trigger it.
> Always calling ProcessData() with short strings (<16 bytes) won't
> trigger it, and always calling it with strings whose lengths are
> multiples of 16 bytes won't trigger it.
>
> == Episode 4: A New Hope ==
>
> cryptopp-5.6.0 changes the AES implementation considerably. I haven't
> traced through it enough to be sure, but I suspect that they've fixed
> the problem, because of a new #define
> CRYPTOPP_ALLOW_UNALIGNED_DATA_ACCESS (which is conservatively only set
> on x86, x64, and PowerPC). In addition, the vital write in rijndael.cpp
> now looks like:
>
>   Block::Put(xorBlock, outBlock)(tbw[0]^rk[0])(tbw[1]^rk[1])(tbw[2]^rk[2])(tbw[3]^rk[3]);
>
> and is done without casting xorBlock or outBlock to word32*. I can't
> find where Block::Put is defined, but I'm encouraged that this will fix
> the problem. The terse release notes for 5.6.0 don't mention
> intentionally fixing any problem like this.
>
> I've got a library compiling now to test this. I'll have to let it run
> overnight... compiling libcrypto++ on Zandr's linkstation takes about 3
> hours. (Incidentally, it runs a lot faster if you edit the makefile to
> use -O0.)
>
> What this means is that libcrypto++-5.5.2 has a grave bug on ARM, which
> is probably fixed in the current libcrypto++-5.6.0. If we really wanted
> to, pycryptopp could work around the bug by keeping track of how many
> bytes we'd submitted to ProcessData() and splitting up the input data
> (at most one split per call to .process) such that we either tell
> ProcessData() to start at a 16-byte boundary, or we give it less than 16
> bytes of data to work with. Something like:
>
>     def process(self, plaintext):
>         if len(plaintext) < 16 or self.counter % 16 == 0:
>             self.counter += len(plaintext)
>             return self.e.ProcessData(plaintext)
>         gap = 16 - (self.counter % 16)
>         return self.process(plaintext[:gap]) + self.process(plaintext[gap:])
>
> That ought to sound crazy to you.
>
> Lenny has 5.5.2, and while we should develop a minimal (C++)
> demonstration case and file a "serious" or "grave" debian/lenny bug on
> libcrypto++7 (specific to arm/armel), I think it's unlikely that we
> could convince them to issue a security fix and upgrade stable all the
> way to 5.6.0 for just one architecture. So ARM boxes that want to use
> pycryptopp with --disable-embedded-cryptopp will need to upgrade to sid
> (which has 5.6.0). Fortunately, pycryptopp.deb isn't going to go into
> Lenny anyway (since Lenny has already shipped), so this is less of an
> issue for the get-pycryptopp-into-Debian effort, just for people who
> want to use pycryptopp in the syslib form on a lenny/stable ARM box. We
> may want to consider building a backported cryptopp-5.6.0 for these
> folks.
>
> The embedded-cryptopp form should work (assuming my hopes for Block::Put
> are justified), because Zooko recently upgraded the embedded copy to
> 5.6.0.
>
> We should also look at crypto++'s test suite and make sure there's a
> case which uses a single encryption object for two ProcessData()
> operations (with various lengths), so this case is being exercised right
> from the source. I've attached the Python equivalent below.
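>
> A sketch of what such a test could look like against the public
> CTR_Mode<AES> interface (this is not from the crypto++ test suite, and
> the key/plaintext here are arbitrary):
>
>   #include <cryptopp/aes.h>
>   #include <cryptopp/modes.h>
>   #include <cstring>
>   #include <cstdio>
>
>   int main() {
>       using namespace CryptoPP;
>       byte key[32] = {0};            // arbitrary test key
>       byte iv[AES::BLOCKSIZE] = {0};
>       byte plaintext[52];
>       for (int i = 0; i < 52; ++i) plaintext[i] = 'A' + (i % 26);
>
>       // one-shot encryption as the reference
>       byte expected[52];
>       CTR_Mode<AES>::Encryption ref;
>       ref.SetKeyWithIV(key, sizeof(key), iv);
>       ref.ProcessData(expected, plaintext, sizeof(plaintext));
>
>       // the same data encrypted in two ProcessData() calls of varying sizes
>       int failures = 0;
>       for (size_t i = 1; i < 16; ++i) {
>           for (size_t j = 1; i + j <= sizeof(plaintext); ++j) {
>               byte got[52];
>               CTR_Mode<AES>::Encryption enc;
>               enc.SetKeyWithIV(key, sizeof(key), iv);
>               enc.ProcessData(got, plaintext, i);
>               enc.ProcessData(got + i, plaintext + i, j);
>               if (std::memcmp(got, expected, i + j) != 0) {
>                   std::printf("mismatch at split %zu+%zu\n", i, j);
>                   ++failures;
>               }
>           }
>       }
>       if (failures) std::printf("%d failing splits\n", failures);
>       else          std::printf("no failures detected\n");
>       return failures ? 1 : 0;
>   }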
>
> cheers,
>  -Brian
>
>
> #! /usr/bin/python
>
> from pycryptopp.cipher.aes import AES
> from binascii import a2b_hex, b2a_hex
>
> key = a2b_hex("54b0789aeeddb6b3351e6d7b8d79d8d489582376ab1b322dd3362ccbfdb82f7a")
> assert len(key) == 32
> plaintext = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
> assert len(plaintext) == 52
> expected_ciphertext = AES(key=key).process(plaintext)
> #print b2a_hex(expected_ciphertext[:16]), b2a_hex(expected_ciphertext[16:32]), b2a_hex(expected_ciphertext[32:48])
> expected_s = ("0551d7974ff1d9e29c9d883746080343" + "f15abafbc9ca5ac6a9a7eeaada790a42" +
>               "73a45de485a97ecd52d79fe702a07a67" + "9b8d73ad")
> assert b2a_hex(expected_ciphertext) == expected_s
>
> errors = []
> for i in range(1, 25):
>     for j in range(1, 25):
>         p = AES(key=key)
>         a = p.process(plaintext[:i])
>         assert len(a) == i
>         b = p.process(plaintext[i:i+j])
>         assert len(b) == j
>         c = a+b
>         if c != expected_ciphertext[:i+j]:
>             errors.append((i,j,a,b,expected_ciphertext[:i],expected_ciphertext[i:i+j]))
> for (i,j,a,b,ea,eb) in errors:
>     print "%02d,%02d: %s %s <--GOT" % (i, j, b2a_hex(a), b2a_hex(b))
>     print "%02d,%02d: %s %s <--EXPECTED" % (i, j, b2a_hex(ea), b2a_hex(eb))
>     print
> if not errors:
>     print "no errors detected"
> _______________________________________________
> tahoe-dev mailing list
> [email protected]
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev

