Re: [bitcoin-dev] Libre/Open blockchain / cryptographic ASICs

ZmnSCPxj via bitcoin-dev Thu, 11 Feb 2021 00:21:22 -0800

Good morning Luke,

> > (to be fair, there were tools to force you to improve coverage by injecting 
> > faults to your RTL, e.g. it would virtually flip an `&&` to an `||` and if 
> > none of your tests signaled an error it would complain that your test 
> > coverage sucked.)
>
> nice!


It should be possible for a tool to be developed to parse a Verilog RTL design, 
then generate a new version of it with one change.
Then you could add some automation to run a set of testcases around mutated 
variants of the design.
For example, it could create a "wrapper" module that connects to an unmutated 
differently-named version of the design, and various mutated versions, wire all 
their inputs together, then compare outputs.
If the testcase could trigger an output of a mutated version to be different 
from the reference version, then we would consider that mutation covered by 
that testcase.
Possibly that could be done with Verilog-2001 file writing code in the wrapper 
module to dump out which mutations were covered, then a summary program could 
just read in the generated file.
Or Verilog plugins could be used as well (Icarus supports this, that is how it 
implements all `$` functions).

A drawback is that just because an output is different does not mean the 
testcase actually ***checks*** that output.
If the testcase does not check the diverging output, then the mutation is still 
not properly covered, even though the wrapper would report it as covered.
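
As a rough sketch of the mutation-coverage idea above (in Python rather than 
Verilog, purely as a behavioral model): `reference` stands in for the unmutated 
design, each entry of `mutants` differs from it by one operator, and a mutation 
counts as covered if some testcase makes its output differ from the reference. 
All names here are hypothetical, and a differing output still only matters if 
the testcase actually checks it.

```python
import operator

AND = operator.and_
OR = operator.or_


def make_dut(op_a, op_b):
    """Hypothetical stand-in for an RTL module: out = (a op_a b) op_b c."""
    def dut(a, b, c):
        return op_b(op_a(a, b), c)
    return dut


# Unmutated design: out = (a & b) | c
reference = make_dut(AND, OR)

# Each mutant differs from the reference by exactly one operator.
mutants = {
    "first & -> |": make_dut(OR, OR),
    "second | -> &": make_dut(AND, AND),
}


def covered_mutants(testcases):
    """Return the mutants whose output diverges from the reference on some testcase."""
    covered = set()
    for name, mutant in mutants.items():
        for vector in testcases:
            if mutant(*vector) != reference(*vector):
                covered.add(name)
                break
    return covered


weak_tests = [(1, 1, 1)]
better_tests = [(1, 1, 1), (1, 0, 0), (1, 1, 0)]

print(sorted(covered_mutants(weak_tests)))
print(sorted(covered_mutants(better_tests)))
```

With the single all-ones vector neither mutant is distinguishable from the 
reference, so nothing is reported as covered; the two extra vectors each expose 
one of the mutants.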

The point of this is to check coverage of the tests.
Not sure how well this works with formal validation.



> > Synthesis in particular is a black box and each vendor keeps their 
> > particular implementations and tricks secret.
>
> sigh.  i think that's partly because they have to insert diodes, and buffers, 
> and generally mess with the netlist.
>
> i was stunned to learn that in a 28nm ASIC, 50% of it is repeater-buffers!

Well, that surprises me as well.

On the other hand, smaller technologies consistently have lower raw output 
current driving capability due to the smaller size, and as trace width goes 
down and frequency goes up they stop acting like ideal 0-impedance traces and 
start acting more like transmission lines.
So I suppose at some point something like that would occur and I should not 
actually be surprised.
(Maybe I am more surprised that it reached that level at that technology size, 
I would have thought 33% at 7nm.)

In the modules where we were doing manual netlist+layout, we used inverting 
buffers instead (slightly smaller than non-inverting buffers; in most 
technologies a non-inverting buffer is just an inverter followed by an 
inverting buffer).
This was an advantage of manual design, since it looks like synthesis tools are 
not willing to invert the contents of intermediate flip-flops even if using an 
inverting buffer rather than a non-inverting one could give a theoretical 
speed+size advantage (it looks like synthesis optimization starts at the output 
of flip-flops and ends at their input, so a manual designer could achieve 
slightly better performance if they were willing to invert an intermediate 
flip-flop).
Another was that inverting latches were smaller in the technology we were using 
than non-inverting latches, so it was perfectly natural for us to use an 
inverting latch and an inverting buffer on those parts where we needed higher 
fan-out (it was equivalent to a "custom" latch that had higher-than-normal 
driving capability).

Scan chain test generation was impossible though, as those require flip-flops, 
not latches.
Fortunately this was "just" deserialization of high-frequency low-width data 
with no transformation of the data (that was done after the deserialization, at 
lower clock speeds but higher data width, in pure RTL so flip-flops), so it was 
judged acceptable that it would not be covered by scan chain, since scan chain 
is primarily for testing combinational logic between flip-flops.
So we just had flip-flops at the input, and flip-flops at the output, and 
forced all latches to pass-through mode, during scan mode.
We just needed to have enough coverage to uncover stuck-at faults (which was 
still a pain, since additional test vectors slow down manufacturing so we had 
to reduce the test vectors to the minimum possible) in non-scan-mode testing.

Man, making ASICs was tough.


>
> plus, they make an awful lot of money, it is good business.
>
> > Pointing some funding at the open-source Icarus Verilog might also fit, as 
> > it lost its ability to do synthesis more than a decade ago due to inability 
> > to maintain.
>
> ah i didn't know it could do synthesis at all! i thought it was simulation 
> only.

Icarus was the only open-source synthesis tool I could find back then, and it 
dropped synthesis capability fairly early due to maintenance burden (I never 
managed to get the old version with synthesis compiled and never managed actual 
synthesis on it, so my knowledge of it is theoretical).


There is an argument that open-source software is not truly open-source unless 
it can be compiled by open-source compilers or executed by open-source 
interpreters.
Similarly, I think open-source hardware RTL designs are not truly open-source 
if there are no open-source synthesis tools that can synthesize it to netlist 
and then lay it out.

Icarus can interpret most Verilog RTL designs, though.
However, at the time I left, I had already mandated that new code should use 
`always_comb` and `always_ff` (previously I had mandated that new code should 
use `always @*` for combinational logic) and was encouraging my subordinates to 
use loops and `generate`.
Icarus did not support `always_comb` and `always_ff` at the time (though it 
worked perfectly fine with loops and `generate`).
In addition, at the time, we (actually just me in that company haha) were 
dabbling in object-oriented testing methodologies (which Icarus has no plans on 
ever implementing, which is understandable since it is a massive increase in 
complexity, it is much much harder than the scheduling shenanigans of 
`always_comb` and the "just treat it as `always`" of `always_ff`).

(Particularly, you need object-oriented testbenches since SystemVerilog 
includes a fuzz-testing framework to randomize fields of objects according to 
certain engineer-provided constraints, and then you would use those object 
fields to derive the test vectors your test framework would feed into the DUT, 
this was a massive increase in code coverage for a largish up-front cost but 
once you built the test framework you could just dump various constraints on 
your test specification objects, I actually caught a few bugs that we would not 
have otherwise found with our previous checklist-based testing methodology.)
(Unfortunately it turned out that it required a more expensive license and I 
ended up hogging the only one we had of that more expensive license (which, if 
I remember correctly, was the same license needed for formal verification of 
netlist<->RTL equivalence) for this, which killed enthusiasm for this 
technique, sigh, this is another argument for getting open-source hardware 
design tools developed; not much sense in having open-source RTL for a crypto 
device if you have to pay through the nose for a license just to synthesize it, 
never mind the manufacturing cost.)
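
To give a flavor of the constrained-random approach described above, here is a 
minimal Python analogue (not SystemVerilog, and not the actual framework we 
used). The `PacketSpec` class, its fields, and the parity scheme are all 
hypothetical; the point is only the flow of randomizing a specification object 
under constraints and then deriving the test vector from its fields.

```python
import random
from dataclasses import dataclass


@dataclass
class PacketSpec:
    """Hypothetical test specification object with two randomized fields."""
    length: int          # number of payload bytes
    parity_error: bool   # whether to deliberately corrupt the parity byte


def randomize(rng, max_length=16, error_rate=0.25):
    """Pick field values that satisfy the engineer-provided constraints."""
    return PacketSpec(
        length=rng.randint(1, max_length),        # 1 <= length <= max_length
        parity_error=rng.random() < error_rate,   # error with probability error_rate
    )


def to_test_vector(spec, rng):
    """Derive the bytes that the test framework would feed into the DUT."""
    payload = [rng.getrandbits(8) for _ in range(spec.length)]
    parity = 0
    for byte in payload:
        parity ^= byte
    if spec.parity_error:
        parity ^= 0x01   # flip one bit so that the DUT should flag an error
    return payload + [parity]


rng = random.Random(2021)   # fixed seed so that a failing run can be reproduced
specs = [randomize(rng) for _ in range(100)]
vectors = [to_test_vector(spec, rng) for spec in specs]
```

Changing what gets tested is then a matter of changing the constraints 
(`max_length`, `error_rate`) rather than writing new vectors by hand.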


-----------------------


Another point to ponder is test modes.

In mass production you **need** test modes.
There will always be some number of manufacturing defects because even the 
cleanest of cleanrooms *will* have a tiny amount of contaminants (what can go 
wrong will go wrong).
Test modes are run in manufacturing to filter out chips with failing circuitry 
due to contamination.

Now, a typical way of implementing test modes is to have a special command sent 
over, say, the "normal" serial port interface of a chip, which then enters 
various test modes to allow, say, scan chain testing.
Of course, scan chain testing is done by pushing test vectors into all 
flip-flops, and then the test is validated by pulsing global clock once (often 
the test mode forces all flip-flops on the same clock), then pulling data from 
all flip-flops to verify that all the circuitry works as designed.

The "pulling data from all flip-flops" is of course just another way of saying 
that all mass-produced chips have a way of letting ***anyone*** exfiltrate data 
from their flip-flops via test modes.
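
A minimal behavioral model of this, in Python, assuming a single scan chain 
where every flip-flop feeds the next one in scan mode (the `ScanChain` class 
and the `invert_all` logic are hypothetical stand-ins, not any real tool's 
output). It shows both the intended use (load a vector, pulse the clock once, 
unload and compare) and the problem: unloading reads out whatever the 
flip-flops happened to hold.

```python
class ScanChain:
    """Behavioral model of flip-flops stitched into one scan shift register."""

    def __init__(self, length):
        self.flops = [0] * length

    def shift(self, scan_in):
        """One scan clock: push a bit in at the head, return the bit leaving the tail."""
        scan_out = self.flops[-1]
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out

    def load(self, bits):
        """Shift in a full-length test vector so that flops == bits afterwards."""
        for bit in reversed(bits):
            self.shift(bit)

    def capture(self, logic):
        """One functional clock: flops take the outputs of the combinational logic."""
        self.flops = logic(self.flops)

    def unload(self):
        """Shift everything out (zeros go in); returns the previous flop contents."""
        out = [self.shift(0) for _ in range(len(self.flops))]
        return out[::-1]


def invert_all(bits):
    """Stand-in for the combinational logic between flip-flops."""
    return [bit ^ 1 for bit in bits]


# Manufacturing test: load, pulse the clock once, unload and compare.
chain = ScanChain(4)
chain.load([1, 0, 1, 1])
chain.capture(invert_all)
observed = chain.unload()

# Exfiltration: the same unload reads whatever normal operation left behind.
chain.flops = [1, 1, 0, 1]   # pretend these are key bits
leaked = chain.unload()
```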

Thus, for a secure environment, you need to ensure that test modes cannot be 
entered after the device enters normal operation.
For example, you might have a dedicated pad which is normally pulled-down, but 
if at reset it is pulled up, the device enters test mode.
If at reset the pad is pulled down, the device is in normal mode and even if 
the pad is pulled up afterwards the device will not enter test mode.
This ensures that only reset data can be read from the device, without 
possibility of exfiltration of sensitive (key material or midstate) data.
The pad should also not be exposed as a package pinout except perhaps on DS and 
ES packages, and the pulldown resistor has to be on-chip.

As an additional precaution, we can also create a small secure memory (maybe 
256 addressable octets, which would be more than enough).
It is possible to exempt flip-flops from scan chain generation (usually by 
explicitly instantiating flip-flops in a separate module and telling 
post-synthesis tools to exempt the module from scan chain synthesis).
This gives an extra layer of protection against test mode accessing sensitive 
data; even if we manage to screw up test mode and it is possible to force reset 
on the test mode circuit without resetting the rest of the design, sensitive 
data is still out of the scan chain.
Of course, since they are not on scan, it is possible they have undetectable 
manufacturing defects, so you would need to use some kind of ECC, or better 
triple-redundancy best-of-three, to protect against manufacturing defects on 
the non-scan flip-flops.
Fortunately non-scan flip-flops are often a good bit smaller than scan 
flip-flops, so the redundancy is not so onerous.
Since the ECC / best-of-three circuit itself would need to be tested, you would 
multiplex their inputs, in normal mode they get inputs from the non-scan-chain 
flip-flops, in test mode they get inputs from separate scan-chain flip-flops, 
so that the ECC / best-of-three circuit is testable at scan mode.
You would also need a separate test of the secure memory, this time running in 
normal mode with a special test program in the CPU, just in case.
Finally, you would explicitly lay them out "distributed" around the chip, since 
manufacturing defects tend to correlate in space (they are usually from dust, 
and dust particles can be large relative to cell size), you do not want all 
three of the best-of-three to have manufacturing defects.
For example, you could have a 256 x 8 non-scan-chain flip-flop module, 
instantiate three of those, and explicitly place them in corners of the digital 
area, then use a best-of-three circuit to resolve the "correct" value.
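
The best-of-three circuit is just a bitwise majority vote. A small Python model 
of the idea, assuming three 256 x 8 copies (the helper names `write_secure` and 
`read_secure` are hypothetical):

```python
def best_of_three(a, b, c):
    """Bitwise majority vote: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


DEPTH = 256
copies = [[0] * DEPTH for _ in range(3)]   # three physically separate modules


def write_secure(addr, value):
    """Writes go to all three copies."""
    for copy in copies:
        copy[addr] = value & 0xFF


def read_secure(addr):
    """Reads resolve the "correct" value from the three copies."""
    return best_of_three(copies[0][addr], copies[1][addr], copies[2][addr])


write_secure(0x10, 0xC3)
copies[1][0x10] = 0xFF        # simulate manufacturing defects in one copy
value = read_secure(0x10)     # the two good copies outvote the bad one
```

This only tolerates defects confined to one copy per bit position, which is why 
the physical separation of the three modules matters.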

The test mode circuit itself could ensure that the device enters test mode if 
and only if the secure memory contains all 0 data after the test mode circuit 
is reset.
For example, the 256 x 8 non-scan-chain flip-flop module could have a large OR 
circuit that ORs all the flip-flops, then outputs a single `in_use` bit that is 
the OR of all the flip-flop contents.
Then the test mode circuit gets the `in_use` outputs of the three secure 
flip-flop modules, and if at reset any of them are `1` then it will refuse to 
enter test mode even if the test mode pad is pulled high.
This ensures that even if an attacker is somehow able to reset *only* the test 
mode circuit (this is basic engineering, always assume something will go 
wrong), if the secure memory has any non-0 data (we presume it resets to 0), 
the device will still not enter test mode.
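
In model form (Python, with hypothetical function names), the gating condition 
is a reduction OR per copy followed by a check that none of the three flags is 
set:

```python
def in_use(memory):
    """Reduction OR: 1 if any bit of any stored octet is set, else 0."""
    flag = 0
    for octet in memory:
        flag |= int(octet != 0)
    return flag


def may_enter_test_mode(testmode_pad, copies):
    """Test mode needs the pad pulled high and all three copies still all-0."""
    return bool(testmode_pad) and not any(in_use(copy) for copy in copies)


blank = [[0] * 256 for _ in range(3)]
provisioned = [[0] * 256 for _ in range(3)]
provisioned[2][0] = 0x01   # a single non-0 octet in a single copy is enough
```

Note that the three flags are deliberately ORed rather than voted on: for this 
check, a false "in use" is the safe failure.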

Of course, if the secure memory itself is accessible from the CPU, then it 
remains possible that a CPU program is reading from the secure area and keeping 
raw data in CPU registers, from which test mode might be able to extract it if 
the device is somehow forced into test mode even after entering normal mode.
You could redesign your implementations of field multiplication and SHA 
midstate computation so that they directly read from the secure memory and 
write to the secure memory without using any flip-flops along the way, and have 
only the cryptographic circuit have access to the secure memory.
That way there is reduced possibility that intermediate flip-flops (that are 
part of the scan chain) outside the secure memory hold sensitive key material 
or midstate data.
You would need to use a custom bus with separate read and write addresses, and 
non-pipelined unbuffered access, and since you want to distribute your secure 
memory physically distant, that translates to wide and long buses (it might be 
better to use 64 x 32 or 32 x 64 addressable memories, to increase what the 
cryptographic circuit has access to per clock cycle) screwing with your layout, 
and probably having to run the secure memory + crypto circuit at a ***much*** 
slower clock domain (but more secure is a good tradeoff for slowness).
Of course, that is a major design headache (the crypto circuit has to act 
mostly as a reduced-functionality processor), so you might just want to have 
the CPU directly access the secure memory and in early boot poke a `0x01` in 
some part of the memory, in the hope that the `in_use` flag in the previous 
paragraph is enough to suppress test modes from exfiltrating CPU registers.

Do note that with enough power-cycles and ESD noise you can put digital 
circuitry into really weird and unexpected states (seen it happen, though 
fairly hard to replicate, we had an ESD gun you could point at a chip to make 
it go into weird states), so being extra paranoid about test modes is important.
What can go wrong will go wrong!
In particular with "`TESTMODE_PAD` is only checked at reset" you would have to 
store `TESTMODE` in a non-scan flip-flop, and with enough targeted ESD that 
flip-flop can be jostled, setting `TESTMODE` even after normal operation.
You might instead want to use, say, a byte pattern instead of a single bit to 
represent `TESTMODE`, so the `TESTMODE` register has to have a specific value 
such as `0xA5`, so that targeted ESD has to be very lucky in order to force 
your device into test mode.
For example, since you need to check the `TESTMODE` pad at reset anyway, you 
could do something like this:

    input CLK, RESET_N, TESTMODE_PAD, IN_USE0, IN_USE1, IN_USE2;
    output reg TESTMODE;

    // Any secure-memory copy holding non-0 data blocks test mode.
    wire in_use = IN_USE0 || IN_USE1 || IN_USE2;

    // 8'h00 = just reset, 8'hA5 = test mode, anything else = locked out.
    reg [7:0] testmode_ff;
    wire [7:0] next_testmode_ff =
        (testmode_ff == 8'hA5 || testmode_ff == 8'h00) ?
          (TESTMODE_PAD && !in_use) ?                      8'hA5 :
          /*otherwise*/                                    8'h5A :
        /*otherwise*/                                      testmode_ff ;
    always_ff @(posedge CLK, negedge RESET_N) begin
        if (!RESET_N) testmode_ff <= 8'h00;
        else          testmode_ff <= next_testmode_ff; end

    // Registered so that TESTMODE never glitches.
    wire next_TESTMODE = (testmode_ff == 8'hA5);
    always_ff @(posedge CLK, negedge RESET_N) begin
        if (!RESET_N) TESTMODE <= 1'b0;
        else          TESTMODE <= next_TESTMODE; end

Do note that the `TESTMODE` is a flip-flop, since you do ***not*** want 
glitches on the `TESTMODE` signal line, it would be horribly unsafe to output 
it from combinational circuitry directly, please do not do that.
Of course that flip-flop can instead be the target of ESD gunnery, but since 
you need many clock pulses to read the scan chain, it should with good 
probability also get set to `0` on the next clock pulse and leave test mode 
(and probably crash the device as well until full reset, but this "fails safe" 
since at least sensitive data cannot be extracted).
`TESTMODE` has no feedback, thus cannot be stuck in a state loop.
`testmode_ff` *can* be stuck in a state loop, but that is deliberate, as it 
would "fail safe" if it gets a value other than `0xA5`, it would not enter test 
mode (and if it enters `0xA5` it can easily leave test mode by either 
`TESTMODE_PAD` or `in_use`).
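
For those who would rather poke at the behavior than read Verilog, here is a 
cycle-level Python model of the same two registers. It is a sketch under the 
assumption that both registers update from their pre-edge values on each clock 
(as nonblocking assignments do); the `ModeCircuit` class and its method names 
are hypothetical.

```python
MAGIC_TEST = 0xA5     # the only testmode_ff value that enables test mode
MAGIC_NORMAL = 0x5A   # locked-out value, the bitwise complement of 0xA5


class ModeCircuit:
    """Cycle-level model of the testmode_ff and TESTMODE registers."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.testmode_ff = 0x00
        self.testmode = 0

    def clock(self, testmode_pad, in_use):
        # Compute both next values from the pre-edge state, then update together.
        next_testmode = int(self.testmode_ff == MAGIC_TEST)
        if self.testmode_ff in (MAGIC_TEST, 0x00):
            allowed = bool(testmode_pad) and not in_use
            next_ff = MAGIC_TEST if allowed else MAGIC_NORMAL
        else:
            next_ff = self.testmode_ff   # any other value is held: locked out
        self.testmode_ff = next_ff
        self.testmode = next_testmode


# Normal boot: pad low at reset, then pulled high afterwards.
dev = ModeCircuit()
dev.clock(testmode_pad=0, in_use=0)
for _ in range(3):
    dev.clock(testmode_pad=1, in_use=0)
late_pad_result = dev.testmode            # still 0

# ESD upset on the TESTMODE flip-flop itself during normal operation.
dev.testmode = 1
dev.clock(testmode_pad=0, in_use=0)
after_upset = dev.testmode                # back to 0 on the next clock

# Legitimate test boot: pad high at reset, secure memory blank.
tester = ModeCircuit()
tester.clock(testmode_pad=1, in_use=0)
tester.clock(testmode_pad=1, in_use=0)
test_boot_result = tester.testmode        # 1

# Number of bits an upset must flip to turn the locked-out value into 0xA5.
bits_to_flip = bin(MAGIC_NORMAL ^ MAGIC_TEST).count("1")
```

Since `0x5A` is the complement of `0xA5`, an upset has to flip all 8 bits of 
`testmode_ff` at once to get from the locked-out value into test mode.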

(Sure, an attacker can try targeted ESD at the `TESTMODE` flip-flop repeatedly, 
but this risks also flipping other scan flip-flops that contain the data that 
is being extracted, so this might be sufficient protection in practice.)

If you are really going to open-source the hardware design then the layout is 
also open and attackers can probably target specific chip area for ESD pulse to 
try a flip-flop upset, so you need to be extra careful.
Note as well that even closed-source "secure" elements can be 
reverse-engineered (I used to do this in the IC design job as a junior 
engineer, it was the sort of shitty brain-numbing work forced on new hires), so 
security-by-obscurity does have a limit as well, it should be possible to try 
to figure out the testmode circuitry on "secure" elements and try to get 
targeted ESD upsets at flip-flops on the testmode circuit.

Test mode design is something of an arcane art, especially if you are trying to 
build a security device, on the one hand you need to ensure you deliver devices 
without manufacturing defects, on the other hand you need to ensure that the 
test mode is not entered inadvertently by strange conditions.

In general, because test modes are such a pain to deal with securely, and are 
an absolute necessity for mass production, you should assume that any "secure" 
chip can be broken by physical access and shooting short-range ESD pulses at it 
to try to get it into some test mode, unless it is openly designed to prevent 
test mode from persisting after entering normal mode, as above.

(No idea how that ESD gun thing worked or what it was formally called, we just 
called it the ESD gun, it was an amusing toy, you point it at the DUT and pull 
the trigger and suddenly it would switch modes, this of course was a bad thing 
since you want to make sure that as much as possible such upsets do not cause 
the chip to enter an irrecoverable mode but an amusing thing to do still, we 
even had small amounts of flash memory containing register settings that we 
would load into the settings registers periodically at the end of each display 
frame to protect against this kind of ESD gun thing since the flip-flops 
backing the settings registers were vulnerable to it and we needed a way to 
preserve the settings of the customer for the IC, the expected effect would be 
to cause the display to flicker.)

Regards,
ZmnSCPxj
