Good morning Luke,

> > Another point to ponder is test modes.
> > In mass production you need test modes.
>
> > (Sure, an attacker can try targeted ESD at the `TESTMODE` flip-flop 
> > repeatedly, but this risks also flipping other scan flip-flops that contain 
> > the data that is being extracted, so this might be sufficient protection in 
> > practice.)
>
> if however the ASIC can be flipped into TESTMODE and yet it carries on
> otherwise working, an algorithm can be re-run and the exposed data
> will be clean.

But in most testmodes I have seen (and designed), all clocks are driven 
externally from a different pin (usually the serial interface).
If the CPU clock is now controlled by the attacker, how do you run any kind 
of algorithm?

(This could be an artifact of how my old design company designed testmodes, 
YMMV.)
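
To illustrate (a minimal sketch with hypothetical signal names, not any 
particular company's scheme): the testmode signal typically muxes the root of 
the clock tree over to an externally-driven pin, so once testmode is asserted 
the CPU only advances when the attacker pulses that pin.

```
// Minimal sketch of a testmode clock mux (hypothetical names).
// Ignoring glitch-free switchover for brevity; a real design would use
// a glitchless clock mux here.
module test_clk_mux (
    input  wire sys_clk,   // free-running functional clock
    input  wire test_clk,  // driven externally, e.g. via the serial pin
    input  wire testmode,  // output of the TESTMODE flip-flop
    output wire core_clk   // what the CPU and all registers actually see
);
    assign core_clk = testmode ? test_clk : sys_clk;
endmodule
```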

Really the concern here is that testmode is entered while the CPU has key 
material loaded into registers or caches; if those registers/caches are in 
the scan chain, it is then possible to exfiltrate the data.
It does not matter if the chip is now in a mode that cannot execute 
algorithms: if it was doing any kind of computation involving privkeys 
(including, say, deriving its public key so that PC-side hardware can get 
the `xpub`), then key material may be sitting in scan chain registers, the 
clock is now controlled by the attacker, and possibly scan mode is asserted 
as well (which bypasses the combinational circuitry, so none of your 
algorithms can run).
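
For concreteness, here is the generic textbook mux-D scan flip-flop (a sketch 
with hypothetical names, not any specific cell library). When scan enable is 
asserted, the data computed by the combinational logic is ignored and the 
flip-flop instead captures the previous flip-flop's output, turning the whole 
chain into one long, externally-clocked shift register, key bits included.

```
// Generic mux-D scan flip-flop (textbook structure, hypothetical names).
// With scan_en high, the functional D input is bypassed and the flop
// shifts scan_in -> q; wiring q of one flop to scan_in of the next lets
// whoever controls the clock shift out every register's contents.
module scan_dff (
    input  wire clk,       // externally controlled in testmode
    input  wire scan_en,   // asserted in scan/test mode
    input  wire scan_in,   // from the previous flop in the chain
    input  wire d,         // normal data from combinational logic
    output reg  q          // also feeds scan_in of the next flop
);
    always @(posedge clk)
        q <= scan_en ? scan_in : d;
endmodule
```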

>
> > If you are really going to open-source the hardware design then the layout
> > is also open and attackers can probably target specific chip area for ESD
> > pulse to try a flip-flop upset, so you need to be extra careful.
>
> this is extremely valuable advice. in the followup [1] you describe a
> gating method: this we have already deployed on a couple of places in
> case the Libre Cell Library (also being developed at the same time by
> Staf Verhaegen of Chips4Makers) causes errors: we do not want, for
> example, an error in a Cell Library to cause a permanent HI which
> locks us from being able to perform testing of other areas of the
> ASIC.
>
> the idea of being able to actually randomly flip bits inside an ASIC
> from outside is both hilarious and entirely news to me, yet it sounds
> to be exactly the kind of thing that would allow an attacker to
> compromise a hardware wallet. potentially destructively, mind, but
> compromise all the same.

Certainly, outside of the old company I have seen many experts strongly 
protest against any design philosophy that assumes a flip-flop could 
randomly switch.

Yet the design philosophy within the old company always had this assumption, 
supposedly (according to in-company lore) because previous engineers had 
learned the hard way that random bitflips did occur, and for, e.g., 
automotive chips the risk was too great not to have strong mitigations:

* State machines had to force unused states into known states (see the 
  sketch after this list).
  For example, a state machine with 3 states needs 2 bits of state, but 2 
  bits can encode 4 states, so there is a 4th, unused state.
  * Not all state machines needed this rule, but during planning we had to 
    identify the state machines that did, and often we simply targeted 
    having 2^n states to ensure there were no unused states.
  * I even suggested the use of ECC encoding for important state machines, 
    and it was being investigated at the time I left.
* State machines that otherwise did not need the above rule were strongly 
  encouraged to clear their state at display frame vsync.
  This ensured that any unexpected state would last at most one display 
  frame, which was considered acceptable.
* Flip-flops that held settings were periodically reloaded, at each display 
  frame vsync, from a flash memory (which apparently was a lot more immune 
  to bitflips); the sketch below shows this as well.
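
As a concrete illustration of the first and third rules above, here is a 
minimal sketch (hypothetical names, not the old company's actual RTL). The 
3-state machine occupies 2 bits, so the 4th encoding is unreachable by 
design but reachable by a bitflip; the default arm forces it back to a known 
state. The settings register is refreshed from flash-backed data every 
vsync, bounding how long a flipped settings bit can survive.

```
// Sketch of bitflip-tolerant state and settings registers.
module upset_tolerant (
    input  wire       clk,
    input  wire       rst_n,
    input  wire       start,
    input  wire       done,
    input  wire       vsync,
    input  wire [7:0] settings_from_flash, // assumed already read from flash
    output reg  [7:0] settings,
    output reg  [1:0] state
);
    localparam [1:0] IDLE = 2'd0,
                     RUN  = 2'd1,
                     WAIT = 2'd2;  // 2'd3 is the unused 4th encoding

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            state <= IDLE;
        else case (state)
            IDLE:    state <= start ? RUN : IDLE;
            RUN:     state <= WAIT;
            WAIT:    state <= done ? IDLE : WAIT;
            default: state <= IDLE;  // a bitflip into 2'd3 recovers here
        endcase
    end

    // Settings flops are rewritten from flash-backed data at every vsync,
    // so an upset in `settings` survives at most one display frame.
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n)
            settings <= 8'd0;
        else if (vsync)
            settings <= settings_from_flash;
    end
endmodule
```

One gotcha, if I recall correctly: some synthesis tools will prove the 4th 
state unreachable and optimize the default arm away, so you may need the 
tool's "safe" FSM option (or equivalent) to actually keep the recovery logic.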

It could also be an artifact of the company having its own in-house foundry 
rather than delegating out to TSMC or whatnot --- maybe the technology we 
had was just suckier than state-of-the-art, so bitflips were more common.

The reason this stuck in my mind is that at one time we had a DS test where 
shooting the chip with the ESD gun could sometimes cause it to fail (blank 
display) until reset, when the expectation was that it would at most flicker 
for one display frame.
And afterwards we had to go through the entire RTL looking for which state 
machine or settings register was the culprit.
I even wrote a little Verilog-PLI plugin that would inject deterministic 
pseudo-random data into flip-flops in the simulation model to try to catch 
it.
Eventually we found a bunch of possible root causes, and during testing of 
the next DS iteration we had fun shooting the chip with the ESD gun over and 
over again and sighing in relief that the display never failed for more than 
one frame.
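
That plugin is not something I can reconstruct here, but below is a simpler 
testbench-level sketch of the same idea using hierarchical force/release on 
the hypothetical `upset_tolerant` module sketched earlier. The fixed seed 
keeps the injected upsets deterministic, so any failure found can be 
replayed exactly.

```
// Testbench-level bitflip injection sketch (not the original PLI plugin).
module tb_bitflip;
    reg clk = 0, rst_n = 0, start = 0, done = 0, vsync = 0;
    reg [1:0] flipped;
    integer seed = 32'h1234_5678;  // fixed seed => fully reproducible run

    upset_tolerant dut (
        .clk(clk), .rst_n(rst_n), .start(start), .done(done),
        .vsync(vsync), .settings_from_flash(8'hA5),
        .settings(), .state()
    );

    always #5 clk = ~clk;

    // Crude vsync: one pulse every 1000 clocks, standing in for a frame.
    always begin
        repeat (999) @(posedge clk);
        vsync <= 1;
        @(posedge clk);
        vsync <= 0;
    end

    initial #200_000 $finish;  // bounded simulation time for the sketch

    initial begin
        repeat (4) @(posedge clk);
        rst_n = 1;
        forever begin
            // Wait a pseudo-random number of cycles between upsets.
            repeat ($dist_uniform(seed, 50, 500)) @(posedge clk);
            // Flip one pseudo-randomly chosen state bit for one cycle.
            flipped = dut.state ^ (2'b01 << $dist_uniform(seed, 0, 1));
            force dut.state = flipped;
            @(posedge clk);
            release dut.state;
            // A checker (not shown) would assert recovery within a frame.
        end
    end
endmodule
```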

The chip was a display driver for automotive use; apparently, at the time, 
cars were starting to transition to using LCDs for things like the 
speedometer and tachometer rather than physical dials.
And of course the display suddenly switching off while the car is running at 
high speed, due to some extra-powerful pulse elsewhere, was potentially 
dangerous and could distract the driver; that is why we were paranoid about 
sudden bitflips leading to a cascade of upsets massive enough to make the 
display fail permanently.

I think being excessively cautious should be standard for cryptographic 
chips as well.
And certainly testmode exfiltration of data is always an issue; JTAG is a 
very standard way of reading memory.

Regards,
ZmnSCPxj