Re: Suspend-to-disk doesn't work anymore

2017-05-23 Thread Sebastien Marie
On Mon, May 22, 2017 at 07:32:07PM +0200, Mike Belopuhov wrote:
> > 
> > It makes it to confirm the problem is related to the switch to
> > constant-time-aes in the context of full-disk-encryption.
> >
> 
> Thanks for verifying this.  I've looked through the sr_hibernate_io
> (that's hib->io_func) but couldn't find anything wrong with it. The
> only thing that springs to mind is that AES_CTX and therefore the
> XTS context (aes_xts_ctx) is larger and requires more stack space.
> Though I can't see what might be affected by that.
> 

While looking at code source and `struct aes_xts_ctx` definition (in
order to add the padding stuff), I see the following:


1. `struct aes_xts_ctx` and aes_xts_* funtions are defined in
crypto/xform.c and used by regular kernel code.

with the constant-time AES (so the faulting one, before reverting), the
code was using AES_CTX and AES_*() functions.

crypto/xform.c:
   123  struct aes_xts_ctx {
   124  AES_CTX key1;
   125  AES_CTX key2;
   126  u_int8_t tweak[AES_XTS_BLOCKSIZE];
   127  };

and later:
   480  void
   481  aes_xts_reinit(caddr_t key, u_int8_t *iv)
   482  {
...
   496  /* Last 64 bits of IV are always zero */
   497  bzero(ctx->tweak + AES_XTS_IVSIZE, AES_XTS_IVSIZE);
   498
   499  AES_Encrypt(>key2, ctx->tweak, ctx->tweak);
   500  }



2. `struct aes_xts_ctx` and functions are *also* defined in
lib/libsa/aes_xts.h and used by bootloader code. They are using
rijndael code.

Please note that the switch to AES_CTX does *not* touch them. There were
still use rijndael code and struct, while the kernel was using AES_CTX
version.

lib/libsa/aes_xts.h:
27  struct aes_xts_ctx {
28  rijndael_ctx key1;
29  rijndael_ctx key2;
30  u_int8_t tweak[AES_XTS_BLOCKSIZE];
31  };

lib/libsa/aes_xts.c:
25  void
26  aes_xts_reinit(struct aes_xts_ctx *ctx, u_int8_t *iv)
27  {
...
40  /* Last 64 bits of IV are always zero */
41  bzero(ctx->tweak + AES_XTS_IVSIZE, AES_XTS_IVSIZE);
42
43  rijndael_encrypt(>key2, ctx->tweak, ctx->tweak);
44  }



3. in dev/softraid.c, the hibernate code sr_hibernate_io() seems to use
to bootloader definition of `struct aes_xts_ctx`:

dev/softraid.c:
53  #ifdef HIBERNATE
54  #include 
55  #include 
56  #include 
57  #endif /* HIBERNATE */

but as it is a regular kernel code and not bootloader code, I expect the
linkage of aes_xts_*() functions to be done using kernel code (so
AES_* version).

The structure used was rijndael_ctx based (sizeof=992), and functions
used was AES_Encrypt() based. As AES_* functions operated on larger
struct (sizeof=1464), it was resulting some garbage and stack corruption
inside sr_hibernate_io().



As we move back to rijndael code for AES_XTS, bootloader code and kernel
code come back in sync, and sr_hibernate_io() uses right code now.

I see several possibles things to do to avoid future mistakes:

- in dev/softraid.c : doesn't use libsa code but standard kernel code.
  But I dunno if it would be compatible or not. It depends if we share
  some elements with bootloader or not.

- in crypto/xform.c and lib/libsa/aes_xts.h : add a comment to recall
  that the struct should be keep in sync together.



Does this analyze makes sens ?
-- 
Sebastien Marie



Re: Suspend-to-disk doesn't work anymore

2017-05-22 Thread Mike Larkin
On Mon, May 22, 2017 at 09:39:57PM +0200, Mike Belopuhov wrote:
> On Mon, May 22, 2017 at 12:34 -0700, Mike Larkin wrote:
> > On Mon, May 22, 2017 at 08:55:31PM +0200, Mike Belopuhov wrote:
> > > On Mon, May 22, 2017 at 20:20 +0200, Sebastien Marie wrote:
> > > > On Mon, May 22, 2017 at 07:32:07PM +0200, Mike Belopuhov wrote:
> > > > > > 
> > > > > > With the last commit to revert AES_XTS to rijndael, I pushed it on
> > > > > > top of the tested tree (7 days old). The hibernate/resume works 
> > > > > > again.
> > > > > > 
> > > > > > It makes it to confirm the problem is related to the switch to
> > > > > > constant-time-aes in the context of full-disk-encryption.
> > > > > >
> > > > > 
> > > > > Thanks for verifying this.  I've looked through the sr_hibernate_io
> > > > > (that's hib->io_func) but couldn't find anything wrong with it. The
> > > > > only thing that springs to mind is that AES_CTX and therefore the
> > > > > XTS context (aes_xts_ctx) is larger and requires more stack space.
> > > > > Though I can't see what might be affected by that.
> > > > 
> > > > I quickly check the difference of struct size and come back with some
> > > > idea to test.
> > > > 
> > > > - aes_xts_ctx based on AES_CTX  has sizeof()=1464
> > > > - aes_xts_ctx based on rijndael_ctx has sizeof()=992
> > > > 
> > > > 
> > > > do you think just adding padding of 472 (1464-992) at end of `struct
> > > > aes_xts_ctx' (rijndael_ctx version) would be enough to test it ?
> > > > 
> > > > Something like:
> > > > 
> > > > diff --git a/sys/crypto/xform.c b/sys/crypto/xform.c
> > > > index 71e173b44..9775d030c 100644
> > > > --- a/sys/crypto/xform.c
> > > > +++ b/sys/crypto/xform.c
> > > > @@ -125,6 +125,7 @@ struct aes_xts_ctx {
> > > > rijndael_ctx key1;
> > > > rijndael_ctx key2;
> > > > u_int8_t tweak[AES_XTS_BLOCKSIZE];
> > > > +   u_int8_t _pad[472];
> > > >  };
> > > > 
> > > >  /* Helper */
> > > >
> > > 
> > > Good question.  I would try both in the end of the structure and in
> > > the beginning, in attempt to at least rule out stack corruption.
> > > 
> > > > 
> > > > if the hibernate/resume process is able to complete with the padded
> > > > version, we could exclude the struct size of the equation. and if the
> > > > process fail, it means the size matters.
> > > >
> > > 
> > > I would agree with this.
> > > 
> > > > any additional thing to do for testing padding (initialisation or
> > > > something else) ? you know better than me this subsystem :)
> > > >
> > > 
> > > Nothing springs to mind, but I didn't have time to investigate
> > > any further yet.
> > > 
> > 
> > Is the new AES implementation side effect free? It cannot touch any memory
> > outside the scratch page used by the sr crypto suspend routine. 
> > 
> > -ml
> 
> It doesn't touch any memory except for its context which is on the stack.
> If this stack must fit into the scratch page, I don't see any indication
> of enforcing it or a comment explaining it.  aes_xts_ctx is not part of
> the "*my = page;" structure.
> 

A private stack is in use while hibernate is occurring, so as long as the stack
doesn't grow past 3 pages it should be ok. The *my = page is for any extra
working area the I/O routines need.

-ml



Re: Suspend-to-disk doesn't work anymore

2017-05-22 Thread Mike Belopuhov
On Mon, May 22, 2017 at 12:34 -0700, Mike Larkin wrote:
> On Mon, May 22, 2017 at 08:55:31PM +0200, Mike Belopuhov wrote:
> > On Mon, May 22, 2017 at 20:20 +0200, Sebastien Marie wrote:
> > > On Mon, May 22, 2017 at 07:32:07PM +0200, Mike Belopuhov wrote:
> > > > > 
> > > > > With the last commit to revert AES_XTS to rijndael, I pushed it on
> > > > > top of the tested tree (7 days old). The hibernate/resume works again.
> > > > > 
> > > > > It makes it to confirm the problem is related to the switch to
> > > > > constant-time-aes in the context of full-disk-encryption.
> > > > >
> > > > 
> > > > Thanks for verifying this.  I've looked through the sr_hibernate_io
> > > > (that's hib->io_func) but couldn't find anything wrong with it. The
> > > > only thing that springs to mind is that AES_CTX and therefore the
> > > > XTS context (aes_xts_ctx) is larger and requires more stack space.
> > > > Though I can't see what might be affected by that.
> > > 
> > > I quickly check the difference of struct size and come back with some
> > > idea to test.
> > > 
> > > - aes_xts_ctx based on AES_CTX  has sizeof()=1464
> > > - aes_xts_ctx based on rijndael_ctx has sizeof()=992
> > > 
> > > 
> > > do you think just adding padding of 472 (1464-992) at end of `struct
> > > aes_xts_ctx' (rijndael_ctx version) would be enough to test it ?
> > > 
> > > Something like:
> > > 
> > > diff --git a/sys/crypto/xform.c b/sys/crypto/xform.c
> > > index 71e173b44..9775d030c 100644
> > > --- a/sys/crypto/xform.c
> > > +++ b/sys/crypto/xform.c
> > > @@ -125,6 +125,7 @@ struct aes_xts_ctx {
> > > rijndael_ctx key1;
> > > rijndael_ctx key2;
> > > u_int8_t tweak[AES_XTS_BLOCKSIZE];
> > > +   u_int8_t _pad[472];
> > >  };
> > > 
> > >  /* Helper */
> > >
> > 
> > Good question.  I would try both in the end of the structure and in
> > the beginning, in attempt to at least rule out stack corruption.
> > 
> > > 
> > > if the hibernate/resume process is able to complete with the padded
> > > version, we could exclude the struct size of the equation. and if the
> > > process fail, it means the size matters.
> > >
> > 
> > I would agree with this.
> > 
> > > any additional thing to do for testing padding (initialisation or
> > > something else) ? you know better than me this subsystem :)
> > >
> > 
> > Nothing springs to mind, but I didn't have time to investigate
> > any further yet.
> > 
> 
> Is the new AES implementation side effect free? It cannot touch any memory
> outside the scratch page used by the sr crypto suspend routine. 
> 
> -ml

It doesn't touch any memory except for its context which is on the stack.
If this stack must fit into the scratch page, I don't see any indication
of enforcing it or a comment explaining it.  aes_xts_ctx is not part of
the "*my = page;" structure.



Re: Suspend-to-disk doesn't work anymore

2017-05-22 Thread Mike Larkin
On Mon, May 22, 2017 at 08:55:31PM +0200, Mike Belopuhov wrote:
> On Mon, May 22, 2017 at 20:20 +0200, Sebastien Marie wrote:
> > On Mon, May 22, 2017 at 07:32:07PM +0200, Mike Belopuhov wrote:
> > > > 
> > > > With the last commit to revert AES_XTS to rijndael, I pushed it on
> > > > top of the tested tree (7 days old). The hibernate/resume works again.
> > > > 
> > > > It makes it to confirm the problem is related to the switch to
> > > > constant-time-aes in the context of full-disk-encryption.
> > > >
> > > 
> > > Thanks for verifying this.  I've looked through the sr_hibernate_io
> > > (that's hib->io_func) but couldn't find anything wrong with it. The
> > > only thing that springs to mind is that AES_CTX and therefore the
> > > XTS context (aes_xts_ctx) is larger and requires more stack space.
> > > Though I can't see what might be affected by that.
> > 
> > I quickly check the difference of struct size and come back with some
> > idea to test.
> > 
> > - aes_xts_ctx based on AES_CTX  has sizeof()=1464
> > - aes_xts_ctx based on rijndael_ctx has sizeof()=992
> > 
> > 
> > do you think just adding padding of 472 (1464-992) at end of `struct
> > aes_xts_ctx' (rijndael_ctx version) would be enough to test it ?
> > 
> > Something like:
> > 
> > diff --git a/sys/crypto/xform.c b/sys/crypto/xform.c
> > index 71e173b44..9775d030c 100644
> > --- a/sys/crypto/xform.c
> > +++ b/sys/crypto/xform.c
> > @@ -125,6 +125,7 @@ struct aes_xts_ctx {
> > rijndael_ctx key1;
> > rijndael_ctx key2;
> > u_int8_t tweak[AES_XTS_BLOCKSIZE];
> > +   u_int8_t _pad[472];
> >  };
> > 
> >  /* Helper */
> >
> 
> Good question.  I would try both in the end of the structure and in
> the beginning, in attempt to at least rule out stack corruption.
> 
> > 
> > if the hibernate/resume process is able to complete with the padded
> > version, we could exclude the struct size of the equation. and if the
> > process fail, it means the size matters.
> >
> 
> I would agree with this.
> 
> > any additional thing to do for testing padding (initialisation or
> > something else) ? you know better than me this subsystem :)
> >
> 
> Nothing springs to mind, but I didn't have time to investigate
> any further yet.
> 

Is the new AES implementation side effect free? It cannot touch any memory
outside the scratch page used by the sr crypto suspend routine. 

-ml



Re: Suspend-to-disk doesn't work anymore

2017-05-22 Thread Mike Belopuhov
On Mon, May 22, 2017 at 20:20 +0200, Sebastien Marie wrote:
> On Mon, May 22, 2017 at 07:32:07PM +0200, Mike Belopuhov wrote:
> > > 
> > > With the last commit to revert AES_XTS to rijndael, I pushed it on
> > > top of the tested tree (7 days old). The hibernate/resume works again.
> > > 
> > > It makes it to confirm the problem is related to the switch to
> > > constant-time-aes in the context of full-disk-encryption.
> > >
> > 
> > Thanks for verifying this.  I've looked through the sr_hibernate_io
> > (that's hib->io_func) but couldn't find anything wrong with it. The
> > only thing that springs to mind is that AES_CTX and therefore the
> > XTS context (aes_xts_ctx) is larger and requires more stack space.
> > Though I can't see what might be affected by that.
> 
> I quickly check the difference of struct size and come back with some
> idea to test.
> 
> - aes_xts_ctx based on AES_CTX  has sizeof()=1464
> - aes_xts_ctx based on rijndael_ctx has sizeof()=992
> 
> 
> do you think just adding padding of 472 (1464-992) at end of `struct
> aes_xts_ctx' (rijndael_ctx version) would be enough to test it ?
> 
> Something like:
> 
> diff --git a/sys/crypto/xform.c b/sys/crypto/xform.c
> index 71e173b44..9775d030c 100644
> --- a/sys/crypto/xform.c
> +++ b/sys/crypto/xform.c
> @@ -125,6 +125,7 @@ struct aes_xts_ctx {
> rijndael_ctx key1;
> rijndael_ctx key2;
> u_int8_t tweak[AES_XTS_BLOCKSIZE];
> +   u_int8_t _pad[472];
>  };
> 
>  /* Helper */
>

Good question.  I would try both in the end of the structure and in
the beginning, in attempt to at least rule out stack corruption.

> 
> if the hibernate/resume process is able to complete with the padded
> version, we could exclude the struct size of the equation. and if the
> process fail, it means the size matters.
>

I would agree with this.

> any additional thing to do for testing padding (initialisation or
> something else) ? you know better than me this subsystem :)
>

Nothing springs to mind, but I didn't have time to investigate
any further yet.



Re: Suspend-to-disk doesn't work anymore

2017-05-22 Thread Sebastien Marie
On Mon, May 22, 2017 at 07:32:07PM +0200, Mike Belopuhov wrote:
> > 
> > With the last commit to revert AES_XTS to rijndael, I pushed it on
> > top of the tested tree (7 days old). The hibernate/resume works again.
> > 
> > It makes it to confirm the problem is related to the switch to
> > constant-time-aes in the context of full-disk-encryption.
> >
> 
> Thanks for verifying this.  I've looked through the sr_hibernate_io
> (that's hib->io_func) but couldn't find anything wrong with it. The
> only thing that springs to mind is that AES_CTX and therefore the
> XTS context (aes_xts_ctx) is larger and requires more stack space.
> Though I can't see what might be affected by that.

I quickly check the difference of struct size and come back with some
idea to test.

- aes_xts_ctx based on AES_CTX  has sizeof()=1464
- aes_xts_ctx based on rijndael_ctx has sizeof()=992


do you think just adding padding of 472 (1464-992) at end of `struct
aes_xts_ctx' (rijndael_ctx version) would be enough to test it ?

Something like:

diff --git a/sys/crypto/xform.c b/sys/crypto/xform.c
index 71e173b44..9775d030c 100644
--- a/sys/crypto/xform.c
+++ b/sys/crypto/xform.c
@@ -125,6 +125,7 @@ struct aes_xts_ctx {
rijndael_ctx key1;
rijndael_ctx key2;
u_int8_t tweak[AES_XTS_BLOCKSIZE];
+   u_int8_t _pad[472];
 };

 /* Helper */


if the hibernate/resume process is able to complete with the padded
version, we could exclude the struct size of the equation. and if the
process fail, it means the size matters.

any additional thing to do for testing padding (initialisation or
something else) ? you know better than me this subsystem :)

Thanks.
-- 
Sebastien Marie



Re: Suspend-to-disk doesn't work anymore

2017-05-22 Thread Mike Belopuhov
On Thu, May 18, 2017 at 10:07 +0200, Sebastien Marie wrote:
> On Fri, May 12, 2017 at 03:11:35PM +, Natasha Kerensikova wrote:
> > >Synopsis:  Suspend-to-disk doesn't work anymore
> > >Category:  
> > >Environment:
> > System  : OpenBSD 6.1
> > Details : OpenBSD 6.1-current (GENERIC.MP) #6: Fri May 12 15:12:39 
> > CEST 2017
> >  
> > sema...@apollo.obspm.bsdfrog.org:/data/semarie/repos/openbsd/src/sys/arch/amd64/compile/GENERIC.MP
> > 
> > Architecture: OpenBSD.amd64
> > Machine : amd64
> > >Description:
> > On my Thinkpad X220 (with Core i5) with full disk encryption,
> > OpenBSD doesn't resume after suspend to disk since my latest snanpshot
> > update (May 7th snapshot). Keeping the same userland and using kernels
> > helpfully provided by semarie, we bisected the problem to the commits
> > detailed below.
> > >How-To-Repeat:
> > Suspend-to-disk a live OpenBSD. On next boot, it should resume from
> > disk, but instead it starts a standard boot with dirty filesystems.
> > >Fix:
> > Reverting the commits identified on github mirror by the hashes
> > d223d7cb85c1f2f705da547a0134b949655abe6a ("Switch glxsb(4), VIA
> > padlock and AES-NI drivers over to the new AES") and
> > cb3087542b323ec5bf5db9dc64f0d54dc40cca40 ("Switch OCF and IPsec over
> > to the new AES") fixes the problem, at least until commit
> > 50f8ee3e5db5b40ae9a05db4742b05e8d975573d (May 11th).
> > 
> 
> With Natacha, we continued a bit a try to debug the problem.
>

Thank you for a follow up mail.  If you can find more info,
this would be helpful.

> By activating HIB_DEBUG, the resume showed that it failed due to wrong magic 
> number:
> 
> [...]
> sd1 at scsibus3 targ 1 lun 0: <OPENBSD, SR CRYPTO, 006> SCSI2 0/direct fixed
> sd1: 953866MB, 512 bytes/sector, 1953519473 sectors
> root on sd1a (63848a4fade4a944.a) swap on sd1b dump on sd1b
> reading hibernate signature block location: 8641783
> wrong magic number in hibernate signature: e82daa08
> 
> I am unsure the reason: it could be the hibernate part that don't write
> it correctly or the resume part that don't read it correctly ? I dunno.
> 
> By "correctly" I mean: wrong aes key ? use of uninitialised or garbaged
> struct ? Something that results a "bad state" on writing or reading.
> 
> 
> With the last commit to revert AES_XTS to rijndael, I pushed it on
> top of the tested tree (7 days old). The hibernate/resume works again.
> 
> It makes it to confirm the problem is related to the switch to
> constant-time-aes in the context of full-disk-encryption.
>

Thanks for verifying this.  I've looked through the sr_hibernate_io
(that's hib->io_func) but couldn't find anything wrong with it. The
only thing that springs to mind is that AES_CTX and therefore the
XTS context (aes_xts_ctx) is larger and requires more stack space.
Though I can't see what might be affected by that.

> Regarding the problem itself, I don't know enough the crypto part and
> the initialisation code path to figure the reason. Does aes.c has some
> initialisation that would arrive later than rijndael.c ? resulting a
> first read on disk with wrong key or uninitialised structure ? I dunno.

No.  Otherwise we would see this kind of issues elsewhere.

> I just hope this problem doesn't hide a more subtile underlined problem.
>

It probably does.

> I expect the problem to be fixed in next snapshot (a one including the revert
> of AES_XTS to rijndael).
> 
> Thanks.
> -- 
> Sebastien Marie
> 



Re: Suspend-to-disk doesn't work anymore

2017-05-18 Thread Sebastien Marie
On Fri, May 12, 2017 at 03:11:35PM +, Natasha Kerensikova wrote:
> >Synopsis:Suspend-to-disk doesn't work anymore
> >Category:
> >Environment:
>   System  : OpenBSD 6.1
>   Details : OpenBSD 6.1-current (GENERIC.MP) #6: Fri May 12 15:12:39 
> CEST 2017
>
> sema...@apollo.obspm.bsdfrog.org:/data/semarie/repos/openbsd/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   On my Thinkpad X220 (with Core i5) with full disk encryption,
>   OpenBSD doesn't resume after suspend to disk since my latest snanpshot
>   update (May 7th snapshot). Keeping the same userland and using kernels
>   helpfully provided by semarie, we bisected the problem to the commits
>   detailed below.
> >How-To-Repeat:
>   Suspend-to-disk a live OpenBSD. On next boot, it should resume from
>   disk, but instead it starts a standard boot with dirty filesystems.
> >Fix:
>   Reverting the commits identified on github mirror by the hashes
>   d223d7cb85c1f2f705da547a0134b949655abe6a ("Switch glxsb(4), VIA
>   padlock and AES-NI drivers over to the new AES") and
>   cb3087542b323ec5bf5db9dc64f0d54dc40cca40 ("Switch OCF and IPsec over
>   to the new AES") fixes the problem, at least until commit
>   50f8ee3e5db5b40ae9a05db4742b05e8d975573d (May 11th).
> 

With Natacha, we continued a bit a try to debug the problem.

By activating HIB_DEBUG, the resume showed that it failed due to wrong magic 
number:

[...]
sd1 at scsibus3 targ 1 lun 0: <OPENBSD, SR CRYPTO, 006> SCSI2 0/direct fixed
sd1: 953866MB, 512 bytes/sector, 1953519473 sectors
root on sd1a (63848a4fade4a944.a) swap on sd1b dump on sd1b
reading hibernate signature block location: 8641783
wrong magic number in hibernate signature: e82daa08

I am unsure the reason: it could be the hibernate part that don't write
it correctly or the resume part that don't read it correctly ? I dunno.

By "correctly" I mean: wrong aes key ? use of uninitialised or garbaged
struct ? Something that results a "bad state" on writing or reading.


With the last commit to revert AES_XTS to rijndael, I pushed it on
top of the tested tree (7 days old). The hibernate/resume works again.

It makes it to confirm the problem is related to the switch to
constant-time-aes in the context of full-disk-encryption.

Regarding the problem itself, I don't know enough the crypto part and
the initialisation code path to figure the reason. Does aes.c has some
initialisation that would arrive later than rijndael.c ? resulting a
first read on disk with wrong key or uninitialised structure ? I dunno.
I just hope this problem doesn't hide a more subtile underlined problem.

I expect the problem to be fixed in next snapshot (a one including the revert
of AES_XTS to rijndael).

Thanks.
-- 
Sebastien Marie