Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Andy Lutomirski



> On Apr 19, 2021, at 11:33 AM, Dave Hansen  wrote:
> 
> On 4/19/21 11:10 AM, Andy Lutomirski wrote:
>> I’m confused by this scenario. This should only affect physical pages
>> that are in the 2M area that contains guest memory. But, if we have a
>> 2M direct map PMD entry that contains kernel data and guest private
>> memory, we’re already in a situation in which the kernel touching
>> that memory would machine check, right?
> 
> Not machine check, but page fault.  Do machine checks even play a
> special role in SEV-SNP?  I thought that was only TDX?

Brain fart.

> 
> My point was just that you can't _easily_ do the 2M->4k kernel mapping
> demotion in a page fault handler, like I think Borislav was suggesting.

We are certainly toast if this hits the stack.  Or if it hits a page table or 
the GDT or IDT :). The latter delightful choices would be triple faults.

I sure hope the code we use to split a mapping is properly NMI safe.
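
For what it's worth, the "split ahead of time" approach Brijesh describes
could look roughly like this (a sketch only: set_memory_4k() is the existing
x86 helper, and rmp_make_private() stands in for whatever the series does
around RMPUPDATE):

#include <asm/set_memory.h>
#include <asm/page.h>

/*
 * Sketch: demote the direct-map entry covering this pfn to 4K *before*
 * the RMP entry is created, so the host never has a 2M mapping over a
 * page whose RMP entry is 4K.
 */
static int host_rmp_make_private(u64 pfn)
{
        unsigned long vaddr = (unsigned long)pfn_to_kaddr(pfn);
        int ret;

        ret = set_memory_4k(vaddr, 1);
        if (ret)
                return ret;

        return rmp_make_private(pfn);   /* placeholder for the RMPUPDATE path */
}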

> 
>> ISTM we should fully unmap any guest private page from the kernel and
>> all host user pagetables before actually making it be a guest private
>> page.
> 
> Yes, that sounds attractive.  Then, we'd actually know if the host
> kernel was doing stray reads somehow because we'd get a fault there too.




Re: [RFC Part2 PATCH 04/30] x86/mm: split the physmap when adding the page in RMP table

2021-04-19 Thread Andy Lutomirski



> On Apr 19, 2021, at 10:58 AM, Dave Hansen  wrote:
> 
> On 4/19/21 10:46 AM, Brijesh Singh wrote:
>> - guest wants to make gpa 0x1000 as a shared page. To support this, we
>> need to psmash the large RMP entry into 512 4K entries. The psmash
>> instruction breaks the large RMP entry into 512 4K entries without
>> affecting the previous validation. Now we need to force the host to
>> use the 4K page level instead of the 2MB.
>> 
>> To my understanding, the Linux kernel fault handler does not build the page
>> tables on demand for kernel addresses. All kernel addresses are
>> pre-mapped at boot. Currently, I am proactively splitting the physmap
>> to avoid running into the situation where the x86 page level is greater than the
>> RMP page level.
> 
> In other words, if the host maps guest memory with 2M mappings, the
> guest can induce page faults in the host.  The only way the host can
> avoid this is to map everything with 4k mappings.
> 
> If the host does not avoid this, it could end up in the situation where
> it gets page faults on access to kernel data structures.  Imagine if a
> kernel stack page ended up in the same 2M mapping as a guest page.  I
> *think* the next write to the kernel stack would end up double-faulting.

I’m confused by this scenario. This should only affect physical pages that are 
in the 2M area that contains guest memory. But, if we have a 2M direct map PMD 
entry that contains kernel data and guest private memory, we’re already in a 
situation in which the kernel touching that memory would machine check, right?

ISTM we should fully unmap any guest private page from the kernel and all host 
user pagetables before actually making it be a guest private page.
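
A rough sketch of that, using the existing direct-map helpers (not the actual
series; error handling and the user-mapping side are omitted):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

static int guest_private_unmap(struct page *page)
{
        unsigned long addr = (unsigned long)page_address(page);
        int ret;

        /* Drop the page from the kernel direct map... */
        ret = set_direct_map_invalid_noflush(page);
        if (ret)
                return ret;

        /* ...and make sure no stale TLB entry can still reach it. */
        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        return 0;
}

static void guest_private_remap(struct page *page)
{
        /* Restore the direct-map entry when the page becomes shared again. */
        set_direct_map_default_noflush(page);
}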

Re: [RFC Part2 PATCH 06/30] x86/fault: dump the RMP entry on #PF

2021-03-24 Thread Andy Lutomirski
On Wed, Mar 24, 2021 at 10:04 AM Brijesh Singh  wrote:
>
> If hardware detects an RMP violation, it will raise a page-fault exception
> with the RMP bit set. To help the debug, dump the RMP entry of the faulting
> address.
>
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Joerg Roedel 
> Cc: "H. Peter Anvin" 
> Cc: Tony Luck 
> Cc: Dave Hansen 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Paolo Bonzini 
> Cc: Tom Lendacky 
> Cc: David Rientjes 
> Cc: Sean Christopherson 
> Cc: x...@kernel.org
> Cc: k...@vger.kernel.org
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/mm/fault.c | 75 +
>  1 file changed, 75 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f39b551f89a6..7605e06a6dd9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -31,6 +31,7 @@
>  #include  /* VMALLOC_START, ...   */
>  #include   /* kvm_handle_async_pf  */
>  #include   /* fixup_vdso_exception()   */
> +#include/* lookup_rmpentry ...  */
>
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -147,6 +148,76 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
>  DEFINE_SPINLOCK(pgd_lock);
>  LIST_HEAD(pgd_list);
>
> +static void dump_rmpentry(struct page *page, rmpentry_t *e)
> +{
> +   unsigned long paddr = page_to_pfn(page) << PAGE_SHIFT;
> +
> +   pr_alert("RMPEntry paddr 0x%lx [assigned=%d immutable=%d pagesize=%d gpa=0x%lx asid=%d "
> +   "vmsa=%d validated=%d]\n", paddr, rmpentry_assigned(e), rmpentry_immutable(e),
> +   rmpentry_pagesize(e), rmpentry_gpa(e), rmpentry_asid(e), rmpentry_vmsa(e),
> +   rmpentry_validated(e));
> +   pr_alert("RMPEntry paddr 0x%lx %016llx %016llx\n", paddr, e->high, e->low);
> +}
> +
> +static void show_rmpentry(unsigned long address)
> +{
> +   struct page *page = virt_to_page(address);

This is an error path, and I don't think you have any particular
guarantee that virt_to_page(address) is valid.  Please add appropriate
validation or use one of the slow lookup helpers.
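
Something along these lines would do (a sketch; the RMP lookup and dump are
elided since that part can stay as the patch has it):

#include <linux/mm.h>

static void show_rmpentry(unsigned long address)
{
        struct page *page;

        /*
         * The faulting address may be a vmalloc, module, or user
         * address; virt_to_page() is only meaningful for addresses
         * that have a linear mapping.
         */
        if (!virt_addr_valid(address))
                return;

        page = virt_to_page(address);
        /* ... look up and dump the RMP entry as in the patch ... */
}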


Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

2021-02-24 Thread Andy Lutomirski
On Tue, Feb 23, 2021 at 4:54 PM Dey, Megha  wrote:
>
> Hi Andy,
>
> On 1/24/2021 8:23 AM, Andy Lutomirski wrote:
> > On Fri, Jan 22, 2021 at 11:29 PM Megha Dey  wrote:
> >> Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
> >> (first implemented on Intel's Icelake client and Xeon CPUs).
> >>
> >> These algorithms take advantage of the AVX512 registers to keep the CPU
> >> busy and increase memory bandwidth utilization. They provide substantial
> >> (2-10x) improvements over existing crypto algorithms when update data size
> >> is greater than 128 bytes and do not have any significant impact when used
> >> on small amounts of data.
> >>
> >> However, these algorithms may also incur a frequency penalty and cause
> >> collateral damage to other workloads running on the same core(co-scheduled
> >> threads). These frequency drops are also known as bin drops where 1 bin
> >> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
> >> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
> >> are observed on the Icelake server.
> >>
> >> The AVX512 optimizations are disabled by default to avoid impact on other
> >> workloads. In order to use these optimized algorithms:
> >> 1. At compile time:
> >> a. User must enable CONFIG_CRYPTO_AVX512 option
> >> b. Toolchain(assembler) must support VPCLMULQDQ and VAES instructions
> >> 2. At run time:
> >> a. User must set module parameter use_avx512 at boot time
> >> b. Platform must support VPCLMULQDQ and VAES features
> >>
> >> N.B. It is unclear whether these coarse grain controls(global module
> >> parameter) would meet all user needs. Perhaps some per-thread control might
> >> be useful? Looking for guidance here.
> >
> > I've just been looking at some performance issues with in-kernel AVX,
> > and I have a whole pile of questions that I think should be answered
> > first:
> >
> > What is the impact of using an AVX-512 instruction on the logical
> > thread, its siblings, and other cores on the package?
> >
> > Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX 
> > insn?
> >
> > What is the impact on subsequent shorter EVEX, VEX, and legacy
> > SSE(2,3, etc) insns?
> >
> > How does VZEROUPPER figure in?  I can find an enormous amount of
> > misinformation online, but nothing authoritative.
> >
> > What is the effect of the AVX-512 states (5-7) being “in use”?  As far
> > as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
> > and its variants.  Is this correct?
> >
> > On AVX-512 capable CPUs, do we ever get a penalty for executing a
> > non-VEX insn followed by a large-width EVEX insn without an
> > intervening VZEROUPPER?  The docs suggest no, since Broadwell and
> > before don’t support EVEX, but I’d like to know for sure.
> >
> >
> > My current opinion is that we should not enable AVX-512 in-kernel
> > except on CPUs that we determine have good AVX-512 support.  Based on
> > some reading, that seems to mean Ice Lake Client and not anything
> > before it.  I also think a bunch of the above questions should be
> > answered before we do any of this.  Right now we have a regression of
> > unknown impact in regular AVX support in-kernel, we will have
> > performance issues in-kernel depending on what user code has done
> > recently, and I'm still trying to figure out what to do about it.
> > Throwing AVX-512 into the mix without real information is not going to
> > improve the situation.
>
> We are currently working on providing you with answers on the questions
> you have raised regarding AVX.

Thanks!


Re: [RFC V2 0/5] Introduce AVX512 optimized crypto algorithms

2021-01-24 Thread Andy Lutomirski
On Fri, Jan 22, 2021 at 11:29 PM Megha Dey  wrote:
>
> Optimize crypto algorithms using AVX512 instructions - VAES and VPCLMULQDQ
> (first implemented on Intel's Icelake client and Xeon CPUs).
>
> These algorithms take advantage of the AVX512 registers to keep the CPU
> busy and increase memory bandwidth utilization. They provide substantial
> (2-10x) improvements over existing crypto algorithms when update data size
> is greater than 128 bytes and do not have any significant impact when used
> on small amounts of data.
>
> However, these algorithms may also incur a frequency penalty and cause
> collateral damage to other workloads running on the same core(co-scheduled
> threads). These frequency drops are also known as bin drops where 1 bin
> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
> are observed on the Icelake server.
>
> The AVX512 optimizations are disabled by default to avoid impact on other
> workloads. In order to use these optimized algorithms:
> 1. At compile time:
>a. User must enable CONFIG_CRYPTO_AVX512 option
>b. Toolchain(assembler) must support VPCLMULQDQ and VAES instructions
> 2. At run time:
>a. User must set module parameter use_avx512 at boot time
>b. Platform must support VPCLMULQDQ and VAES features
>
> N.B. It is unclear whether these coarse grain controls(global module
> parameter) would meet all user needs. Perhaps some per-thread control might
> be useful? Looking for guidance here.


I've just been looking at some performance issues with in-kernel AVX,
and I have a whole pile of questions that I think should be answered
first:

What is the impact of using an AVX-512 instruction on the logical
thread, its siblings, and other cores on the package?

Does the impact depend on whether it’s a 512-bit insn or a shorter EVEX insn?

What is the impact on subsequent shorter EVEX, VEX, and legacy
SSE(2,3, etc) insns?

How does VZEROUPPER figure in?  I can find an enormous amount of
misinformation online, but nothing authoritative.

What is the effect of the AVX-512 states (5-7) being “in use”?  As far
as I can tell, the only operations that clear XINUSE[5-7] are XRSTOR
and its variants.  Is this correct?

On AVX-512 capable CPUs, do we ever get a penalty for executing a
non-VEX insn followed by a large-width EVEX insn without an
intervening VZEROUPPER?  The docs suggest no, since Broadwell and
before don’t support EVEX, but I’d like to know for sure.


My current opinion is that we should not enable AVX-512 in-kernel
except on CPUs that we determine have good AVX-512 support.  Based on
some reading, that seems to mean Ice Lake Client and not anything
before it.  I also think a bunch of the above questions should be
answered before we do any of this.  Right now we have a regression of
unknown impact in regular AVX support in-kernel, we will have
performance issues in-kernel depending on what user code has done
recently, and I'm still trying to figure out what to do about it.
Throwing AVX-512 into the mix without real information is not going to
improve the situation.
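
Concretely, the gating I have in mind would look something like this (a
sketch only: the model allowlist is illustrative, and use_avx512 is the
module parameter from the cover letter):

#include <asm/cpu_device_id.h>
#include <asm/intel-family.h>
#include <asm/fpu/xstate.h>

/* The series' boot-time opt-in knob. */
extern bool use_avx512;

/* Parts believed to take at most small frequency hits: illustrative. */
static const struct x86_cpu_id avx512_ok_cpus[] = {
        X86_MATCH_INTEL_FAM6_MODEL(ICELAKE,   NULL),
        X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
        {}
};

static bool crypto_avx512_usable(void)
{
        return use_avx512 &&
               x86_match_cpu(avx512_ok_cpus) &&
               boot_cpu_has(X86_FEATURE_VAES) &&
               boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
               cpu_has_xfeatures(XFEATURE_MASK_AVX512, NULL);
}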


Re: [RFC PATCH 0/8] x86: Support Intel Key Locker

2020-12-19 Thread Andy Lutomirski
On Wed, Dec 16, 2020 at 9:46 AM Chang S. Bae  wrote:
>
> Key Locker [1][2] is a new security feature available in new Intel CPUs to
> protect data encryption keys for the Advanced Encryption Standard
> algorithm. The protection limits the amount of time an AES key is exposed
> in memory by sealing a key and referencing it with new AES instructions.

I think some fundamental issues need to be worked out before we can
enable key locker upstream at all.

First, how fast is LOADIWKEY?  Does it depend on the mode?  Is it
credible to context switch the wrapping key?

Second, on bare metal, we need to decide whether to use a random wrapping key
or a software-provided wrapping key.  Both choices have pros and cons,
and it's not clear to me whether Linux should have a boot-time
parameter, a runtime control, a fixed value, or something else.  If we
use a random key, we need to figure out what to do about S5 and
hibernation.  No matter what we do, we're going to have some issues
with CRIU.

We also need to understand the virtualization situation.  What do we
expect hypervisors to do with Key Locker?  The only obviously
performant way I can see for VMMs to support migration is to use the
same wrapping key fleetwide.  (This is also the only way I can see for
VMMs to manage the wrapping key in a way that a side channel can't
extract it from hypervisor memory.)  But VMMs can't do this without
some degree of cooperation from the guest.  Perhaps we should disable
KL if CPUID.HYPERVISOR is set for now?
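
I.e., something like this at feature-detection time (a sketch;
X86_FEATURE_KEYLOCKER is the feature bit the series proposes, not something
already in mainline):

#include <asm/cpufeature.h>

static void __init keylocker_check_environment(void)
{
        /* No sane IWKey/migration story under a hypervisor yet. */
        if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                setup_clear_cpu_cap(X86_FEATURE_KEYLOCKER);
}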

It's a shame that the spec seems to have some holes in the key
management mechanisms.  It would be very nice if there was a way to
load IWKey from an SGX enclave, and it would also be nice if there was
a way to load an IWKey that is wrapped by a different key.  Also, for
non-random IWKey values, there doesn't seem to be a way for software
(in an enclave or otherwise) to confirm that it's wrapping an AES key
against a particular wrapping key, which seems to severely limit the
ability to safely provision a new wrapped key at runtime.

--Andy


Re: [RFC PATCH 7/8] crypto: x86/aes-kl - Support AES algorithm using Key Locker instructions

2020-12-17 Thread Andy Lutomirski
On Wed, Dec 16, 2020 at 9:46 AM Chang S. Bae  wrote:
>
> Key Locker (KL) is Intel's new security feature that protects the AES key
> at the time of data transformation. New AES SIMD instructions -- as a
> successor of Intel's AES-NI -- are provided to encode an AES key and
> reference it for the AES algorithm.
>
> New instructions support 128/256-bit keys. While it is not desirable to
> receive any 192-bit key, AES-NI instructions are taken to serve this size.
>
> New instructions are operational in both 32-/64-bit modes.
>
> Add a set of new macros for the new instructions so that no new binutils
> version is required.
>
> Implemented methods are for a single block as well as ECB, CBC, CTR, and
> XTS modes. The methods are not compatible with other AES implementations, as
> they access an encrypted key instead of the normal AES key.
>
> setkey() call encodes an AES key. User may displace the AES key once
> encoded, as encrypt()/decrypt() methods do not need the key.
>
> Most C code follows the AES-NI implementation. It has higher priority than
> the AES-NI as providing key protection.

What does this patch *do*?

IKL gives a few special key slots that have certain restrictions and
certain security properties.  What can you use them for?  With this
series installed, what is the user-visible effect?  Is there a new
API?  Do you use them with the netlink user crypto interface?  Do you
use them for encrypting disks?  Swap?  How?  How do you allocate,
reset, and free keys?  Who has permissions to use them?


Re: [PATCH] random: remove dead code left over from blocking pool

2020-09-16 Thread Andy Lutomirski
On Tue, Sep 15, 2020 at 9:38 PM Eric Biggers  wrote:
> From: Eric Biggers 
>
> Remove some dead code that was left over following commit 90ea1c6436d2
> ("random: remove the blocking pool").
>

Looks good to me.

Reviewed-by: Andy Lutomirski 


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-07 Thread Andy Lutomirski
On Mon, Oct 7, 2019 at 8:12 AM Ard Biesheuvel  wrote:
>
> On Mon, 7 Oct 2019 at 17:02, Andy Lutomirski  wrote:
> >
> > On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel
> >  wrote:
> > >
> > > On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski  wrote:
> > > >
> >
> > > > > Actually, this can be addressed by retaining the module dependencies
> > > > > as before, but permitting the arch module to be omitted at load time.
> > > >
> > > > I think that, to avoid surprises, you should refuse to load the arch 
> > > > module if the generic module is loaded, too.
> > > >
> > >
> > > Most arch code depends on CPU features that may not be available given
> > > the context, either because they are SIMD or because they are optional
> > > CPU instructions. So we need both modules at the same time anyway, so
> > > that we can fall back to the generic code at runtime.
> > >
> > > So what I'd like is to have the generic module provide the library
> > > interface, but rely on arch modules that are optional.
> > >
> > > We already have 95% of what we need with weak references. We have the
> > > ability to test for presence of the arch code at runtime, and we even
> > > have code patching for all architectures (through static relocations).
> > >
> > > However, one could argue that this is more a [space] optimization than
> > > anything else, so I am willing to park this discussion until support
> > > for static calls has been merged, and proceed with something simpler.
> >
> > I'd suggest tabling it until static calls are merged.  PeterZ just
> > sent a new patchbomb for it anyway.
> >
>
> As it turns out, static calls are a poor fit for this. Imagine an interface 
> like
>
> poly1305_init(state)
> poly1305_update(state, input)
> poly1305_final(state, digest)
>
> which can be implemented by different libraries. The problem is that
> state is opaque, and so it is generally not guaranteed that a sequence
> that was started using one implementation can be completed using
> another one.
>
> Since the whole point is having a simple library interface, and
> switching one implementation out for another would require RCU hooks or
> other crazy plumbing to ensure that no calls are in progress, I don't
> think static calls are suitable for this use case.
>
> > What I'm trying to get at here and apparently saying badly is that I
> > want to avoid a situation where lsmod shows the arch module loaded but
> > the arch code isn't actually executing.
>
> My goal here is to allow the generic library to be loaded with or
> without the arch code, with the arch code always being used when it is
> loaded. This is what I implemented using weak references, but it
> requires a tweak in the module loader (two lines but not pretty).
> Using weak references preserves the dependency order, since the
> generic module will depend on the arch module (and up the refcount) if
> any of its weak references were fulfilled by the arch module in
> question. Using static calls will invert the dependency relation,
> since the arch code will need to perform a static_call_update() to
> make [users of] the generic library point to its code. How this works
> with managing the module refcounts and unload order is an open
> question afaict.

Indeed.  Dealing with unloading when static calls are in use may be messy.

>
> > Regardless of how everything
> > gets wired up (static calls, weak refs, etc), the system's behavior
> > should match the system's configuration, which means that we should
> > not allow any improper order of loading things so that everything
> > *appears* to be loaded but does not actually function.
> >
>
> Yes. that is the whole point.
>
> > Saying "modprobe will do the right thing and let's not worry about
> > silly admins using insmod directly" is not a good solution.
>
> Agreed.
>
> I have disregarded static calls and weak references for the time
> being, and I will proceed with an implementation that uses neither.
> The downside of this is that, e.g., we are forced to load the NEON
> module on non-NEON capable hardware without calling any of its code,
> but this is basically the Zinc situation only with the NEON and the
> generic code living in different modules.

Makes sense.  It can always be improved after the initial
implementation is merged.
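
For reference, the weak-reference arrangement under discussion boils down to
roughly this on the C side (a sketch with illustrative names; resolving the
reference across modules is what needs the module-loader tweak mentioned
above):

#include <linux/types.h>

struct chacha20_ctx;

/* Generic fallback, always present in this module. */
void chacha20_generic(struct chacha20_ctx *ctx, u8 *dst,
                      const u8 *src, size_t len);

/* Optional: satisfied only if the arch module is loaded. */
extern bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
                          const u8 *src, size_t len) __weak;

void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src, size_t len)
{
        /* An unresolved weak reference compares as NULL. */
        if (chacha20_arch && chacha20_arch(ctx, dst, src, len))
                return;

        chacha20_generic(ctx, dst, src, len);
}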


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-07 Thread Andy Lutomirski
On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel
 wrote:
>
> On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski  wrote:
> >

> > > Actually, this can be addressed by retaining the module dependencies
> > > as before, but permitting the arch module to be omitted at load time.
> >
> > I think that, to avoid surprises, you should refuse to load the arch module 
> > if the generic module is loaded, too.
> >
>
> Most arch code depends on CPU features that may not be available given
> the context, either because they are SIMD or because they are optional
> CPU instructions. So we need both modules at the same time anyway, so
> that we can fall back to the generic code at runtime.
>
> So what I'd like is to have the generic module provide the library
> interface, but rely on arch modules that are optional.
>
> We already have 95% of what we need with weak references. We have the
> ability to test for presence of the arch code at runtime, and we even
> have code patching for all architectures (through static relocations).
>
> However, one could argue that this is more a [space] optimization than
> anything else, so I am willing to park this discussion until support
> for static calls has been merged, and proceed with something simpler.

I'd suggest tabling it until static calls are merged.  PeterZ just
sent a new patchbomb for it anyway.

What I'm trying to get at here and apparently saying badly is that I
want to avoid a situation where lsmod shows the arch module loaded but
the arch code isn't actually executing.  Regardless of how everything
gets wired up (static calls, weak refs, etc), the system's behavior
should match the system's configuration, which means that we should
not allow any improper order of loading things so that everything
*appears* to be loaded but does not actually function.

Saying "modprobe will do the right thing and let's not worry about
silly admins using insmod directly" is not a good solution.


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-06 Thread Andy Lutomirski



> On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel  wrote:
> 
> On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel  
> wrote:
>> 
>>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski  wrote:
>>> 
>>> 
>>> 
>>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel  
>>>> wrote:
>>>> 
>>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld  wrote:
>>>>> 
>>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
>>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel  
>>>>>> wrote:
>>>>>>> 
>>>>>> ...
>>>>>>> 
>>>>>>> In the future, I would like to extend these interfaces to use static 
>>>>>>> calls,
>>>>>>> so that the accelerated implementations can be [un]plugged at runtime. 
>>>>>>> For
>>>>>>> the time being, we rely on weak aliases and conditional exports so that 
>>>>>>> the
>>>>>>> users of the library interfaces link directly to the accelerated 
>>>>>>> versions,
>>>>>>> but without the ability to unplug them.
>>>>>>> 
>>>>>> 
>>>>>> As it turns out, we don't actually need static calls for this.
>>>>>> Instead, we can simply permit weak symbol references to go unresolved
>>>>>> between modules (as we already do in the kernel itself, due to the
>>>>>> fact that ELF permits it), and have the accelerated code live in
>>>>>> separate modules that may not be loadable on certain systems, or be
>>>>>> blacklisted by the user.
>>>>> 
>>>>> You're saying that at module insertion time, the kernel will override
>>>>> weak symbols with those provided by the module itself? At runtime?
>>>>> 
>>>> 
>>>> Yes.
>>>> 
>>>>> Do you know offhand how this patching works? Is there a PLT that gets
>>>>> patched, and so the calls all go through a layer of function pointer
>>>>> indirection? Or are all call sites fixed up at insertion time and the
>>>>> call instructions rewritten with some runtime patching magic?
>>>>> 
>>>> 
>>>> No magic. Take curve25519 for example, when built for ARM:
>>>> 
>>>>  :
>>>>  0:   f240 0300   movwr3, #0
>>>>   0: R_ARM_THM_MOVW_ABS_NCcurve25519_arch
>>>>  4:   f2c0 0300   movtr3, #0
>>>>   4: R_ARM_THM_MOVT_ABS   curve25519_arch
>>>>  8:   b570push{r4, r5, r6, lr}
>>>>  a:   4604mov r4, r0
>>>>  c:   460dmov r5, r1
>>>>  e:   4616mov r6, r2
>>>> 10:   b173cbz r3, 30 
>>>> 12:   f7ff fffe   bl  0 
>>>>   12: R_ARM_THM_CALL  curve25519_arch
>>>> 16:   b158cbz r0, 30 
>>>> 18:   4620mov r0, r4
>>>> 1a:   2220movsr2, #32
>>>> 1c:   f240 0100   movwr1, #0
>>>>   1c: R_ARM_THM_MOVW_ABS_NC   .LANCHOR0
>>>> 20:   f2c0 0100   movtr1, #0
>>>>   20: R_ARM_THM_MOVT_ABS  .LANCHOR0
>>>> 24:   f7ff fffe   bl  0 <__crypto_memneq>
>>>>   24: R_ARM_THM_CALL  __crypto_memneq
>>>> 28:   3000addsr0, #0
>>>> 2a:   bf18it  ne
>>>> 2c:   2001movne   r0, #1
>>>> 2e:   bd70pop {r4, r5, r6, pc}
>>>> 30:   4632mov r2, r6
>>>> 32:   4629mov r1, r5
>>>> 34:   4620mov r0, r4
>>>> 36:   f7ff fffe   bl  0 
>>>>   36: R_ARM_THM_CALL  curve25519_generic
>>>> 3a:   e7edb.n 18 
>>>> 
>>>> curve25519_arch is a weak reference. It either gets satisfied at
>>>> module load time, or it doesn't.
>>>> 
>>>> If it does get satisfied, the relocations covering the movw/movt pair
>>>> and the one covering the bl instruction get updated so that they point
>>>> to the arch routine.
>>>> 
>>>> If it does not get satisfied, the

Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-04 Thread Andy Lutomirski



> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel  wrote:
> 
> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld  wrote:
>> 
>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel  
>>> wrote:
 
>>> ...
 
 In the future, I would like to extend these interfaces to use static calls,
 so that the accelerated implementations can be [un]plugged at runtime. For
 the time being, we rely on weak aliases and conditional exports so that the
 users of the library interfaces link directly to the accelerated versions,
 but without the ability to unplug them.
 
>>> 
>>> As it turns out, we don't actually need static calls for this.
>>> Instead, we can simply permit weak symbol references to go unresolved
>>> between modules (as we already do in the kernel itself, due to the
>>> fact that ELF permits it), and have the accelerated code live in
>>> separate modules that may not be loadable on certain systems, or be
>>> blacklisted by the user.
>> 
>> You're saying that at module insertion time, the kernel will override
>> weak symbols with those provided by the module itself? At runtime?
>> 
> 
> Yes.
> 
>> Do you know offhand how this patching works? Is there a PLT that gets
>> patched, and so the calls all go through a layer of function pointer
>> indirection? Or are all call sites fixed up at insertion time and the
>> call instructions rewritten with some runtime patching magic?
>> 
> 
> No magic. Take curve25519 for example, when built for ARM:
> 
>  :
>   0:   f240 0300   movwr3, #0
>0: R_ARM_THM_MOVW_ABS_NCcurve25519_arch
>   4:   f2c0 0300   movtr3, #0
>4: R_ARM_THM_MOVT_ABS   curve25519_arch
>   8:   b570push{r4, r5, r6, lr}
>   a:   4604mov r4, r0
>   c:   460dmov r5, r1
>   e:   4616mov r6, r2
>  10:   b173cbz r3, 30 
>  12:   f7ff fffe   bl  0 
>12: R_ARM_THM_CALL  curve25519_arch
>  16:   b158cbz r0, 30 
>  18:   4620mov r0, r4
>  1a:   2220movsr2, #32
>  1c:   f240 0100   movwr1, #0
>1c: R_ARM_THM_MOVW_ABS_NC   .LANCHOR0
>  20:   f2c0 0100   movtr1, #0
>20: R_ARM_THM_MOVT_ABS  .LANCHOR0
>  24:   f7ff fffe   bl  0 <__crypto_memneq>
>24: R_ARM_THM_CALL  __crypto_memneq
>  28:   3000addsr0, #0
>  2a:   bf18it  ne
>  2c:   2001movne   r0, #1
>  2e:   bd70pop {r4, r5, r6, pc}
>  30:   4632mov r2, r6
>  32:   4629mov r1, r5
>  34:   4620mov r0, r4
>  36:   f7ff fffe   bl  0 
>36: R_ARM_THM_CALL  curve25519_generic
>  3a:   e7edb.n 18 
> 
> curve25519_arch is a weak reference. It either gets satisfied at
> module load time, or it doesn't.
> 
> If it does get satisfied, the relocations covering the movw/movt pair
> and the one covering the bl instruction get updated so that they point
> to the arch routine.
> 
> If it does not get satisfied, the relocations are disregarded, in
> which case the cbz instruction at offset 0x10 jumps over the bl call.
> 
> Note that this does not involve any memory accesses. It does involve
> some code patching, but only of the kind the module loader already
> does.

Won’t this have the counterintuitive property that, if you load the modules in 
the opposite order, the reference won’t be re-resolved and performance will 
silently regress?

I think it might be better to allow two different modules to export the same 
symbol but only allow one of them to be loaded. Or use static calls.

Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-04 Thread Andy Lutomirski



> On Oct 4, 2019, at 6:42 AM, Jason A. Donenfeld  wrote:
> 
> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel  
>>> wrote:
>>> 
>> ...
>>> 
>>> In the future, I would like to extend these interfaces to use static calls,
>>> so that the accelerated implementations can be [un]plugged at runtime. For
>>> the time being, we rely on weak aliases and conditional exports so that the
>>> users of the library interfaces link directly to the accelerated versions,
>>> but without the ability to unplug them.
>>> 
>> 
>> As it turns out, we don't actually need static calls for this.
>> Instead, we can simply permit weak symbol references to go unresolved
>> between modules (as we already do in the kernel itself, due to the
>> fact that ELF permits it), and have the accelerated code live in
>> separate modules that may not be loadable on certain systems, or be
>> blacklisted by the user.
> 
> You're saying that at module insertion time, the kernel will override
> weak symbols with those provided by the module itself? At runtime?
> 
> Do you know offhand how this patching works? Is there a PLT that gets
> patched, and so the calls all go through a layer of function pointer
> indirection? Or are all call sites fixed up at insertion time and the
> call instructions rewritten with some runtime patching magic?
> 
> Unless the method is the latter, I would assume that static calls are
> much faster in general? Or the approach already in this series is
> perhaps fine enough, and we don't need to break this apart into
> individual modules complicating everything?

I admit I’m a bit mystified too. I think it would be great to have a feature 
where a special type of static call could be redirected by the module loader 
when a module that exports a corresponding symbol is loaded.  Unloading such a 
module would be interesting.

Ard, what exactly are you imagining?

Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

2019-09-26 Thread Andy Lutomirski
On Thu, Sep 26, 2019 at 8:54 PM Herbert Xu  wrote:
>
> On Thu, Sep 26, 2019 at 07:54:03PM -0700, Linus Torvalds wrote:
> >
> > Side note: almost nobody does this.
> >
> > Almost every single async interface I've ever seen ends up being "only
> > designed for async".
> >
> > And I think the reason is that everybody first does the simply
> > synchronous interfaces, and people start using those, and a lot of
> > people are perfectly happy with them. They are simple, and they work
> > fine for the huge majority of users.
>
> The crypto API is not the way it is because of async.  In fact, the
> crypto API started out as sync only and async was essentially
> bolted on top with minimial changes.

Then what's up with the insistence on using physical addresses for so
many of the buffers?

--Andy


Re: [RFC PATCH 18/18] net: wireguard - switch to crypto API for packet encryption

2019-09-26 Thread Andy Lutomirski
> On Sep 26, 2019, at 6:38 PM, Linus Torvalds  
> wrote:
>
> - let the caller know what the state size is and allocate the
> synchronous state in its own data structures
>
> - let the caller just call a static "decrypt_xyz()" function for xyz
> decryptrion.
>
> - if you end up doing it synchronously, that function just returns
> "done". No overhead. No extra allocations. No unnecessary stuff. Just
> do it, using the buffers provided. End of story. Efficient and simple.
>
> - BUT.
>
> - any hardware could have registered itself for "I can do xyz", and
> the decrypt_xyz() function would know about those, and *if* it has a
> list of accelerators (hopefully sorted by preference etc), it would
> try to use them. And if they take the job (they might not - maybe
> their queues are full, maybe they don't have room for new keys at the
> moment, which might be a separate setup from the queues), the
> "decrypt_xyz()" function returns a _cookie_ for that job. It's
> probably a pre-allocated one (the hw accelerator might preallocate a
> fixed number of in-progress data structures).

To really do this right, I think this doesn't go far enough.  Suppose
I'm trying to implement send() over a VPN very efficiently.  I want to
do, roughly, this:

void __user *buf, etc;

if (crypto api thinks async is good) {
  copy buf to some kernel memory;
  set up a scatterlist;
  do it async with this callback;
} else {
  do the crypto synchronously, from *user* memory, straight to kernel memory;
  (or, if that's too complicated, maybe copy in little chunks to a
   little stack buffer.  Setting up a scatterlist is a waste of time.)
}

I don't know if the network code is structured in a way to make this
work easily, and the API would be more complex, but it could be nice
and fast.


Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API

2019-09-26 Thread Andy Lutomirski
On Thu, Sep 26, 2019 at 1:52 PM Jason A. Donenfeld  wrote:
>
> Hi Ard,
>
>
> Our goals are that chacha20_arch() from each of these arch glues gets
> included in the lib/crypto/chacha20.c compilation unit. The reason why
> we want it in its own unit is so that the inliner can get rid of
> unreached code and more tightly integrate the branches. For the MIPS
> case, the advantage is clear.

IMO this needs numbers.  My suggestion from way back, which is at
least a good deal of the way toward being doable, is to do static
calls.  This means that the common code will call out to the arch code
via a regular CALL instruction and will *not* inline the arch code.
This means that the arch code could live in its own module, it can be
selected at boot time, etc.  For x86, inlining seems a bit nuts to
avoid a whole mess of:

if (use avx2)
  do_avx2_thing();
else if (use avx1)
  do_avx1_thing();
else
  etc;

On x86, direct calls are pretty cheap.  Certainly for operations like
curve25519, I doubt you will ever see a real-world effect from
inlining.  I'd be surprised for chacha20.  If you really want inlining
to dictate the overall design, I think you need some real numbers for
why it's necessary.  There also needs to be a clear story for how
exactly making everything inline plays with the actual decision of
which implementation to use.  I think it's also worth noting that LTO
is coming.
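
If static calls do land, the shape I have in mind is roughly this (a sketch
only: chacha20_generic()/chacha20_avx2() are stand-ins, and the API names are
the ones from PeterZ's series):

#include <linux/static_call.h>
#include <asm/cpufeature.h>

/* Assumed implementations, provided elsewhere. */
extern void chacha20_generic(u32 *state, u8 *dst, const u8 *src, unsigned int len);
extern void chacha20_avx2(u32 *state, u8 *dst, const u8 *src, unsigned int len);

DEFINE_STATIC_CALL(chacha20_impl, chacha20_generic);

void chacha20(u32 *state, u8 *dst, const u8 *src, unsigned int len)
{
        /* One direct call; no if-ladder in the common code. */
        static_call(chacha20_impl)(state, dst, src, len);
}

/* e.g. from arch glue init, once the CPU features are known: */
static int __init chacha20_x86_init(void)
{
        if (boot_cpu_has(X86_FEATURE_AVX2))
                static_call_update(chacha20_impl, &chacha20_avx2);
        return 0;
}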

--Andy


Re: [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines

2018-10-05 Thread Andy Lutomirski
On Fri, Oct 5, 2018 at 10:28 AM Ard Biesheuvel
 wrote:
>
> On 5 October 2018 at 19:26, Andy Lutomirski  wrote:
> > On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
> >  wrote:
> >>
> >> On 5 October 2018 at 15:37, Jason A. Donenfeld  wrote:
> >> ...
> >> > Therefore, I think this patch goes in exactly the wrong direction. I
> >> > mean, if you want to introduce dynamic patching as a means for making
> >> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> >> > world, sure, go for it; that may very well be a good idea. But
> >> > presenting it as an alternative to Zinc very widely misses the point and
> >> > serves to prolong a series of bad design choices, which are now able to
> >> > be rectified by putting energy into Zinc instead.
> >> >
> >>
> >> This series has nothing to do with dynamic dispatch: the call sites
> >> call crypto functions using ordinary function calls (although my
> >> example uses CRC-T10DIF), and these calls are redirected via what is
> >> essentially a PLT entry, so that we can supersede those routines at
> >> runtime.
> >
> > If you really want to do it PLT-style, then just do:
> >
> > extern void whatever_func(args);
> >
> > Call it like:
> > whatever_func(args here);
> >
> > And rig up something to emit asm like:
> >
> > GLOBAL(whatever_func)
> >   jmpq default_whatever_func
> > ENDPROC(whatever_func)
> >
> > Architectures without support can instead do:
> >
> > void whatever_func(args)
> > {
> >   READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
> > }
> >
> > and patch the asm function for basic support.  It will be slower than
> > necessary, but maybe the relocation trick could be used on top of this
> > to redirect the call to whatever_func directly to the target for
> > architectures that want to squeeze out the last bit of performance.
> > This might actually be the best of all worlds: easy implementation on
> > all architectures, no inline asm, and the totally non-magical version
> > works with okay performance.
> >
> > (Is this what your code is doing?  I admit I didn't follow all the way
> > through all the macros.)
>
> Basically

Adding Josh Poimboeuf.

Here's a sketch of how this could work for better performance.  For a
static call "foo" that returns void and takes no arguments, the
generic implementation does something like this:

extern void foo(void);

struct static_call {
  void (*target)(void);

  /* arch-specific part containing an array of struct static_call_site */
};

void foo(void)
{
  READ_ONCE(__static_call_foo->target)();
}

Arch code overrides it to:

GLOBAL(foo)
  jmpq *__static_call_foo(%rip)
ENDPROC(foo)

and some extra asm to emit a static_call_site object saying that the
address "foo" is a jmp/call instruction where the operand is at offset
1 into the instruction.  (Or whatever the offset is.)

The patch code is like:

void set_static_call(struct static_call *call, void *target)
{
  /* take a spinlock? */
  WRITE_ONCE(call->target, target);
  arch_set_static_call(call, target);
}

and the arch code patches the call site if needed.

On x86, an even better implementation would have objtool make a bunch
of additional static_call_site objects for each call to foo, and
arch_set_static_call() would update all of them, too.  Using
text_poke_bp() if needed, and "if needed" can maybe be clever and
check the alignment of the instruction.  I admit that I never actually
remember the full rules for atomically patching an instruction on x86
SMP.  (Hmm.  This will be really epically slow.  Maybe we don't care.
Or we could finally optimize text_poke, etc to take a list of pokes to
do and do them as a batch.  But that's not a prerequisite for the rest
of this.)

What do you all think?
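
As a usage example on top of the sketch above (every name here comes from
that sketch or is assumed; nothing is an existing kernel API):

#include <asm/cpufeature.h>

/* Emitted next to foo()'s definition by the machinery sketched above. */
extern struct static_call *__static_call_foo;

extern void foo_avx2(void);     /* a faster implementation, assumed to exist */

static int __init foo_glue_init(void)
{
        /* Retarget the call once; every later call to foo() goes straight there. */
        if (boot_cpu_has(X86_FEATURE_AVX2))
                set_static_call(__static_call_foo, (void *)foo_avx2);
        return 0;
}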


Re: [RFC PATCH 1/9] kernel: add support for patchable function pointers

2018-10-05 Thread Andy Lutomirski
On Fri, Oct 5, 2018 at 10:38 AM Jason A. Donenfeld  wrote:
>
> On Fri, Oct 5, 2018 at 7:29 PM Andy Lutomirski  wrote:
> > (None of this is to say that I disagree with Jason, though -- I'm not
> > entirely convinced that this makes sense for Zinc.  But maybe it can
> > be done in a way that makes everyone happy.)
>
> Zinc indeed will continue to push in the simpler and more minimal
> direction. Down the line I'm open to trying and benching a few
> different ways of going about it with dynamic patching -- something
> that will be pretty easy to experiment with given the lean structure
> of Zinc -- but for the initial merge I intend to do it the way it is,
> which is super fast and pretty straightforward to follow.
>

I *think* the only change to Zinc per se would be that the calls like
chacha20_simd() would be static calls or patchable functions or
whatever we want to call them.  And there could be a debugfs to
override the default selection.

Ard, I don't think that sticking this in udev rules makes sense.  The
kernel has basically complete information as to what the right choice
is, and that will change over time as the implementation gets tuned,
and the udev rules will never get updated in sync.


Re: [RFC PATCH 1/9] kernel: add support for patchable function pointers

2018-10-05 Thread Andy Lutomirski
On Fri, Oct 5, 2018 at 10:23 AM Ard Biesheuvel
 wrote:
>
> On 5 October 2018 at 19:20, Andy Lutomirski  wrote:
> > On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel
> >  wrote:
> >>
> >> On 5 October 2018 at 18:58, Andy Lutomirski  wrote:
> >> > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel 
> >> >  wrote:
> >> >>
> >> >> On 5 October 2018 at 17:08, Andy Lutomirski  wrote:
> >> >> >
> >> >> >
> >> >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra  
> >> >> >> wrote:
> >> >> >>
> >> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> >> >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> >> >> >>> new file mode 100644
> >> >> >>> index ..8fc3b4c9b38f
> >> >> >>> --- /dev/null
> >> >> >>> +++ b/include/linux/ffp.h
> >> >> >>> @@ -0,0 +1,43 @@
> >> >> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >> >> >>> +
> >> >> >>> +#ifndef __LINUX_FFP_H
> >> >> >>> +#define __LINUX_FFP_H
> >> >> >>> +
> >> >> >>> +#include 
> >> >> >>> +#include 
> >> >> >>> +
> >> >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
> >> >> >>> +#include 
> >> >> >>> +#else
> >> >> >>> +
> >> >> >>> +struct ffp {
> >> >> >>> +void(**fn)(void);
> >> >> >>> +void(*default_fn)(void);
> >> >> >>> +};
> >> >> >>> +
> >> >> >>> +#define DECLARE_FFP(_fn, _def)\
> >> >> >>> +extern typeof(_def) *_fn;\
> >> >> >>> +extern struct ffp const __ffp_ ## _fn
> >> >> >>> +
> >> >> >>> +#define DEFINE_FFP(_fn, _def)\
> >> >> >>> +typeof(_def) *_fn = &_def;\
> >> >> >>> +struct ffp const __ffp_ ## _fn\
> >> >> >>> += { (void(**)(void))&_fn, (void(*)(void))&_def };\
> >> >> >>> +EXPORT_SYMBOL(__ffp_ ## _fn)
> >> >> >>> +
> >> >> >>> +static inline void ffp_set_target(const struct ffp *m, void 
> >> >> >>> *new_fn)
> >> >> >>> +{
> >> >> >>> +WRITE_ONCE(*m->fn, new_fn);
> >> >> >>> +}
> >> >> >>> +
> >> >> >>> +static inline void ffp_reset_target(const struct ffp *m)
> >> >> >>> +{
> >> >> >>> +WRITE_ONCE(*m->fn, m->default_fn);
> >> >> >>> +}
> >> >> >>> +
> >> >> >>> +#endif
> >> >> >>> +
> >> >> >>> +#define SET_FFP(_fn, _new)ffp_set_target(&__ffp_ ## _fn, _new)
> >> >> >>> +#define RESET_FFP(_fn)ffp_reset_target(&__ffp_ ## _fn)
> >> >> >>> +
> >> >> >>> +#endif
> >> >> >>
> >> >> >> I don't understand this interface. There is no wrapper for the call
> >> >> >> site, so how are we going to patch all call-sites when you update the
> >> >> >> target?
> >> >> >
> >> >> > I’m also confused.
> >> >> >
> >> >> > Anyway, we have patchable functions on x86. They’re called PVOPs, and 
> >> >> > they’re way overcomplicated.
> >> >> >
> >> >> > I’ve proposed a better way that should generate better code, be more 
> >> >> > portable, and be more maintainable.  It goes like this.
> >> >> >
> >> >> > To call the function, you literally just call  the default 
> >> >> > implementation.  It *might* be necessary to call a nonexistent 
> >> >> > wrapper to avoid annoying optimizations. At build time, the kernel is 
> >> >> > built with relocations, so the object files contain relocation 
> >> >> > entries for the call. W

Re: [RFC PATCH 0/9] patchable function pointers for pluggable crypto routines

2018-10-05 Thread Andy Lutomirski
On Fri, Oct 5, 2018 at 10:15 AM Ard Biesheuvel
 wrote:
>
> On 5 October 2018 at 15:37, Jason A. Donenfeld  wrote:
> ...
> > Therefore, I think this patch goes in exactly the wrong direction. I
> > mean, if you want to introduce dynamic patching as a means for making
> > the crypto API's dynamic dispatch stuff not as slow in a post-spectre
> > world, sure, go for it; that may very well be a good idea. But
> > presenting it as an alternative to Zinc very widely misses the point and
> > serves to prolong a series of bad design choices, which are now able to
> > be rectified by putting energy into Zinc instead.
> >
>
> This series has nothing to do with dynamic dispatch: the call sites
> call crypto functions using ordinary function calls (although my
> example uses CRC-T10DIF), and these calls are redirected via what is
> essentially a PLT entry, so that we can supersede those routines at
> runtime.

If you really want to do it PLT-style, then just do:

extern void whatever_func(args);

Call it like:
whatever_func(args here);

And rig up something to emit asm like:

GLOBAL(whatever_func)
  jmpq default_whatever_func
ENDPROC(whatever_func)

Architectures without support can instead do:

void whatever_func(args)
{
  READ_ONCE(patchable_function_struct_for_whatever_func->ptr)(args);
}

and patch the asm function for basic support.  It will be slower than
necessary, but maybe the relocation trick could be used on top of this
to redirect the call to whatever_func directly to the target for
architectures that want to squeeze out the last bit of performance.
This might actually be the best of all worlds: easy implementation on
all architectures, no inline asm, and the totally non-magical version
works with okay performance.

(Is this what your code is doing?  I admit I didn't follow all the way
through all the macros.)


Re: [RFC PATCH 1/9] kernel: add support for patchable function pointers

2018-10-05 Thread Andy Lutomirski
On Fri, Oct 5, 2018 at 10:11 AM Ard Biesheuvel
 wrote:
>
> On 5 October 2018 at 18:58, Andy Lutomirski  wrote:
> > On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel  
> > wrote:
> >>
> >> On 5 October 2018 at 17:08, Andy Lutomirski  wrote:
> >> >
> >> >
> >> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra  wrote:
> >> >>
> >> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> >> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> >> >>> new file mode 100644
> >> >>> index ..8fc3b4c9b38f
> >> >>> --- /dev/null
> >> >>> +++ b/include/linux/ffp.h
> >> >>> @@ -0,0 +1,43 @@
> >> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >> >>> +
> >> >>> +#ifndef __LINUX_FFP_H
> >> >>> +#define __LINUX_FFP_H
> >> >>> +
> >> >>> +#include 
> >> >>> +#include 
> >> >>> +
> >> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
> >> >>> +#include 
> >> >>> +#else
> >> >>> +
> >> >>> +struct ffp {
> >> >>> +void(**fn)(void);
> >> >>> +void(*default_fn)(void);
> >> >>> +};
> >> >>> +
> >> >>> +#define DECLARE_FFP(_fn, _def)\
> >> >>> +extern typeof(_def) *_fn;\
> >> >>> +extern struct ffp const __ffp_ ## _fn
> >> >>> +
> >> >>> +#define DEFINE_FFP(_fn, _def)\
> >> >>> +typeof(_def) *_fn = &_def;\
> >> >>> +struct ffp const __ffp_ ## _fn\
> >> >>> += { (void(**)(void))&_fn, (void(*)(void))&_def };\
> >> >>> +EXPORT_SYMBOL(__ffp_ ## _fn)
> >> >>> +
> >> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> >> >>> +{
> >> >>> +WRITE_ONCE(*m->fn, new_fn);
> >> >>> +}
> >> >>> +
> >> >>> +static inline void ffp_reset_target(const struct ffp *m)
> >> >>> +{
> >> >>> +WRITE_ONCE(*m->fn, m->default_fn);
> >> >>> +}
> >> >>> +
> >> >>> +#endif
> >> >>> +
> >> >>> +#define SET_FFP(_fn, _new)ffp_set_target(&__ffp_ ## _fn, _new)
> >> >>> +#define RESET_FFP(_fn)ffp_reset_target(&__ffp_ ## _fn)
> >> >>> +
> >> >>> +#endif
> >> >>
> >> >> I don't understand this interface. There is no wrapper for the call
> >> >> site, so how are we going to patch all call-sites when you update the
> >> >> target?
> >> >
> >> > I’m also confused.
> >> >
> >> > Anyway, we have patchable functions on x86. They’re called PVOPs, and 
> >> > they’re way overcomplicated.
> >> >
> >> > I’ve proposed a better way that should generate better code, be more 
> >> > portable, and be more maintainable.  It goes like this.
> >> >
> >> > To call the function, you literally just call  the default 
> >> > implementation.  It *might* be necessary to call a nonexistent wrapper 
> >> > to avoid annoying optimizations. At build time, the kernel is built with 
> >> > relocations, so the object files contain relocation entries for the 
> >> > call. We collect these entries into a table. If we’re using the 
> >> > “nonexistent wrapper” approach, we can link in a .S or linker script to 
> >> > alias them to the default implementation.
> >> >
> >> > To patch them, we just patch them. It can’t necessarily be done 
> >> > concurrently because nothing forces the right alignment. But we can do 
> >> > it at boot time and module load time. (Maybe we can patch at runtime on 
> >> > architectures with appropriate instruction alignment.  Or we ask gcc for 
> >> > an extension to align calls to a function.)
> >> >
> >> > Most of the machinery already exists: this is roughly how the module 
> >> > loader resolves calls outside of a module.
> >>
> >> Yeah nothing is ever simple on x86 :-(
> >>
&

Re: [RFC PATCH 1/9] kernel: add support for patchable function pointers

2018-10-05 Thread Andy Lutomirski
On Fri, Oct 5, 2018 at 8:24 AM Ard Biesheuvel  wrote:
>
> On 5 October 2018 at 17:08, Andy Lutomirski  wrote:
> >
> >
> >> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra  wrote:
> >>
> >>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
> >>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
> >>> new file mode 100644
> >>> index ..8fc3b4c9b38f
> >>> --- /dev/null
> >>> +++ b/include/linux/ffp.h
> >>> @@ -0,0 +1,43 @@
> >>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>> +
> >>> +#ifndef __LINUX_FFP_H
> >>> +#define __LINUX_FFP_H
> >>> +
> >>> +#include 
> >>> +#include 
> >>> +
> >>> +#ifdef CONFIG_HAVE_ARCH_FFP
> >>> +#include 
> >>> +#else
> >>> +
> >>> +struct ffp {
> >>> +void(**fn)(void);
> >>> +void(*default_fn)(void);
> >>> +};
> >>> +
> >>> +#define DECLARE_FFP(_fn, _def)\
> >>> +extern typeof(_def) *_fn;\
> >>> +extern struct ffp const __ffp_ ## _fn
> >>> +
> >>> +#define DEFINE_FFP(_fn, _def)\
> >>> +typeof(_def) *_fn = &_def;\
> >>> +struct ffp const __ffp_ ## _fn\
> >>> += { (void(**)(void))&_fn, (void(*)(void))&_def };\
> >>> +EXPORT_SYMBOL(__ffp_ ## _fn)
> >>> +
> >>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
> >>> +{
> >>> +WRITE_ONCE(*m->fn, new_fn);
> >>> +}
> >>> +
> >>> +static inline void ffp_reset_target(const struct ffp *m)
> >>> +{
> >>> +WRITE_ONCE(*m->fn, m->default_fn);
> >>> +}
> >>> +
> >>> +#endif
> >>> +
> >>> +#define SET_FFP(_fn, _new)ffp_set_target(&__ffp_ ## _fn, _new)
> >>> +#define RESET_FFP(_fn)ffp_reset_target(&__ffp_ ## _fn)
> >>> +
> >>> +#endif
> >>
> >> I don't understand this interface. There is no wrapper for the call
> >> site, so how are we going to patch all call-sites when you update the
> >> target?
> >
> > I’m also confused.
> >
> > Anyway, we have patchable functions on x86. They’re called PVOPs, and 
> > they’re way overcomplicated.
> >
> > I’ve proposed a better way that should generate better code, be more 
> > portable, and be more maintainable.  It goes like this.
> >
> > To call the function, you literally just call  the default implementation.  
> > It *might* be necessary to call a nonexistent wrapper to avoid annoying 
> > optimizations. At build time, the kernel is built with relocations, so the 
> > object files contain relocation entries for the call. We collect these 
> > entries into a table. If we’re using the “nonexistent wrapper” approach, we 
> > can link in a .S or linker script to alias them to the default 
> > implementation.
> >
> > To patch them, we just patch them. It can’t necessarily be done 
> > concurrently because nothing forces the right alignment. But we can do it 
> > at boot time and module load time. (Maybe we can patch at runtime on 
> > architectures with appropriate instruction alignment.  Or we ask gcc for an 
> > extension to align calls to a function.)
> >
> > Most of the machinery already exists: this is roughly how the module loader 
> > resolves calls outside of a module.
>
> Yeah nothing is ever simple on x86 :-(
>
> So are you saying the approach i use in patch #2 (which would
> translate to emitting a jmpq instruction pointing to the default
> implementation, and patching it at runtime to point elsewhere) would
> not fly on x86?

After getting some more sleep, I'm obviously wrong.  The
text_poke_bp() mechanism will work.  It's just really slow.

Let me try to summarize some of the issues.  First, when emitting
jumps and calls from inline asm on x86, there are a few considerations
that are annoying:

1. Following the x86_64 ABI calling conventions is basically
impossible.  x86_64 requires a 128-byte redzone and 16-byte stack
alignment.  After much discussion a while back, we decided that it was
flat-out impossible on current gcc to get the stack pointer aligned in
a known manner in an inline asm statement.  Instead, if we actually
need alignment, we need to align manually.  Fortunately, the kernel is
built w

Re: [RFC PATCH 1/9] kernel: add support for patchable function pointers

2018-10-05 Thread Andy Lutomirski



> On Oct 5, 2018, at 7:14 AM, Peter Zijlstra  wrote:
> 
>> On Fri, Oct 05, 2018 at 10:13:25AM +0200, Ard Biesheuvel wrote:
>> diff --git a/include/linux/ffp.h b/include/linux/ffp.h
>> new file mode 100644
>> index ..8fc3b4c9b38f
>> --- /dev/null
>> +++ b/include/linux/ffp.h
>> @@ -0,0 +1,43 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef __LINUX_FFP_H
>> +#define __LINUX_FFP_H
>> +
>> +#include 
>> +#include 
>> +
>> +#ifdef CONFIG_HAVE_ARCH_FFP
>> +#include 
>> +#else
>> +
>> +struct ffp {
>> +void(**fn)(void);
>> +void(*default_fn)(void);
>> +};
>> +
>> +#define DECLARE_FFP(_fn, _def)\
>> +extern typeof(_def) *_fn;\
>> +extern struct ffp const __ffp_ ## _fn
>> +
>> +#define DEFINE_FFP(_fn, _def)\
>> +typeof(_def) *_fn = &_def;\
>> +struct ffp const __ffp_ ## _fn\
>> += { (void(**)(void))&_fn, (void(*)(void))&_def };\
>> +EXPORT_SYMBOL(__ffp_ ## _fn)
>> +
>> +static inline void ffp_set_target(const struct ffp *m, void *new_fn)
>> +{
>> +WRITE_ONCE(*m->fn, new_fn);
>> +}
>> +
>> +static inline void ffp_reset_target(const struct ffp *m)
>> +{
>> +WRITE_ONCE(*m->fn, m->default_fn);
>> +}
>> +
>> +#endif
>> +
>> +#define SET_FFP(_fn, _new)ffp_set_target(&__ffp_ ## _fn, _new)
>> +#define RESET_FFP(_fn)ffp_reset_target(&__ffp_ ## _fn)
>> +
>> +#endif
> 
> I don't understand this interface. There is no wrapper for the call
> site, so how are we going to patch all call-sites when you update the
> target?

I’m also confused.

Anyway, we have patchable functions on x86. They’re called PVOPs, and they’re 
way overcomplicated.

I’ve proposed a better way that should generate better code, be more portable, 
and be more maintainable.  It goes like this.

To call the function, you literally just call  the default implementation.  It 
*might* be necessary to call a nonexistent wrapper to avoid annoying 
optimizations. At build time, the kernel is built with relocations, so the 
object files contain relocation entries for the call. We collect these entries 
into a table. If we’re using the “nonexistent wrapper” approach, we can link in 
a .S or linker script to alias them to the default implementation.

To patch them, we just patch them. It can’t necessarily be done concurrently 
because nothing forces the right alignment. But we can do it at boot time and 
module load time. (Maybe we can patch at runtime on architectures with 
appropriate instruction alignment.  Or we ask gcc for an extension to align 
calls to a function.)

Most of the machinery already exists: this is roughly how the module loader 
resolves calls outside of a module.

Re: [PATCH net-next v6 01/23] asm: simd context helper API

2018-09-29 Thread Andy Lutomirski



> On Sep 29, 2018, at 9:20 PM, Joe Perches  wrote:
> 
>> On Fri, 2018-09-28 at 16:01 +0200, Jason A. Donenfeld wrote:
>> On Fri, Sep 28, 2018 at 4:00 PM Ard Biesheuvel
>>  wrote:
>>> 
 On 28 September 2018 at 15:59, Jason A. Donenfeld  wrote:
 On Fri, Sep 28, 2018 at 3:58 PM Ard Biesheuvel
  wrote:
> 
>> On 28 September 2018 at 15:47, Jason A. Donenfeld  
>> wrote:
>> On Fri, Sep 28, 2018 at 10:49 AM Ard Biesheuvel
>>  wrote:
> +typedef enum {
> +   HAVE_NO_SIMD = 1 << 0,
> +   HAVE_FULL_SIMD = 1 << 1,
> +   HAVE_SIMD_IN_USE = 1 << 31
> +} simd_context_t;
> +
>>> 
>>> Oh, and another thing (and I'm surprised checkpatch.pl didn't complain
>>> about it): the use of typedef in new code is strongly discouraged.
>>> This policy predates my involvement, so perhaps Joe can elaborate on
>>> the rationale?
>> 
>> In case it matters, the motivation for making this a typedef is I
>> could imagine this at some point turning into a more complicated
>> struct on certain platforms and that would make refactoring easier. I
>> could just make it `struct simd_context` now with 1 member though...
> 
> Yes that makes sense
 
 The rationale for it being a typedef or moving to a struct now?
>>> 
>>> Yes just switch to a struct.
>> 
>> Okay. No problem with that, but will wait to hear from Joe first.
> 
> Why do you need to hear from me again?
> 
> As far as I know, the only info about typedef avoidance are in
> Documentation/process/coding-style.rst section 5.
> 
> 

I personally prefer it with the typedef. If this were my code, I’d say the 
coding style is silly for opaque tiny structs like this.

Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Andy Lutomirski



> On Sep 27, 2018, at 8:19 AM, Jason A. Donenfeld  wrote:
> 
> Hey again Thomas,
> 
>> On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld  wrote:
>> 
>> Hi Thomas,
>> 
>> I'm trying to optimize this for crypto performance while still taking
>> into account preemption concerns. I'm having a bit of trouble figuring
>> out a way to determine numerically what the upper bounds for this
>> stuff looks like. I'm sure I could pick a pretty sane number that's
>> arguably okay -- and way under the limit -- but I still am interested
>> in determining what that limit actually is. I was hoping there'd be a
>> debugging option called, "warn if preemption is disabled for too
>> long", or something, but I couldn't find anything like that. I'm also
>> not quite sure what the latency limits are, to just compute this with
>> a formula. Essentially what I'm trying to determine is:
>> 
>> preempt_disable();
>> asm volatile(".fill N, 1, 0x90;");
>> preempt_enable();
>> 
>> What is the maximum value of N for which the above is okay? What
>> technique would you generally use in measuring this?
>> 
>> Thanks,
>> Jason
> 
> From talking to Peter (now CC'd) on IRC, it sounds like what you're
> mostly interested in is clocktime latency on reasonable hardware, with
> a goal of around ~20µs as a maximum upper bound? I don't expect to get
> anywhere near this value at all, but if you can confirm that's a
> decent ballpark, it would make for some interesting calculations.
> 
> 

I would add another consideration: if you can get better latency with 
negligible overhead (0.1%? 0.05%), then that might make sense too. For example, 
it seems plausible that checking need_resched() every few blocks adds basically 
no overhead, and the SIMD helpers could do this themselves or perhaps only ever 
do a block at a time.

need_resched() costs a cacheline access, but it’s usually a hot cacheline, and 
the actual check is just whether a certain bit in memory is set.
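
Concretely, the shape I have in mind is something like this (hypothetical
sketch, not Zinc's actual code; the one-block helpers are made up):

static void chacha20_simd(struct chacha20_ctx *state, u8 *dst, const u8 *src,
                          u32 len, simd_context_t *simd_context)
{
        while (len >= CHACHA20_BLOCK_SIZE) {
                chacha20_one_block_simd(state, dst, src);
                dst += CHACHA20_BLOCK_SIZE;
                src += CHACHA20_BLOCK_SIZE;
                len -= CHACHA20_BLOCK_SIZE;
                simd_relax(simd_context);  /* mostly just the need_resched() test */
        }
        if (len)
                chacha20_partial_block_simd(state, dst, src, len);
}

The relax call could just as well go every N blocks rather than every block if
even the cheap check ever shows up in profiles.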

Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-26 Thread Andy Lutomirski



> On Sep 26, 2018, at 10:03 AM, Jason A. Donenfeld  wrote:
> 
>> On Wed, Sep 26, 2018 at 6:21 PM Andy Lutomirski  wrote:
>> Ard, is what you’re saying that the Zinc chacha20 functions should call 
>> simd_relax() every n bytes automatically for some reasonable value of n?  If 
>> so, seems sensible, except that some care might be needed to make sure they 
>> interact with preemption correctly.
>> 
>> What I mean is: the public Zinc entry points should either be callable in an 
>> atomic context or they should not be.  I think this should be checked at 
>> runtime in an appropriate place with an __might_sleep or similar.  Or 
>> simd_relax should learn to *not* schedule if the result of preempt_enable() 
>> leaves it atomic. (And the latter needs to be done in a way that works even 
>> on non-preempt kernels, and I don’t remember whether that’s possible.). And 
>> this should happen regardless of how many bytes are processed. IOW, calling 
>> into Zinc should be equally not atomic-safe for 100 bytes and for 10 MB.
> 
> I'm not sure this is actually a problem. Namely:
> 
> preempt_disable();
> kernel_fpu_begin();
> kernel_fpu_end();
> schedule(); <--- bug!
> 
> Calling kernel_fpu_end() disables preemption, but AFAIK, preemption
> enabling/disabling is recursive, so kernel_fpu_end's use of
> preempt_disable won't actually do anything until the outer preempt
> enable is called:
> 
> preempt_disable();
> kernel_fpu_begin();
> kernel_fpu_end();
> preempt_enable();
> schedule(); <--- works!
> 
> Or am I missing some more subtle point?
> 

No, I think you’re right. I was misremembering precisely how simd_relax() 
worked.

Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-26 Thread Andy Lutomirski



> On Sep 26, 2018, at 7:02 AM, Ard Biesheuvel  wrote:
> 
> (+ Herbert, Thomas)
> 
>> On Wed, 26 Sep 2018 at 15:33, Jason A. Donenfeld  wrote:
>> 
>> Hi Ard,
>> .
> 
>> And if it becomes one,
>> this is something we can address *later*, but certainly there's no use
>> of adding additional complexity to the initial patchset to do this
>> now.
>> 
> 
> You are introducing a very useful SIMD abstraction, but it lets code
> run with preemption disabled for unbounded amounts of time, and so now
> is the time to ensure we get it right.
> 
> Part of the [justified] criticism on the current state of the crypto
> API is on its complexity, and so I don't think it makes sense to keep
> it simple now and add the complexity later (and the same concern
> applies to async support btw).

Ard, is what you’re saying that the Zinc chacha20 functions should call 
simd_relax() every n bytes automatically for some reasonable value of n?  If 
so, seems sensible, except that some care might be needed to make sure they 
interact with preemption correctly.

What I mean is: the public Zinc entry points should either be callable in an 
atomic context or they should not be.  I think this should be checked at 
runtime in an appropriate place with an __might_sleep or similar.  Or 
simd_relax should learn to *not* schedule if the result of preempt_enable() 
leaves it atomic. (And the latter needs to be done in a way that works even on 
non-preempt kernels, and I don’t remember whether that’s possible.). And this 
should happen regardless of how many bytes are processed. IOW, calling into 
Zinc should be equally not atomic-safe for 100 bytes and for 10 MB.
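
Concretely, I mean something like this (hypothetical sketch; HAVE_SIMD_IN_USE,
simd_get() and simd_put() are the helpers discussed earlier in the thread):

static inline void simd_relax(simd_context_t *ctx)
{
        if ((*ctx & HAVE_SIMD_IN_USE) && need_resched()) {
                simd_put(*ctx);         /* ends the FPU section, re-enables preemption */
                *ctx = simd_get();      /* the scheduler can run in between */
        }
}

plus a might_sleep()-style annotation (or a preemptible() check) in the public
entry points, so that calling them from atomic context is either rejected
loudly or degrades safely to never rescheduling.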

As for async, ISTM a really good WireGuard accelerator would expose a different 
interface than crypto API supports, and it probably makes sense to wait for 
such hardware to show up before figuring out how to use it.  And no matter what 
form it takes, I don’t think it should complicate the basic Zinc crypto entry 
points.

Re: [PATCH net-next v6 02/23] zinc: introduce minimal cryptography library

2018-09-25 Thread Andy Lutomirski



> On Sep 25, 2018, at 12:43 PM, Jason A. Donenfeld  wrote:
> 
>> On Tue, Sep 25, 2018 at 8:33 PM Joe Perches  wrote:
>> I think the -Dpr_fmt is especially odd and not
>> really acceptable as it not used anywhere else
>> in the kernel.
> 
> There are about 2000 cases in the kernel of the same '#define
> pr_fmt...' being pasted into the top of the file, which seems a bit
> cumbersome. Rather than having to paste that into each and every file
> that I pr_err from, why can't I just do this from the makefile, since
> I want that same pr_fmt to copy the whole directory?

Because people like to be able to just read the C file to figure out what it 
does. Or we could adopt the Makefile approach kernel-wide, since this use of it 
seems reasonable.
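
For comparison, the conventional per-file version, which at least keeps the
prefix visible right where the messages are emitted (sketch; the prefix string
is just the one the Makefile above passes with -D):

/* Must be defined before the first printk.h include, hence its spot at the top. */
#define pr_fmt(fmt) "zinc: " fmt

#include <linux/kernel.h>

        ...
        pr_err("self-test %d: FAIL\n", i);  /* prints "zinc: self-test 0: FAIL" */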

Re: [PATCH net-next v5 02/20] zinc: introduce minimal cryptography library

2018-09-20 Thread Andy Lutomirski



> On Sep 20, 2018, at 9:30 PM, Ard Biesheuvel  wrote:
> 
>> On 20 September 2018 at 21:15, Jason A. Donenfeld  wrote:
>> Hi Andy,
>> 
>>> On Fri, Sep 21, 2018 at 5:23 AM Andy Lutomirski  wrote:
>>> At the risk on suggesting something awful: on x86_64, since we turn 
>>> preemption off for simd, it wouldn’t be *completely* insane to do the 
>>> crypto on the irq stack. It would look like:
>>> 
>>> kernel_fpu_call(func, arg);
>>> 
>>> And this helper would disable preemption, enable FPU, switch to the irq 
>>> stack, call func(arg), disable FPU, enable preemption, and return. And we 
>>> can have large IRQ stacks.
>>> 
>>> I refuse to touch this with a ten-foot pole until the lazy FPU restore 
>>> patches land.
>> 
>> Haha. That's fun, and maybe we'll do that at some point, but I have
>> some other reasons too for being on a workqueue now.
>> 
> 
> Kernel mode crypto is callable from any context, and SIMD can be used
> in softirq context on arm64 (and on x86, even from hardirq context
> IIRC if the interrupt is taken from userland), in which case we'd
> already be on the irq stack.

The x86_64 irq stack handles nesting already.


Re: [PATCH net-next v5 02/20] zinc: introduce minimal cryptography library

2018-09-20 Thread Andy Lutomirski


> On Sep 20, 2018, at 8:12 PM, Andrew Lunn  wrote:
> 
>> On Fri, Sep 21, 2018 at 02:11:43AM +0200, Jason A. Donenfeld wrote:
>> Hey Arnd,
>> 
>>> On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann  wrote:
>>> Right, if you hit a stack requirement like this, it's usually the compiler
>>> doing something bad, not just using too much stack but also generating
>>> rather slow object code in the process. It's better to fix the bug by
>>> optimizing the code to not spill registers to the stack.
>>> 
>>> In the long run, I'd like to reduce the stack frame size further, so
>>> best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
>>> (on 64-bit) is a bug in the code, and stay below that.
>>> 
>>> For prototyping, you can just mark the broken functions individually
>>> by setting the warning limit for a specific function that is known to
>>> be misoptimized by the compiler (with a comment about which compiler
>>> and architectures are affected), but not override the limit for the
>>> entire file.
>> 
>> Thanks for the explanation. Fortunately in my case, the issues were
>> trivially fixable to get it under 1024/1280. (By the way, why does
>> 64-bit have a slightly larger stack frame? To account for 32 pointers
>> taking double the space or something?) That will be rectified in v6.
> 
> Hi Jason
> 
> Do you any stack usage information?
> 
> A VPN can be at the bottom of some deep stack calls. Swap on NFS over
> the VPN? If you have one frame of 1K, you might be O.K. But if you
> have a few of these, i can see there might be issues of overflowing
> the stack.
> 
>  

At the risk on suggesting something awful: on x86_64, since we turn preemption 
off for simd, it wouldn’t be *completely* insane to do the crypto on the irq 
stack. It would look like:

kernel_fpu_call(func, arg);

And this helper would disable preemption, enable FPU, switch to the irq stack, 
call func(arg), disable FPU, enable preemption, and return. And we can have 
large IRQ stacks.
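
Roughly like this (hypothetical sketch; the switch to the IRQ stack needs a
blob of arch asm, which is elided here):

#include <linux/preempt.h>
#include <asm/fpu/api.h>

static void kernel_fpu_call(void (*func)(void *), void *arg)
{
        preempt_disable();
        kernel_fpu_begin();     /* on x86 this also disables preemption itself */
        /* arch asm to switch %rsp to the per-cpu IRQ stack would go here */
        func(arg);
        /* ...and asm to switch back would go here */
        kernel_fpu_end();
        preempt_enable();
}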

I refuse to touch this with a ten-foot pole until the lazy FPU restore patches 
land.

All that being said, why are these frames so large?  It sounds like something 
may be spilling that ought not to.

Re: [PATCH net-next v5 02/20] zinc: introduce minimal cryptography library

2018-09-20 Thread Andy Lutomirski



> On Sep 20, 2018, at 8:41 AM, Ard Biesheuvel  wrote:
> 
> (+ Arnd, Eric)
> 
> On 18 September 2018 at 09:16, Jason A. Donenfeld  wrote:
> ...
> 
>> diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
>> new file mode 100644
>> index ..83dfd63988c0
>> --- /dev/null
>> +++ b/lib/zinc/Makefile
>> @@ -0,0 +1,4 @@
> 
> Apologies for not spotting these before:
> 
>> +ccflags-y := -O3
> 
> -O3 optimization has been problematic in the past, at least on x86 but
> I think on other architectures as well. Please stick with -O2.
> 
>> +ccflags-y += -Wframe-larger-than=$(if (CONFIG_KASAN),16384,8192)
> 
> There is no way we can support code in the kernel with that kind of
> stack space requirements. I will let Arnd comment on what we typically
> allow, since he deals with such issues on a regular basis.

To make matters worse, KASAN is incompatible with VMAP_STACK right now, and 
KASAN is not so good at detecting stack overflow.

> 
>> +ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
>> +ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
>> --
>> 2.19.0
>> 


Re: [PATCH net-next v5 07/20] zinc: Poly1305 generic C implementations and selftest

2018-09-18 Thread Andy Lutomirski


> On Sep 18, 2018, at 6:35 PM, Jason A. Donenfeld  wrote:
> 
>> On Wed, Sep 19, 2018 at 2:50 AM Eric Biggers  wrote:
>> Hardcoding the 'input' array to 600 bytes forces the full amount of space to 
>> be
>> reserved in the kernel image for every test vector.  Also, if anyone adds a
>> longer test vector they will need to remember to increase the value.
>> 
>> It should be a const pointer instead, like the test vectors in 
>> crypto/testmgr.h.
> 
> I know. The agony. This has been really annoying me. I originally did
> it the right way, but removed it last week, when I noticed that gcc
> failed to put it in the initconst section:
> 
> https://git.zx2c4.com/WireGuard/commit/?id=f4698d20f13946afc6ce99e98685ba3f9adc4474
> 
> Even changing the (u8[]){ ... } into a (const u8[]){ ... } or even
> into a const string literal does not do the trick. It makes it into
> the constant data section with const, but it does not make it into the
> initconst section. What a bummer.
> 
> I went asking about this on the gcc mailing list, to see if there was
> just some aspect of C that I had overlooked:
> https://gcc.gnu.org/ml/gcc/2018-09/msg00043.html

Can you not uglify the code a bit by using normal (non-compound) literals as 
described in the response to that email?



Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-17 Thread Andy Lutomirski



> On Sep 17, 2018, at 9:16 AM, Jason A. Donenfeld  wrote:
> 
>> On Mon, Sep 17, 2018 at 6:14 PM Andy Lutomirski  wrote:
>> Indeed.  What I'm saying is that you shouldn't refactor it this way
>> because it will be slow.  I agree it would be conceptually nice to be
>> able to blacklist a chacha20_x86_64 module to disable the asm, but I
>> think it would be very hard to get good performance.
> 
> I hadn't understood your nosimd=1 command line suggestion the first
> time through, but now I see what you were after. This would be really
> easy to add. And I can do it for v5 if you want. But I'm kind of loath
> to add too much stuff to the initial patchset. Do you think this is an
> important feature to have for it? Or should I leave it for later?

I think it’s fine for later. It’s potentially useful for benchmarking and 
debugging.

Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-17 Thread Andy Lutomirski
On Mon, Sep 17, 2018 at 8:32 AM Jason A. Donenfeld  wrote:
>
> On Mon, Sep 17, 2018 at 4:52 PM Andy Lutomirski  wrote:
> > I think the module organization needs to change. It needs to be possible to 
> > have chacha20 built in but AES or whatever as a module.
>
> Okay, I'll do that for v5.
>
> > I might have agreed before Spectre :(. Unfortunately, unless we do some 
> > magic, I think the code would look something like:
> >
> > if (static_branch_likely(have_simd)) arch_chacha20();
> >
> > ...where arch_chacha20 is a *pointer*. And that will generate a retpoline 
> > and run very, very slowly.  (I just rewrote some of the x86 entry code to 
> > eliminate one retpoline. I got a 5% speedup on some tests according to the 
> > kbuild bot.)
>
> Actually, the way it works now benefits from the compilers inliner and
> the branch predictor. I benchmarked this without any retpoline
> slowdowns, and the branch predictor becomes correct pretty much all
> the time. We can tinker with this after the initial merge, if you
> really want, but avoiding function pointers and instead using ordinary
> branches really winds up being quite fast.

Indeed.  What I'm saying is that you shouldn't refactor it this way
because it will be slow.  I agree it would be conceptually nice to be
able to blacklist a chacha20_x86_64 module to disable the asm, but I
think it would be very hard to get good performance.

--Andy


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-17 Thread Andy Lutomirski
> On Sep 17, 2018, at 8:28 AM, Jason A. Donenfeld  wrote:
>
> On Mon, Sep 17, 2018 at 4:52 PM Andy Lutomirski  wrote:
>>> * (Nit) The GCC command line -include'd .h files contain variable and
>>> function definitions so they are actually .c files.
>>
>> Hmm. I would suggest just getting rid of the -include magic entirely.  The 
>> resulting ifdef will be more comprehensible.
>
> I really don't think so, actually. The way the -include stuff works
> now is that the glue code is inlined in the same place that the
> assembly object file is added to the build object list, so it gels
> together cleanly, as the thing is defined and set in one single place.
> I could go back to the ifdefs - and even make them as clean as
> possible - but I think that puts more things in more places and is
> therefore more confusing. The -include system now works extremely
> well.

Is it really better than:

#ifdef CONFIG_X86_64
#include "whatever"
#endif

It seems more obfuscated than needed to put the equivalent of that
into the Makefile, and I don't think people really like searching
through the Makefile to figure out why the code does what it does.


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-17 Thread Andy Lutomirski


> On Sep 16, 2018, at 9:45 PM, David Miller  wrote:
> 
> From: Andy Lutomirski 
> Date: Sun, 16 Sep 2018 21:09:11 -0700
> 
>> CRYPTO API
>> M:  Herbert Xu 
>> M:  "David S. Miller" 
>> L:  linux-crypto@vger.kernel.org
>> 
>> Herbert hasn't replied to any of these submissions.  You're the other
>> maintainer :)
> 
> Herbert is the primary crypto maintainer, I haven't done a serious
> review of crypto code in ages.
> 
> So yes, Herbert review is what is important here.

Would you accept Ard’s (and/or Eric Biggers’, perhaps, as an alternative)?  Or, 
if you really want Herbert’s review, can you ask him to review it?  (Hi 
Herbert!)

Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-17 Thread Andy Lutomirski




> On Sep 16, 2018, at 10:07 PM, Jason A. Donenfeld  wrote:
> 
> Hey Andy,
> 
> Thanks a lot for your feedback.
> 
>> On Mon, Sep 17, 2018 at 6:09 AM Andy Lutomirski  wrote:
>> 1. Zinc conflates the addition of a new API with the replacement of
>> some algorithm implementations.  This is problematic.  Look at the
>> recent benchmarks of ipsec before and after this series.  Apparently
>> big packets get faster and small packets get slower.  It would be
>> really nice to bisect the series to narrow down *where* the regression
>> came from, but, as currently structured, you can't.
>> 
>> The right way to do this is to rearrange the series.  First, the new
>> Zinc APIs should be added, and they should be backed with the
>> *existing* crypto code.  (If the code needs to be moved or copied to a
>> new location, so be it.  The patch will be messy because somehow the
>> Zinc API is going to have to dispatch to the arch-specific code, and
>> the way that the crypto API handles it is not exactly friendly to this
>> type of use.  So be it.)  Then another patch should switch the crypto
>> API to use the Zinc interface.  That patch, *by itself*, can be
>> benchmarked.  If it causes a regression for small ipsec packets, then
>> it can be tracked down relatively easily.  Once this is all done, the
>> actual crypto implementation can be changed, and that changed can be
>> reviewed on its own merits.
> 
> That ipsec regression was less related to the implementation and more
> related to calling kernel_fpu_begin() unnecessarily, something I've
> now fixed. So I'm not sure that's such a good example. However, I can
> try to implement Zinc over the existing assembly (Martin's and Ard's),
> first, as you've described. This will be a pretty large amount of
> work, but if you think it's worth it for the commit history, then I'll
> do it.

Ard, what do you think?  I think it would
be nice, but if the authors of that assembly are convinced it should be 
replaced, then this step is optional IMO.

> 
>> 2. The new Zinc crypto implementations look like they're brand new.  I
>> realize that they have some history, some of them are derived from
>> OpenSSL, etc, but none of this is really apparent in the patches
>> themselves.
> 
> The whole point of going with these is that they _aren't_ brand new,
> yet they are very fast. Eyeballs and fuzzer hours are important, and
> AndyP's seems to get the most eyeballs and fuzzer hours, generally.
> 
>> it would be nice if
>> the patches made it more clear how the code differs from its origin.
>> At the very least, though, if the replacement of the crypto code were,
>> as above, a patch that just replaced the crypto code, it would be much
>> easier to review and benchmark intelligently.
> 
> You seem to have replied to the v3 thread, not the v4 thread. I've
> already started to include lots of detail about the origins of the
> code and [any] important differences in v4, and I'll continue to add
> more detail for v5.

This is indeed better.  Ard’s reply covers this better.


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-17 Thread Andy Lutomirski



> On Sep 16, 2018, at 10:26 PM, Ard Biesheuvel  
> wrote:

> 
> As far as I can tell (i.e., as a user not a network dev), WireGuard is
> an excellent piece of code, and I would like to see it merged. I also
> think there is little disagreement about the quality of the proposed
> algorithm implementations and the usefulness of having a set of easy
> to use solid crypto primitives in addition to or complementing the
> current crypto API.
> 
> I do have some concerns over how the code is organized though:
> 
> * simd_relax() is currently not called by the crypto routines
> themselves. This means that the worst case scheduling latency is
> unbounded, which is unacceptable for the -rt kernel. The worst case
> scheduling latency should never be proportional to the input size.
> (Apologies for not spotting that earlier)
> 
> * Using a cute name for the crypto library [that will end up being the
> preferred choice for casual use only] may confuse people, given that
> we have lots of code in crypto/ already. I'd much prefer using, e.g.,
> crypto/lib and crypto/api (regardless of where the arch specific
> pieces live)
> 
> * I'd prefer the arch specific pieces to live under arch, but I can
> live with keeping them in a single place, as long as the arch
> maintainers have some kind of jurisdiction over them. I also think
> there should be some overlap between the maintainership
> responsibilities of the two generic pieces (api and lib).
> 
> * (Nit) The GCC command line -include'd .h files contain variable and
> function definitions so they are actually .c files.

Hmm. I would suggest just getting rid of the -include magic entirely.  The 
resulting ifdef will be more comprehensible.


> * The current organization of the code puts all available (for the
> arch) versions of all routines into a single module, which can only be
> built in once we update random.c to use Zinc's chacha20 routines. This
> bloats the core kernel (which is a huge deal for embedded systems that
> have very strict boot requirements). It also makes it impossible to
> simply blacklist a module if you, for instance, prefer to use the
> [potentially more power efficient] scalar code over the SIMD code when
> using a distro kernel.

I think the module organization needs to change. It needs to be possible to 
have chacha20 built in but AES or whatever as a module.

> 
> [To summarize the 4 points above, I'd much rather see a more
> conventional organization where different parts are provided by
> different modules. I don't think the argument that inlining is needed
> for performance is actually valid, given that we have branch
> predictors and static keys, and the arch SIMD code is built as
> separate object files anyway]

I might have agreed before Spectre :(. Unfortunately, unless we do some magic, 
I think the code would look something like:

if (static_branch_likely(have_simd)) arch_chacha20();

...where arch_chacha20 is a *pointer*. And that will generate a retpoline and 
run very, very slowly.  (I just rewrote some of the x86 entry code to eliminate 
one retpoline. I got a 5% speedup on some tests according to the kbuild bot.)

So, if we really wanted modules, we’d need a new dynamic patching mechanism.

I would suggest instead adding two boot (or debugfs) options:

simd=off: disables simd_get using a static branch.

crypto_chacha20_nosimd: Does what it sounds like.
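
Sketched out, with chacha20_arch()/chacha20_generic() as in Eric's earlier
example and everything else hypothetical:

#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/string.h>

static DEFINE_STATIC_KEY_TRUE(zinc_simd_enabled);

static int __init simd_param(char *str)
{
        if (!strcmp(str, "off"))
                static_branch_disable(&zinc_simd_enabled);
        return 1;
}
__setup("simd=", simd_param);

void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
              bool have_simd)
{
        if (!(static_branch_likely(&zinc_simd_enabled) &&
              chacha20_arch(dst, src, len, state->key, state->counter,
                            have_simd)))
                chacha20_generic(dst, src, len, state->key, state->counter);

        state->counter[0] += (len + 63) / 64;
}

The static key keeps the disabled case to a single patched-out branch, and the
arch implementation is still reached by a direct call, so no retpoline.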


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-16 Thread Andy Lutomirski
On Tue, Sep 11, 2018 at 4:57 PM David Miller  wrote:
>
> From: Andrew Lunn 
> Date: Wed, 12 Sep 2018 01:30:15 +0200
>
> > Just as an FYI:
> >
> > 1) I don't think anybody in netdev has taken a serious look at the
> > network code yet. There is little point until the controversial part
> > of the code, Zinc, has been sorted out.
> >
> > 2) I personally would be surprised if DaveM took this code without
> > having an Acked-by from the crypto subsystem people. In the same way,
> > i doubt the crypto people would take an Ethernet driver without having
> > DaveM's Acked-by.
>
> Both of Andrew's statements are completely true.
>
> I'm not looking at any of the networking bits until the crypto stuff
> is fully sorted and fully supported and Ack'd by crypto folks.

So, as a process question, whom exactly are we waiting for:

CRYPTO API
M:  Herbert Xu 
M:  "David S. Miller" 
L:  linux-crypto@vger.kernel.org

Herbert hasn't replied to any of these submissions.  You're the other
maintainer :)

To the extent that you (DaveM) want my ack, here's what I think of the
series so far:

The new APIs to the crypto primitives are good.  For code that wants
to do a specific known crypto operation, they are much, much more
pleasant to use than the existing crypto API.  The code cleanups in
the big keys patch speak for themselves IMO.  I have no problem with
the addition of a brand-new API to the kernel, especially when it's a
nice one like Zinc's, even if that API starts out with only a couple
of users.

Zinc's arrangement of arch code is just fine.  Sure, there are
arguments that putting arch-specific code in arch/ is better, but this
is mostly bikeshedding IMO.

There has been some discussion of the exact best way to handle
simd_relax() and some other minor nitpicks of API details.  This kind
of stuff doesn't need to block the series -- it can always be reworked
down the road if needed.

There are two things I don't like right now, though:

1. Zinc conflates the addition of a new API with the replacement of
some algorithm implementations.  This is problematic.  Look at the
recent benchmarks of ipsec before and after this series.  Apparently
big packets get faster and small packets get slower.  It would be
really nice to bisect the series to narrow down *where* the regression
came from, but, as currently structured, you can't.

The right way to do this is to rearrange the series.  First, the new
Zinc APIs should be added, and they should be backed with the
*existing* crypto code.  (If the code needs to be moved or copied to a
new location, so be it.  The patch will be messy because somehow the
Zinc API is going to have to dispatch to the arch-specific code, and
the way that the crypto API handles it is not exactly friendly to this
type of use.  So be it.)  Then another patch should switch the crypto
API to use the Zinc interface.  That patch, *by itself*, can be
benchmarked.  If it causes a regression for small ipsec packets, then
it can be tracked down relatively easily.  Once this is all done, the
actual crypto implementation can be changed, and that changed can be
reviewed on its own merits.

As a simplistic example, I wrote this code a while back:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=e9e12f056f2abed50a30b762db9185799f5864e6

and its two parents.  This series added a more Zinc-like API to
SHA256.  And it did it without replacing the SHA256 implementation.
Doing this for Zinc would be a bit more complicated, since the arch
code would need to be invoked too, but it should be doable.

FWIW, Wireguard should not actually depend on the replacement of the
crypto implementation.

2. The new Zinc crypto implementations look like they're brand new.  I
realize that they have some history, some of them are derived from
OpenSSL, etc, but none of this is really apparent in the patches
themselves.  It would be great if the kernel could literally share the
same code as something like OpenSSL, since OpenSSL gets much more
attention than the kernel's crypto.  Failing that, it would be nice if
the patches made it more clear how the code differs from its origin.
At the very least, though, if the replacement of the crypto code were,
as above, a patch that just replaced the crypto code, it would be much
easier to review and benchmark intelligently.

--Andy


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-13 Thread Andy Lutomirski


> On Sep 12, 2018, at 11:39 PM, Milan Broz  wrote:
> 
>> On 13/09/18 01:45, Andy Lutomirski wrote:
>> On Wed, Sep 12, 2018 at 3:56 PM, Ard Biesheuvel
> ... 
>> b) Crypto that is used dynamically.  This includes dm-crypt
>> (aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
>> lot of IPSEC stuff, possibly KCM, and probably many more.  These will
>> get comparatively little benefit from being converted to a zinc-like
>> interface.  For some of these cases, it wouldn't make any sense at all
>> to convert them.  Certainly the ones that do async hardware crypto
>> using DMA engines will never look at all like zinc, even under the
>> hood.
> 
> Please note, that dm-crypt now uses not only block ciphers and modes,
> but also authenticated encryption and hashes (for ESSIV and HMAC
> in authenticated composed modes) and RNG (for random IV).
> We use crypto API, including async variants (I hope correctly :)

Right. And all this is why I don’t think dm-crypt should use zinc, at least not 
any time soon.


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-12 Thread Andy Lutomirski
On Wed, Sep 12, 2018 at 3:56 PM, Ard Biesheuvel
 wrote:
> I realize you've put a lot of good and hard work into the existing
> I am also concerned about your claim that all software algorithms will
> be moved into this crypto library. You are not specific about whose
> responsibility it will be that this is going to happen in a timely
> fashion. But more importantly, it is not clear at all how you expect
> this to work for, e.g., h/w instruction based SHAxxx or AES in various
> chaining modes, which should be used only on cores that implement
> those instructions (note that on arm64, we have optional instructions
> for AES, PMULL, SHA1, SHA256, SHA512, SHA3, SM3 and SM4). Are all
> those implementations (only few of which will be used on a certain
> core) going to be part of the monolithic library? What are the APIs
> going to look like for block ciphers, taking chaining modes into
> account?

I'm not convinced that there's any real need for *all* crypto
algorithms to move into lib/zinc or to move at all.  As I see it,
there are two classes of crypto algorithms in the kernel:

a) Crypto that is used by code that chooses its algorithm statically
and wants synchronous operations.  These include everything in
drivers/char/random.c, but also a bunch of various networking things
that are hardcoded and basically everything that uses stack buffers.
(This means it includes all the code that I broke when I did
VMAP_STACK.  Sigh.)

b) Crypto that is used dynamically.  This includes dm-crypt
(aes-xts-plain64, aes-cbc-essiv, etc), all the ALG_IF interfaces, a
lot of IPSEC stuff, possibly KCM, and probably many more.  These will
get comparatively little benefit from being converted to a zinc-like
interface.  For some of these cases, it wouldn't make any sense at all
to convert them.  Certainly the ones that do async hardware crypto
using DMA engines will never look at all like zinc, even under the
hood.

I think that, as a short-term goal, it makes a lot of sense to have
implementations of the crypto that *new* kernel code (like Wireguard)
wants to use in style (a) that live in /lib, and it obviously makes
sense to consolidate their implementations with the crypto/
implementations in a timely manner.  As a medium-term goal, adding
more algorithms as needed for things that could use the simpler APIs
(Bluetooth, perhaps) would make sense.

But I see no reason at all that /lib should ever contain a grab-bag of
crypto implementations just for the heck of it.  They should have real
in-kernel users IMO.  And this means that there will probably always
be some crypto implementations in crypto/ for things like aes-xts.

--Andy


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-11 Thread Andy Lutomirski



> On Sep 11, 2018, at 3:18 PM, Jason A. Donenfeld  wrote:
> 
>> On Tue, Sep 11, 2018 at 4:16 PM Andy Lutomirski  wrote:
>> Jason, can you do one of these conversions as an example?
> 
> My preference is really to leave that to a different patch series. But
> if you think I *must*, then I shall.
> 
> 

I think Ard’s point is valid: in the long run we don’t want two competing 
software implementations of each primitive. It clearly *should* be possible to 
make the crypto API call into zinc for synchronous software operations, but a 
demonstration of how this actually works, and that there isn’t some change to 
zinc needed to make it work well, would be in order, I think.

IMO the right approach is do one conversion right away and save the rest for 
later.

Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-11 Thread Andy Lutomirski



> On Sep 11, 2018, at 2:47 PM, Eric Biggers  wrote:
> 
>> On Tue, Sep 11, 2018 at 04:56:24PM +0200, Greg Kroah-Hartman wrote:
>> On Tue, Sep 11, 2018 at 12:08:56PM +0200, Ard Biesheuvel wrote:
 As Zinc is simply library code, its config options are un-menued, with
 the exception of CONFIG_ZINC_DEBUG, which enables various selftests and
 BUG_ONs.
 
>>> 
>>> In spite of the wall of text, you fail to point out exactly why the
>>> existing AEAD API in unsuitable, and why fixing it is not an option.
>>> 
>>> As I pointed out in a previous version, I don't think we need a
>>> separate crypto API/library in the kernel, and I don't think you have
>>> convinced anyone else yet either.
>> 
>> Um, then why do people keep sprinkling new crypto/hash code all around
>> the kernel tree?  It's because what we have as a crypto api is complex
>> and is hard to use for many in-kernel users.
>> 
>> Something like this new interface (zinc) is a much "saner" api for
>> writing kernel code that has to interact with crypto/hash primitives.
>> 
>> I see no reason why the existing crypto code can be redone to use the
>> zinc crypto primitives over time, making there only be one main location
>> for the crypto algorithms.  But to do it the other way around is pretty
>> much impossible given the complexities in the existing api that has been
>> created over time.
>> 
>> Not to say that the existing api is not a viable one, but ugh, really?
>> You have to admit it is a pain to try to use in any "normal" type of
>> "here's a bytestream, go give me a hash" type of method, right?
>> 
>> Also there is the added benefit that the crypto primitives here have
>> been audited by a number of people (so Jason stated), and they are
>> written in a way that the crypto community can more easily interact and
>> contribute to.  Which is _way_ better than what we have today.
>> 
>> So this gets my "stamp of approval" for whatever it is worth :)
>> 
> 
> I think you mean you see no reason why it *cannot* be converted?  The
> conversions definitely *should* be done, just like how some of the existing
> crypto API algorithms like SHA-256 already wrap implementations in lib/.  In 
> my
> view, lib/zinc/ isn't fundamentally different from what we already have for 
> some
> algorithms.  So it's misguided to design/present it as some novel thing, which
> unfortunately this patchset still does to a large extent.  (The actual new 
> thing
> is how architecture-specific implementations are handled.)
> 
> Of course, the real problem is that even after multiple revisions of this
> patchset, there's still no actual conversions of the existing crypto API
> algorithms over to use the new implementations.  "Zinc" is still completely
> separate from the existing crypto API.
> 

Jason, can you do one of these conversions as an example?

> So, it's not yet clear that the conversions will actually work out without
> problems that would require changes in "Zinc".  I don't think it makes sense 
> to
> merge all this stuff without doing the conversions, or at the very least
> demonstrating how they will be done.
> 
> In particular, in its current form "Zinc" is useless for anyone that needs the
> existing crypto API.  For example, for HPolyC,
> (https://lkml.org/lkml/2018/8/6/857), I need to make improvements to ChaCha 
> and
> Poly1305 in the existing crypto API, e.g. to add support for XChaCha and
> NEON-accelerated Poly1305.  Having completely separate ChaCha and Poly1305
> implementations in Zinc doesn't help at all.  If anything, it makes things
> harder because people will have to review/maintain both sets of 
> implementations;
> and when trying to make the improvements I need, I'll find myself in the 
> middle
> of a holy war between two competing camps who each have their own opinion 
> about
> The Right Way To Do Crypto, and their own crypto implementations and APIs in 
> the
> kernel.
> 
> - Eric


Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

2018-08-09 Thread Andy Lutomirski
On Tue, Aug 7, 2018 at 6:51 PM, Jason A. Donenfeld  wrote:
> On Tue, Aug 7, 2018 at 6:49 PM Andy Lutomirski  wrote:
>> I really wish we had a way to say that we use asm-generic’s copy of a header 
>> in all cases except where an arch opts out.
>
> It's really not that hard to do -- symlink asm-generic to a target
> called "asm" inside an otherwise empty directory, and add that
> otherwise empty directory to the -I paths just after arch/include.
> Since it's searched second, it's only used if the first fails. Maybe
> I'm missing something though, as this seems a bit too obvious. Perhaps
> a project for another day.

The problem here (I think) is that it's preferable for stray files in
the kernel tree to have no effect and, with this scheme, stray files
in arch/*/include/asm could affect the build.  So something explicit
has an advantage.  I just think there should be a way to say "this
asm-generic file should affect all arches unless they generic-n or
override-generic-y it or whatever".


Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

2018-08-07 Thread Andy Lutomirski



> On Aug 7, 2018, at 4:48 PM, Jason A. Donenfeld  wrote:
> 
> Hey Andy,
> 
>> On Tue, Aug 7, 2018 at 12:43 PM Andy Lutomirski  wrote:
>> For "zinc: add simd helper", I think it should be in include/linux,
>> and include/linux/simd.h should (immediately or maybe in the future)
>> include  to pick up arch-specific stuff.  And the patch
>> should get sent to linux-a...@vger.kernel.org.
> 
> I guess you saw my prompt about that in the previous commit message?
> Based on your encouragement, I implemented it:
> https://git.zx2c4.com/linux-dev/commit/?h=simd This is _far_ more
> invasive than I wanted to be, as I don't want this patch submission to
> grow unwieldy and never be merged, but I guess we can roll with this
> for now...
> 

I really wish we had a way to say that we use asm-generic’s copy of a header in 
all cases except where an arch opts out.

>> In your blake2s_arch() implementation, you're not passing in a
>> simd_context_t.  Is that still a work in progress?  I thought the plan
>> was to pass it in rather than doing the check in the _arch()
>> functions.
> 
> I'm inclined to do the explicit context passing only when a function
> is likely to be used in that kind of environment, and adjust as
> needed. Long term, anyway, that API will be removed once the x86 guys
> figure out lazy FPU restoration and the amortization doesn't add
> anything.

Fair enough.

> 
> Jason


Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

2018-08-07 Thread Andy Lutomirski
On Tue, Aug 7, 2018 at 11:54 AM, Jason A. Donenfeld  wrote:
> Hey Andy, Eric, & all,
>
> I've started the work of separating this out into 16 individual
> commits, have addressed numerous other things brought up like the
> ifdef maze, and have begun rewriting (parts of) the original commit
> message. It's still a work in progress, and I still have some work to
> do, but if you want to follow along, things are happening here:
>
> https://git.zx2c4.com/linux-dev/log/?h=jd/wireguard
>
> I'll be rebasing and reworking this continuously, but that's how it's
> shaping up.
>
> As I'm still working on it, I won't be submitting v2 today, but if you
> have suggestions or concerns or want to poke me while I'm working on
> v2, don't hesitate to send me private email or ping me in IRC (I'm
> "zx2c4" there) to chat.
>
> Regards,
> Jason

For "zinc: add simd helper", I think it should be in include/linux,
and include/linux/simd.h should (immediately or maybe in the future)
include  to pick up arch-specific stuff.  And the patch
should get sent to linux-a...@vger.kernel.org.

In your blake2s_arch() implementation, you're not passing in a
simd_context_t.  Is that still a work in progress?  I thought the plan
was to pass it in rather than doing the check in the _arch()
functions.


Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

2018-08-03 Thread Andy Lutomirski
On Thu, Aug 2, 2018 at 7:48 PM, Jason A. Donenfeld  wrote:
> Hey Andy,
>
> Thanks too for the feedback. Responses below:
>
> On Wed, Aug 1, 2018 at 7:09 PM Andy Lutomirski  wrote:
>> > I think the above changes would also naturally lead to a much saner
>> > patch series where each algorithm is added by its own patch, rather than
>> > one monster patch that adds many algorithms and 24000 lines of code.
>> >
>>
>> Yes, please.
>
> Ack, will be in v2.
>
>
>> I like this a *lot*.  (But why are you passing have_simd?  Shouldn't
>> that check live in chacha20_arch?  If there's some init code needed,
>> then chacha20_arch() should just return false before the init code
>> runs.  Once the arch does whatever feature detection it needs, it can
>> make chacha20_arch() start returning true.)
>
> The have_simd stuff is so that the FPU state can be amortized across
> several calls to the crypto functions. Here's a snippet from
> WireGuard's send.c:
>
> void packet_encrypt_worker(struct work_struct *work)
> {
> struct crypt_queue *queue = container_of(work, struct
> multicore_worker, work)->ptr;
> struct sk_buff *first, *skb, *next;
> bool have_simd = simd_get();

Gotcha.  That was very hidden in the 24k lines.  Please make this (and
any similar goodies) be their own patches.

Also, please consider spelling it differently:

simd_context_t simd_context = simd_get();

Because we'll feel very silly the first time some architecture has
more than one possible state.  (It wouldn't be entirely insane for x86
to distinguish between "no SIMD", "XMM only", and "go to town!", for
example.)
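
For example, something like this (purely hypothetical, just to show why the
opaque type is worth keeping):

typedef enum {
        HAVE_NO_SIMD = 0,       /* kernel_fpu_begin()/NEON not usable in this context */
        HAVE_XMM_ONLY,          /* only legacy SSE state may be touched */
        HAVE_FULL_SIMD,         /* AVX/AVX-512 (or full NEON) is fair game */
} simd_context_t;

Callers that already spell it "simd_context_t simd_context = simd_get();"
wouldn't need to change at all.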


Re: [PATCH v1 2/3] zinc: Introduce minimal cryptography library

2018-08-01 Thread Andy Lutomirski
[I reordered the snippets below for readability.]

> On Aug 1, 2018, at 12:22 AM, Eric Biggers  wrote:
> In general this is great work, and I'm very excited for WireGuard to be
> upstreamed!  But for the new crypto code, I think a few things are on
> the wrong track, for example treating it is a special library.  Even the
> name is contradicting itself: Zinc is "not crypto/", yet as you stated
> it's intended that the "Zinc" algorithms will be exposed through the
> crypto API -- just like how most of the existing crypto code in lib/ is
> also exposed through the crypto API.  So, I think that what you're doing
> isn't actually *that* different from what already exists in some cases;
> and pretending that it is very different is just going to cause
> problems.  Rather, the actual truly new thing seems to be that the
> dispatch to architecture specific implementations is done at the lib/
> level instead of handled by the crypto API priority numbers.

...

>
>
> I think the above changes would also naturally lead to a much saner
> patch series where each algorithm is added by its own patch, rather than
> one monster patch that adds many algorithms and 24000 lines of code.
>

Yes, please.


>
> As for doing the architecture-specific dispatch in lib/ rather than
> through the crypto API, there definitely are some arguments in favor of
> it.  The main problem, though, is that your code is a mess due to all
> the #ifdefs, and it will only get worse as people add more
> architectures.  You may think you already added all the architectures
> that matter, but tomorrow people will come and want to add PowerPC,
> RISC-V, etc.  I really think you should consider splitting up
> implementations by architecture; this would *not*, however, preclude the
> implementations from still being accessed through a single top-level
> "glue" function.  For example chacha20() could look like:
>
> void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
>  bool have_simd)
> {
>if (chacha20_arch(dst, src, len, state->key, state->counter, have_simd))
>goto out;
>
>chacha20_generic(dst, src, len, state->key, state->counter);
>
> out:
>state->counter[0] += (len + 63) / 64;
> }

I like this a *lot*.  (But why are you passing have_simd?  Shouldn't
that check live in chacha20_arch?  If there's some init code needed,
then chacha20_arch() should just return false before the init code
runs.  Once the arch does whatever feature detection it needs, it can
make chacha20_arch() start returning true.)
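
In other words, something like this (hypothetical sketch; the feature flag,
the initcall, and the chacha20_ssse3() symbol are made up for illustration):

#include <linux/init.h>
#include <asm/cpufeature.h>
#include <asm/fpu/api.h>

static bool chacha20_use_simd __ro_after_init;

static int __init chacha20_arch_init(void)
{
        chacha20_use_simd = boot_cpu_has(X86_FEATURE_SSSE3);
        return 0;
}
core_initcall(chacha20_arch_init);

static bool chacha20_arch(u8 *dst, const u8 *src, u32 len,
                          const u32 key[8], const u32 counter[4])
{
        if (!chacha20_use_simd || !irq_fpu_usable())
                return false;   /* init hasn't run yet, or SIMD is off limits here */

        kernel_fpu_begin();
        chacha20_ssse3(dst, src, len, key, counter);
        kernel_fpu_end();
        return true;
}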

As I see it, there are two truly new things in the zinc patchset:
the direct (in the direct call sense) arch dispatch, and the fact that
the functions can be called directly, without allocating contexts,
using function pointers, etc.

In fact, I had a previous patch set that added such an interface for SHA256.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=8c59a4dd8b7ba4f2e5a6461132bbd16c83ff7c1f

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=crypto/sha256_bpf&id=7e5fbc02972b03727b71bc71f84175c36cbf01f5

>
> Your patch description is also missing any mention of crypto accelerator
> hardware.  Quite a bit of the complexity in the crypto API, such as
> scatterlist support and asynchronous execution, exists because it
> supports crypto accelerators.  AFAICS your new APIs cannot support
> crypto accelerators, as your APIs are synchronous and operate on virtual
> addresses.  I assume your justification is that "djb algorithms" like
> ChaCha and Poly1305 don't need crypto accelerators as they are fast in
> software.  But you never explicitly stated this and discussed the
> tradeoffs.  Since this is basically the foundation for the design you've
> chosen, it really needs to be addressed.

I see this as an advantage, not a disadvantage.  A very large majority
of in-kernel crypto users (by number of call sites under a *very*
brief survey, not by number of CPU cycles) just want to do some
synchronous crypto on a buffer that is addressed by a regular pointer.
Most of these users would be slowed down if they used any form of
async crypto, since the CPU can complete the whole operation faster
than it could plausibly initiate and complete anything asynchronous.
And, right now, they suffer the full overhead of allocating a context
(often with alloca!), looking up (or caching) some crypto API data
structures, dispatching the operation, and cleaning up.

So I think the right way to do it is to have directly callable
functions like zinc uses and to have the fancy crypto API layer on top
of them.  So if you actually want async accelerated crypto with
scatterlists or whatever, you can call into the fancy API, and the
fancy API can dispatch to hardware or it can dispatch to the normal
static API.

In fact, this is exactly what my patch above did.  But Jason's work is
more complete than mine, and mine wasn't really done because it had
some configury issues.

Re: [lkp-robot] [x86/kconfig] 81d3871900: BUG:unable_to_handle_kernel

2017-10-13 Thread Andy Lutomirski
On Fri, Oct 13, 2017 at 12:09 PM, Linus Torvalds
 wrote:
> On Fri, Oct 13, 2017 at 6:56 AM, Andrey Ryabinin
>  wrote:
>>
>> This could be fixed by s/vmovdqa/vmovdqu change like bellow, but maybe the 
>> right fix
>> would be to align the data properly?
>
> I suspect anything that has the SHA extensions should also do
> unaligned loads efficiently. The whole "aligned only" model is broken.
> It's just doing two loads from the state pointer, there's likely no
> point in trying to align it.
>
> So your patch looks fine, but maybe somebody could add the required
> alignment to the sha256 context allocation (which I don't know where
> it is).

IIRC if we try the latter, then we'll risk hitting the #*!&@% gcc bug
that mostly prevents 16-byte alignment from working on GCC before 4.8
or so.  That way lies debugging disasters.


Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned

2017-10-03 Thread Andy Lutomirski
On Mon, Oct 2, 2017 at 10:26 PM, Herbert Xu  wrote:
> On Mon, Oct 02, 2017 at 09:18:24PM -0700, Andy Lutomirski wrote:
>> > On Oct 2, 2017, at 7:25 PM, Jia-Ju Bai  wrote:
>> >
>> > The SCTP program may sleep under a spinlock, and the function call path is:
>> > sctp_generate_t3_rtx_event (acquire the spinlock)
>> >  sctp_do_sm
>> >sctp_side_effects
>> >  sctp_cmd_interpreter
>> >sctp_make_init_ack
>> >  sctp_pack_cookie
>> >crypto_shash_setkey
>> >  shash_setkey_unaligned
>> >kmalloc(GFP_KERNEL)
>>
>> I'm going to go out on a limb here: why on Earth is our crypto API so
>> full of indirection that we allocate memory at all here?
>
> The crypto API operates on a one key per-tfm basis.  So normally
> tfm allocation and key setting is done once only and not done on
> the data path.
>
> I have looked at the SCTP code and it appears to fit this paradigm.
> That is, we should be able to allocate the tfm and set the key when
> the key is actually generated via get_random_bytes, rather than every
> time the key is used which is not only a waste but as you see runs
> into API issues.

It's a waste because it loses a pre-computation advantage.

The fact that it has memory allocation issues is crypto API's fault,
full stop.  There is no legit reason to need to allocate anything.


Re: [PATCH V2] Fix a sleep-in-atomic bug in shash_setkey_unaligned

2017-10-02 Thread Andy Lutomirski
> On Oct 2, 2017, at 7:25 PM, Jia-Ju Bai  wrote:
>
> The SCTP program may sleep under a spinlock, and the function call path is:
> sctp_generate_t3_rtx_event (acquire the spinlock)
>  sctp_do_sm
>sctp_side_effects
>  sctp_cmd_interpreter
>sctp_make_init_ack
>  sctp_pack_cookie
>crypto_shash_setkey
>  shash_setkey_unaligned
>kmalloc(GFP_KERNEL)
>

I'm going to go out on a limb here: why on Earth is our crypto API so
full of indirection that we allocate memory at all here?

We're synchronously computing a hash of a small amount of data using
either HMAC-SHA1 or HMAC-SHA256 (determined at runtime) if I read it
right.  There's a sane way to do this that doesn't need kmalloc,
alloca, or fancy indirection.  And then there's crypto_shash_xyz().

--Andy, who is sick of seeing stupid bugs caused by the fact that it's
basically impossible to use the crypto API in a sane way.

P.S. gnulib has:

int hmac_sha256 (const void *key, size_t keylen, const void *in,
size_t inlen, void *resbuf);

An init/update/final-style API would be nice, but something like this
would be a phenomenal improvement over what we have.
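
To be concrete, even a declaration-level sketch like this (names hypothetical)
would do; note there is nothing to allocate and nothing that can fail with
-ENOMEM:

struct hmac_sha256_ctx;         /* plain fixed-size struct (definition elided), fine on the stack */

void hmac_sha256_init(struct hmac_sha256_ctx *ctx,
                      const void *key, size_t keylen);
void hmac_sha256_update(struct hmac_sha256_ctx *ctx,
                        const void *in, size_t inlen);
void hmac_sha256_final(struct hmac_sha256_ctx *ctx, u8 out[32]);

/* and the gnulib-style one-shot becomes a trivial wrapper around those three */
int hmac_sha256(const void *key, size_t keylen,
                const void *in, size_t inlen, void *resbuf);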


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Andy Lutomirski
On Thu, Jan 12, 2017 at 7:11 PM, Josh Poimboeuf  wrote:
> On Thu, Jan 12, 2017 at 05:46:55PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 12, 2017 at 12:15 PM, Josh Poimboeuf  wrote:
>> > On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
>> >> On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
>> >>  wrote:
>> >> > On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf  
>> >> > wrote:
>> >> >>
>> >> >> Just to clarify, I think you're asking if, for versions of gcc which
>> >> >> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
>> >> >> functions to ensure their stacks are 16-byte aligned.
>> >> >>
>> >> >> It's certainly possible, but I don't see how that solves the problem.
>> >> >> The stack will still be misaligned by entry code.  Or am I missing
>> >> >> something?
>> >> >
>> >> > I think the argument is that we *could* try to align things, if we
>> >> > just had some tool that actually then verified that we aren't missing
>> >> > anything.
>> >> >
>> >> > I'm not entirely happy with checking the generated code, though,
>> >> > because as Ingo says, you have a 50:50 chance of just getting it right
>> >> > by mistake. So I'd much rather have some static tool that checks
>> >> > things at a code level (ie coccinelle or sparse).
>> >>
>> >> What I meant was checking the entry code to see if it aligns stack
>> >> frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
>> >> alignment for real may actually be entirely a lost cause.  After all,
>> >> I think we have some inline functions that do asm volatile ("call
>> >> ..."), and I don't see any credible way of forcing alignment short of
>> >> generating an entirely new stack frame and aligning that.
>> >
>> > Actually we already found all such cases and fixed them by forcing a new
>> > stack frame, thanks to objtool.  For example, see 55a76b59b5fe.
>>
>> What I mean is: what guarantees that the stack is properly aligned for
>> the subroutine call?  gcc promises to set up a stack frame, but does
>> it promise that rsp will be properly aligned to call a C function?
>
> Yes, I did an experiment and you're right.  I had naively assumed that
> all stack frames would be aligned.

Just to check: did you do your experiment with -mpreferred-stack-boundary=4?

--Andy


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Andy Lutomirski
On Thu, Jan 12, 2017 at 12:15 PM, Josh Poimboeuf  wrote:
> On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
>> On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
>>  wrote:
>> > On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf  
>> > wrote:
>> >>
>> >> Just to clarify, I think you're asking if, for versions of gcc which
>> >> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
>> >> functions to ensure their stacks are 16-byte aligned.
>> >>
>> >> It's certainly possible, but I don't see how that solves the problem.
>> >> The stack will still be misaligned by entry code.  Or am I missing
>> >> something?
>> >
>> > I think the argument is that we *could* try to align things, if we
>> > just had some tool that actually then verified that we aren't missing
>> > anything.
>> >
>> > I'm not entirely happy with checking the generated code, though,
>> > because as Ingo says, you have a 50:50 chance of just getting it right
>> > by mistake. So I'd much rather have some static tool that checks
>> > things at a code level (ie coccinelle or sparse).
>>
>> What I meant was checking the entry code to see if it aligns stack
>> frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
>> alignment for real may actually be entirely a lost cause.  After all,
>> I think we have some inline functions that do asm volatile ("call
>> ..."), and I don't see any credible way of forcing alignment short of
>> generating an entirely new stack frame and aligning that.
>
> Actually we already found all such cases and fixed them by forcing a new
> stack frame, thanks to objtool.  For example, see 55a76b59b5fe.

What I mean is: what guarantees that the stack is properly aligned for
the subroutine call?  gcc promises to set up a stack frame, but does
it promise that rsp will be properly aligned to call a C function?


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Andy Lutomirski
On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
 wrote:
> On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf  wrote:
>>
>> Just to clarify, I think you're asking if, for versions of gcc which
>> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
>> functions to ensure their stacks are 16-byte aligned.
>>
>> It's certainly possible, but I don't see how that solves the problem.
>> The stack will still be misaligned by entry code.  Or am I missing
>> something?
>
> I think the argument is that we *could* try to align things, if we
> just had some tool that actually then verified that we aren't missing
> anything.
>
> I'm not entirely happy with checking the generated code, though,
> because as Ingo says, you have a 50:50 chance of just getting it right
> by mistake. So I'd much rather have some static tool that checks
> things at a code level (ie coccinelle or sparse).

What I meant was checking the entry code to see if it aligns stack
frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
alignment for real may actually be entirely a lost cause.  After all,
I think we have some inline functions that do asm volatile ("call
..."), and I don't see any credible way of forcing alignment short of
generating an entirely new stack frame and aligning that.  Ick.  This
whole situation stinks, and I wish that the gcc developers had been
less daft here in the first place or that we'd noticed and gotten it
fixed much longer ago.

Can we come up with a macro like STACK_ALIGN_16 that turns into
__aligned__(32) on bad gcc versions and combine that with your sparse
patch?
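
Something along these lines, perhaps -- just a sketch, with the name and
the 4.8 cutoff as placeholders, leaning on the kernel's existing
GCC_VERSION macro:

/*
 * Hypothetical sketch only: gcc can realign the stack to any boundary
 * on x86-64 except 16, so over-align to 32 on the bad compilers and
 * the buffer still ends up 16-byte aligned either way.
 */
#if GCC_VERSION < 40800
# define STACK_ALIGN_16	__aligned(32)
#else
# define STACK_ALIGN_16	__aligned(16)
#endif

/* usage on a stack buffer that SSE code will load from: */
u32 state[16] STACK_ALIGN_16;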


Re: x86-64: Maintain 16-byte stack alignment

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 11:05 PM, Herbert Xu
 wrote:
> On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
>>
>> I'm pretty sure we have random asm code that may not maintain a
>> 16-byte stack alignment when it calls other code (including, in some
>> cases, calling C code).
>>
>> So I'm not at all convinced that this is a good idea. We shouldn't
>> expect 16-byte alignment to be something trustworthy.
>
> So what if we audited all the x86 assembly code to fix this? Would
> it then be acceptable to do a 16-byte aligned stack?
>
> On the face of it it doesn't seem to be a huge amount of code
> assuming they mostly live under arch/x86.

The problem is that we have nasties like TRACE_IRQS_OFF.  Performance
doesn't really matter for these macros, so we could probably rig up a
helper to forcibly align the stack there.  Maybe
FRAME_BEGIN_FORCE_ALIGN?  I also think I'd rather not modify
pt_regs.  We should just fix the small number of code paths that
create a pt_regs and then call into C code to align the stack.

But if we can't do this with automatic verification, then I'm not sure
I want to do it at all.  The asm is already more precarious than I'd
like, and having a code path that is misaligned is asking for obscure
bugs down the road.

--Andy


Re: [PATCH v2 8/8] crypto/testmgr: Allocate only the required output size for hash tests

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 11:47 PM, Herbert Xu
 wrote:
> Andy Lutomirski  wrote:
>> There are some hashes (e.g. sha224) that have some internal trickery
>> to make sure that only the correct number of output bytes are
>> generated.  If something goes wrong, they could potentially overrun
>> the output buffer.
>>
>> Make the test more robust by allocating only enough space for the
>> correct output size so that memory debugging will catch the error if
>> the output is overrun.
>>
>> Tested by intentionally breaking sha224 to output all 256
>> internally-generated bits while running on KASAN.
>>
>> Cc: Ard Biesheuvel 
>> Cc: Herbert Xu 
>> Signed-off-by: Andy Lutomirski 
>
> This patch doesn't seem to depend on anything else in the series.
> Do you want me to take it separately?

Yes, please.  Its only relation to the rest of the series is that I
wanted to make sure that I didn't mess up sha224's finalization code.

--Andy


Re: x86-64: Maintain 16-byte stack alignment

2017-01-11 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 10:01 PM, Andy Lutomirski  wrote:
> On Tue, Jan 10, 2017 at 8:35 PM, Herbert Xu  
> wrote:
>> On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
>>>
>>> That said, I do think that the "don't assume stack alignment, do it by
>>> hand" may be the safer thing. Because who knows what the random rules
>>> will be on other architectures.
>>
>> Sure we can ban the use of attribute aligned on stacks.  But
>> what about indirect uses through structures?  For example, if
>> someone does
>>
>> struct foo {
>> } __attribute__ ((__aligned__(16)));
>>
>> int bar(...)
>> {
>> struct foo f;
>>
>> return baz(&f);
>> }
>>
>> then baz will end up with an unaligned argument.  The worst part
>> is that it is not at all obvious to the person writing the function
>> bar.
>
> Linus, I'm starting to lean toward agreeing with Herbert here, except
> that we should consider making it conditional on having a silly GCC
> version.  After all, the silly GCC versions are wasting space and time
> with alignment instructions no matter what we do, so this would just
> mean tweaking the asm and adding some kind of check_stack_alignment()
> helper to throw out a WARN_ONCE() if we miss one.  The problem with
> making it conditional is that making pt_regs effectively live at a
> variable offset from %rsp is just nasty.

So actually doing this is gross because we have calls from asm to C
all over the place.  But... maybe we can automate all the testing.
Josh, how hard would it be to teach objtool to (if requested by an
option) check that stack frames with statically known size preserve
16-byte stack alignment?

I find it rather annoying that gcc before 4.8 malfunctions when it
sees __aligned__(16) on x86_64 kernels.  Sigh.

--Andy


Re: x86-64: Maintain 16-byte stack alignment

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 12:09 AM, Herbert Xu
 wrote:
> On Wed, Jan 11, 2017 at 08:06:54AM +, Ard Biesheuvel wrote:
>>
>> Couldn't we update the __aligned(x) macro to emit 32 if arch == x86
>> and x == 16? All other cases should work just fine afaict
>
> Not everyone uses that macro.  You'd also need to add some checks
> to stop people from using the gcc __attribute__ directly.
>

You'd also have to audit things to make sure that __aligned__(16)
isn't being used for non-stack purposes.  After all, __aligned__(16)
in static data is fine, and it's also fine as a promise to GCC that
some object is 16-byte aligned.

--Andy


Re: [PATCH v2 7/8] net: Rename TCA*BPF_DIGEST to ..._SHA256

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 1:09 AM, Daniel Borkmann  wrote:
> Hi Andy,
>
> On 01/11/2017 04:11 AM, Andy Lutomirski wrote:
>>
>> On Tue, Jan 10, 2017 at 4:50 PM, Daniel Borkmann 
>> wrote:
>>>
>>> On 01/11/2017 12:24 AM, Andy Lutomirski wrote:
>>>>
>>>>
>>>> This makes it easier to add another digest algorithm down the road if
>>>> needed.  It also serves to force any programs that might have been
>>>> written against a kernel that had the old field name to notice the
>>>> change and make any necessary changes.
>>>>
>>>> This shouldn't violate any stable API policies, as no released kernel
>>>> has ever had TCA*BPF_DIGEST.
>>>
>>>
>>> Imho, this and patch 6/8 are not really needed. Should there ever
>>> another digest alg be used (doubt it), then you'd need a new nl
>>> attribute and fdinfo line anyway to keep existing stuff intact.
>>> Nobody made the claim that you can just change this underneath
>>> and not respecting abi for existing applications when I read from
>>> above that such apps now will get "forced" to notice a change.
>>
>>
>> Fair enough.  I was more concerned about prerelease iproute2 versions,
>> but maybe that's a nonissue.  I'll drop these two patches.
>
>
> Ok. Sleeping over this a bit, how about a general rename into
> "prog_tag" for fdinfo and TCA_BPF_TAG resp. TCA_ACT_BPF_TAG for
> the netlink attributes, fwiw, it might reduce any assumptions on
> this being made? If this would be preferable, I could cook that
> patch against -net for renaming it?

That would be fine with me.

I think there are two reasonable approaches to computing the actual tag.

1. Use a standard, modern cryptographic hash.  SHA-256, SHA-512,
Blake2b, whatever.  SHA-1 is a bad choice in part because it's partly
broken and in part because the implementation in lib/ is a real mess
to use (as you noticed while writing the code).

2. Use whatever algorithm you like but make the tag so short that it's
obviously not collision-free.  48 or 64 bits is probably reasonable.

The intermediate versions are just asking for trouble.  Alexei wants
to make the tag shorter, but I admit I still don't understand why he
prefers that over using a better crypto hash and letting user code
truncate the tag if it wants.
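
Truncating on the user side really is a one-liner.  A hypothetical
user-space sketch, assuming the kernel reports the full 64-hex-character
sha256 tag in fdinfo:

#include <stdio.h>

/* Keep only the first 64 bits (16 hex chars) of the full tag. */
static void short_tag(const char *full_hex, char out[17])
{
	snprintf(out, 17, "%.16s", full_hex);
}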


Re: [PATCH v2 8/8] crypto/testmgr: Allocate only the required output size for hash tests

2017-01-11 Thread Andy Lutomirski
On Wed, Jan 11, 2017 at 7:13 AM, David Laight  wrote:
> From: Andy Lutomirski
>> Sent: 10 January 2017 23:25
>> There are some hashes (e.g. sha224) that have some internal trickery
>> to make sure that only the correct number of output bytes are
>> generated.  If something goes wrong, they could potentially overrun
>> the output buffer.
>>
>> Make the test more robust by allocating only enough space for the
>> correct output size so that memory debugging will catch the error if
>> the output is overrun.
>
> Might be better to test this by allocating an overlong buffer
> and then explicitly checking that the output hasn't overrun
> the allowed space.
>
> If nothing else the error message will be clearer.

I thought about that, but then I'd have to figure out what poison
value to use.  Both KASAN and the usual slab debugging are quite good
at this kind of checking, and KASAN will even point you to the
problematic code directly.
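
For reference, the overlong-buffer variant David describes would look
roughly like this -- a sketch only, with an arbitrary pad size and
poison byte:

/*
 * Hypothetical sketch: the caller allocates digest_size + pad bytes,
 * memset()s the tail to 0x5a before hashing, and checks afterwards
 * that the hash never wrote past digest_size.
 */
static bool result_tail_clean(const u8 *result, size_t digest_size,
			      size_t pad)
{
	return memchr_inv(result + digest_size, 0x5a, pad) == NULL;
}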

--Andy


-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: x86-64: Maintain 16-byte stack alignment

2017-01-10 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 8:35 PM, Herbert Xu  wrote:
> On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
>>
>> That said, I do think that the "don't assume stack alignment, do it by
>> hand" may be the safer thing. Because who knows what the random rules
>> will be on other architectures.
>
> Sure we can ban the use of attribute aligned on stacks.  But
> what about indirect uses through structures?  For example, if
> someone does
>
> struct foo {
> } __attribute__ ((__aligned__(16)));
>
> int bar(...)
> {
> struct foo f;
>
> return baz(&f);
> }
>
> then baz will end up with an unaligned argument.  The worst part
> is that it is not at all obvious to the person writing the function
> bar.

Linus, I'm starting to lean toward agreeing with Herbert here, except
that we should consider making it conditional on having a silly GCC
version.  After all, the silly GCC versions are wasting space and time
with alignment instructions no matter what we do, so this would just
mean tweaking the asm and adding some kind of check_stack_alignment()
helper to throw out a WARN_ONCE() if we miss one.  The problem with
making it conditional is that making pt_regs effectively live at a
variable offset from %rsp is just nasty.
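
Roughly what I have in mind for that helper -- a sketch only, assuming
x86_64 and the existing current_stack_pointer accessor:

/* Hypothetical check_stack_alignment(): warn once if %rsp is not
 * 16-byte aligned at the point this is called. */
static __always_inline void check_stack_alignment(void)
{
	unsigned long sp = current_stack_pointer;

	WARN_ONCE(sp & 15, "misaligned stack: sp=%#lx\n", sp);
}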

--Andy


Re: [PATCH v2 7/8] net: Rename TCA*BPF_DIGEST to ..._SHA256

2017-01-10 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 4:50 PM, Daniel Borkmann  wrote:
> On 01/11/2017 12:24 AM, Andy Lutomirski wrote:
>>
>> This makes it easier to add another digest algorithm down the road if
>> needed.  It also serves to force any programs that might have been
>> written against a kernel that had the old field name to notice the
>> change and make any necessary changes.
>>
>> This shouldn't violate any stable API policies, as no released kernel
>> has ever had TCA*BPF_DIGEST.
>
>
> Imho, this and patch 6/8 are not really needed. Should there ever
> another digest alg be used (doubt it), then you'd need a new nl
> attribute and fdinfo line anyway to keep existing stuff intact.
> Nobody made the claim that you can just change this underneath
> and not respecting abi for existing applications when I read from
> above that such apps now will get "forced" to notice a change.

Fair enough.  I was more concerned about prerelease iproute2 versions,
but maybe that's a nonissue.  I'll drop these two patches.

--Andy


[PATCH v2 7/8] net: Rename TCA*BPF_DIGEST to ..._SHA256

2017-01-10 Thread Andy Lutomirski
This makes it easier to add another digest algorithm down the road if
needed.  It also serves to force any programs that might have been
written against a kernel that had the old field name to notice the
change and make any necessary changes.

This shouldn't violate any stable API policies, as no released kernel
has ever had TCA*BPF_DIGEST.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 include/uapi/linux/pkt_cls.h   | 2 +-
 include/uapi/linux/tc_act/tc_bpf.h | 2 +-
 net/sched/act_bpf.c| 2 +-
 net/sched/cls_bpf.c| 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index cb4bcdc58543..ac6b300c1550 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -397,7 +397,7 @@ enum {
TCA_BPF_NAME,
TCA_BPF_FLAGS,
TCA_BPF_FLAGS_GEN,
-   TCA_BPF_DIGEST,
+   TCA_BPF_SHA256,
__TCA_BPF_MAX,
 };
 
diff --git a/include/uapi/linux/tc_act/tc_bpf.h 
b/include/uapi/linux/tc_act/tc_bpf.h
index a6b88a6f7f71..eae18a7430eb 100644
--- a/include/uapi/linux/tc_act/tc_bpf.h
+++ b/include/uapi/linux/tc_act/tc_bpf.h
@@ -27,7 +27,7 @@ enum {
TCA_ACT_BPF_FD,
TCA_ACT_BPF_NAME,
TCA_ACT_BPF_PAD,
-   TCA_ACT_BPF_DIGEST,
+   TCA_ACT_BPF_SHA256,
__TCA_ACT_BPF_MAX,
 };
 #define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1)
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 1c60317f0121..3868a66d0b24 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -123,7 +123,7 @@ static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf 
*prog,
nla_put_string(skb, TCA_ACT_BPF_NAME, prog->bpf_name))
return -EMSGSIZE;
 
-   nla = nla_reserve(skb, TCA_ACT_BPF_DIGEST,
+   nla = nla_reserve(skb, TCA_ACT_BPF_SHA256,
  sizeof(prog->filter->digest));
if (nla == NULL)
return -EMSGSIZE;
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index adc776048d1a..6fc110321621 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -555,7 +555,7 @@ static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog 
*prog,
nla_put_string(skb, TCA_BPF_NAME, prog->bpf_name))
return -EMSGSIZE;
 
-   nla = nla_reserve(skb, TCA_BPF_DIGEST, sizeof(prog->filter->digest));
+   nla = nla_reserve(skb, TCA_BPF_SHA256, sizeof(prog->filter->digest));
if (nla == NULL)
return -EMSGSIZE;
 
-- 
2.9.3



[PATCH v2 4/8] bpf: Use SHA256 instead of SHA1 for bpf digests

2017-01-10 Thread Andy Lutomirski
SHA1 is considered obsolete.  It is no longer considered to be
collision resistant and, in general, it should not be used for new
applications.  Change the new-in-4.10 BPF digest to SHA-256.  This
means that potential future applications of the digest that need
collision resistance will be able to use the BPF digest.  Applications
that just want a short identifier for a BPF program are welcome to
truncate the digest.

This is also a cleanup IMO -- the new sha256_*_direct() API is much
nicer than the old SHA1 library helpers.  It will also enable
incremental hashing so the BPF code can avoid calling vmalloc().

I moved the digest field to keep all of the bpf program metadata in
the same cache line.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 include/linux/filter.h | 11 +++
 init/Kconfig   |  1 +
 kernel/bpf/core.c  | 42 --
 3 files changed, 12 insertions(+), 42 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 702314253797..23df2574e30c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -14,7 +14,8 @@
 #include 
 #include 
 #include 
-#include 
+
+#include 
 
 #include 
 
@@ -408,11 +409,11 @@ struct bpf_prog {
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
-   u32 digest[SHA_DIGEST_WORDS]; /* Program digest */
struct bpf_prog_aux *aux;   /* Auxiliary fields */
struct sock_fprog_kern  *orig_prog; /* Original BPF program */
unsigned int(*bpf_func)(const void *ctx,
const struct bpf_insn *insn);
+   u8  digest[SHA256_DIGEST_SIZE]; /* Program digest */
/* Instructions for interpreter */
union {
struct sock_filter  insns[0];
@@ -519,12 +520,6 @@ static inline u32 bpf_prog_insn_size(const struct bpf_prog 
*prog)
return prog->len * sizeof(struct bpf_insn);
 }
 
-static inline u32 bpf_prog_digest_scratch_size(const struct bpf_prog *prog)
-{
-   return round_up(bpf_prog_insn_size(prog) +
-   sizeof(__be64) + 1, SHA_MESSAGE_BYTES);
-}
-
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
return max(sizeof(struct bpf_prog),
diff --git a/init/Kconfig b/init/Kconfig
index 223b734abccd..f1ea6d023f8c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1389,6 +1389,7 @@ config HAVE_PCSPKR_PLATFORM
 # interpreter that classic socket filters depend on
 config BPF
bool
+   select CRYPTO_SHA256_DIRECT
 
 menuconfig EXPERT
bool "Configure standard kernel features (expert users)"
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1eb4f1303756..668b92f6ab58 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -148,22 +148,18 @@ void __bpf_prog_free(struct bpf_prog *fp)
 
 int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
-   const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
-   u32 raw_size = bpf_prog_digest_scratch_size(fp);
-   u32 ws[SHA_WORKSPACE_WORDS];
-   u32 i, bsize, psize, blocks;
+   struct sha256_state sha;
+   u32 i, psize;
struct bpf_insn *dst;
bool was_ld_map;
-   u8 *raw, *todo;
-   __be32 *result;
-   __be64 *bits;
+   u8 *raw;
 
-   raw = vmalloc(raw_size);
+   psize = bpf_prog_insn_size(fp);
+   raw = vmalloc(psize);
if (!raw)
return -ENOMEM;
 
-   sha_init(fp->digest);
-   memset(ws, 0, sizeof(ws));
+   sha256_init_direct(&sha);
 
/* We need to take out the map fd for the digest calculation
 * since they are unstable from user space side.
@@ -188,30 +184,8 @@ int bpf_prog_calc_digest(struct bpf_prog *fp)
}
}
 
-   psize = bpf_prog_insn_size(fp);
-   memset(&raw[psize], 0, raw_size - psize);
-   raw[psize++] = 0x80;
-
-   bsize  = round_up(psize, SHA_MESSAGE_BYTES);
-   blocks = bsize / SHA_MESSAGE_BYTES;
-   todo   = raw;
-   if (bsize - psize >= sizeof(__be64)) {
-   bits = (__be64 *)(todo + bsize - sizeof(__be64));
-   } else {
-   bits = (__be64 *)(todo + bsize + bits_offset);
-   blocks++;
-   }
-   *bits = cpu_to_be64((psize - 1) << 3);
-
-   while (blocks--) {
-   sha_transform(fp->digest, todo, ws);
-   todo += SHA_MESSAGE_BYTES;
-   }
-
-   result = (__force __be32 *)fp->digest;
-   for (i = 0; i < SHA_DIGEST_WORDS; i++)
-   result[i] = cpu_to_be32(fp->digest[i]);
-
+   sha256_update_direct(&sha, raw, psize);
+   sha256_final_direct(&sha, fp->digest);
vfree(raw);
return 0;
 }
-- 
2.9.3


[PATCH v2 6/8] bpf: Rename fdinfo's prog_digest to prog_sha256

2017-01-10 Thread Andy Lutomirski
This makes it easier to add another digest algorithm down the road
if needed.  It also serves to force any programs that might have
been written against a kernel that had 'prog_digest' to be updated.

This shouldn't violate any stable API policies, as no released
kernel has ever had 'prog_digest'.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e89acea22ecf..956370b80296 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,7 +694,7 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct 
file *filp)
seq_printf(m,
   "prog_type:\t%u\n"
   "prog_jited:\t%u\n"
-  "prog_digest:\t%s\n"
+  "prog_sha256:\t%s\n"
   "memlock:\t%llu\n",
   prog->type,
   prog->jited,
-- 
2.9.3



Re: x86-64: Maintain 16-byte stack alignment

2017-01-10 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 12:00 PM, Ard Biesheuvel
 wrote:
> On 10 January 2017 at 19:22, Andy Lutomirski  wrote:
>> On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
>>  wrote:
>>> On 10 January 2017 at 19:00, Andy Lutomirski  wrote:
>>>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>>>>  wrote:
>>>>> On 10 January 2017 at 14:33, Herbert Xu  
>>>>> wrote:
>>>>>> I recently applied the patch
>>>>>>
>>>>>> https://patchwork.kernel.org/patch/9468391/
>>>>>>
>>>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>>>> code.  It turned out that the patch changed a manually aligned
>>>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>>>> which is not actually the case in the kernel.
>>>>>>
>>>>>
>>>>> Apologies for introducing this breakage. It seemed like an obvious and
>>>>> simple cleanup, so I didn't even bother to mention it in the commit
>>>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>>>> we should revert to the old method. If SSE instructions are the only
>>>>> ones that require this alignment, then I suppose not having an ABI
>>>>> conforming stack pointer should not be an issue in general.
>>>>
>>>> Here's what I think is really going on.  This is partially from
>>>> memory, so I could be off base.  The kernel is up against
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>>>> on some GCC versions (like the bad one and maybe even current ones),
>>>> things compiled without -mno-sse can't have the stack alignment set
>>>> properly.  IMO we should fix this in the affected code, not the entry
>>>> code.  In fact, I think that fixing it in the entry code won't even
>>>> fully fix it because modern GCC will compile the rest of the kernel
>>>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>>>> 4.8 and newer).
>>>>
>>>> Can we just add __attribute__((force_align_arg_pointer)) to the
>>>> affected functions?  Maybe have:
>>>>
>>>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>>>
>>>> on affected gcc versions?
>>>>
>>>> ***HOWEVER***
>>>>
>>>> I think this is missing the tree for the supposed forest.  The actual
>>>> affected code appears to be:
>>>>
>>>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist 
>>>> *dst,
>>>>  struct scatterlist *src, unsigned int nbytes)
>>>> {
>>>> u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 
>>>> 1];
>>>>
>>>> ...
>>>>
>>>> state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>>>
>>>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>>>> and optimizes out the roundup.  How about just declaring an actual
>>>> __aligned(16) buffer, marking the function
>>>> __attribute__((force_align_arg_pointer)), and being done with it?
>>>> After all, we need that forcible alignment on *all* gcc versions.
>>>>
>>>
>>> Actually, the breakage is introduced by the patch Herbert refers to
>>>
>>> https://patchwork.kernel.org/patch/9468391/
>>>
>>> where the state is replaced by a simple
>>>
>>> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>>>
>>> which seemed harmless enough to me. So the code above works fine.
>>
>> So how about just the one-line patch of adding the
>> force_align_arg_pointer?  Would that solve the problem?
>
> If it does what it says on the tin, it should fix the issue, but after
> adding the attribute, I get the exact same object output, so there's
> something dodgy going on here.

Ugh, that's annoying.  Maybe it needs noinline too?

--Andy


[PATCH v2 5/8] bpf: Avoid copying the entire BPF program when hashing it

2017-01-10 Thread Andy Lutomirski
The sha256 helpers can consume a message incrementally, so there's no need
to allocate a buffer to store the whole blob to be hashed.

This may be a slight slowdown for very long messages because gcc can't
inline the sha256_update() calls.  For reasonably-sized programs,
however, this should be a considerable speedup as vmalloc() is quite
slow.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 kernel/bpf/core.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 668b92f6ab58..106162a1bc54 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -149,44 +149,37 @@ void __bpf_prog_free(struct bpf_prog *fp)
 int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
struct sha256_state sha;
-   u32 i, psize;
-   struct bpf_insn *dst;
+   u32 i;
bool was_ld_map;
-   u8 *raw;
-
-   psize = bpf_prog_insn_size(fp);
-   raw = vmalloc(psize);
-   if (!raw)
-   return -ENOMEM;
 
sha256_init_direct(&sha);
 
/* We need to take out the map fd for the digest calculation
 * since they are unstable from user space side.
 */
-   dst = (void *)raw;
for (i = 0, was_ld_map = false; i < fp->len; i++) {
-   dst[i] = fp->insnsi[i];
+   struct bpf_insn insn = fp->insnsi[i];
+
if (!was_ld_map &&
-   dst[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
-   dst[i].src_reg == BPF_PSEUDO_MAP_FD) {
+   insn.code == (BPF_LD | BPF_IMM | BPF_DW) &&
+   insn.src_reg == BPF_PSEUDO_MAP_FD) {
was_ld_map = true;
-   dst[i].imm = 0;
+   insn.imm = 0;
} else if (was_ld_map &&
-  dst[i].code == 0 &&
-  dst[i].dst_reg == 0 &&
-  dst[i].src_reg == 0 &&
-  dst[i].off == 0) {
+  insn.code == 0 &&
+  insn.dst_reg == 0 &&
+  insn.src_reg == 0 &&
+  insn.off == 0) {
was_ld_map = false;
-   dst[i].imm = 0;
+   insn.imm = 0;
} else {
was_ld_map = false;
}
+
+   sha256_update_direct(&sha, (const u8 *)&insn, sizeof(insn));
}
 
-   sha256_update_direct(&sha, raw, psize);
sha256_final_direct(&sha, fp->digest);
-   vfree(raw);
return 0;
 }
 
-- 
2.9.3



[PATCH v2 3/8] crypto/sha256: Build the SHA256 core separately from the crypto module

2017-01-10 Thread Andy Lutomirski
This just moves code around -- no code changes in this patch.  This
will let BPF-based tracing link against the SHA256 core code without
depending on the crypto core.

Cc: Ard Biesheuvel 
Cc: Herbert Xu 
Signed-off-by: Andy Lutomirski 
---
 crypto/Kconfig  |   8 ++
 crypto/Makefile |   1 +
 crypto/sha256_direct.c  | 238 
 crypto/sha256_generic.c | 214 ---
 4 files changed, 247 insertions(+), 214 deletions(-)
 create mode 100644 crypto/sha256_direct.c

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 160f08e721cc..b83ae6789e78 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -10,6 +10,13 @@ config XOR_BLOCKS
 source "crypto/async_tx/Kconfig"
 
 #
+# Cryptographic algorithms that are usable without the Crypto API.
+# None of these should have visible config options.
+#
+config CRYPTO_SHA256_DIRECT
+   bool
+
+#
 # Cryptographic API Configuration
 #
 menuconfig CRYPTO
@@ -763,6 +770,7 @@ config CRYPTO_SHA512_MB
 
 config CRYPTO_SHA256
tristate "SHA224 and SHA256 digest algorithm"
+   select CRYPTO_SHA256_DIRECT
select CRYPTO_HASH
help
  SHA256 secure hash standard (DFIPS 180-2).
diff --git a/crypto/Makefile b/crypto/Makefile
index b8f0e3eb0791..e0118a3b0d99 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_CRYPTO_RMD160) += rmd160.o
 obj-$(CONFIG_CRYPTO_RMD256) += rmd256.o
 obj-$(CONFIG_CRYPTO_RMD320) += rmd320.o
 obj-$(CONFIG_CRYPTO_SHA1) += sha1_generic.o
+obj-$(CONFIG_CRYPTO_SHA256_DIRECT) += sha256_direct.o
 obj-$(CONFIG_CRYPTO_SHA256) += sha256_generic.o
 obj-$(CONFIG_CRYPTO_SHA512) += sha512_generic.o
 obj-$(CONFIG_CRYPTO_SHA3) += sha3_generic.o
diff --git a/crypto/sha256_direct.c b/crypto/sha256_direct.c
new file mode 100644
index ..2029e4c08339
--- /dev/null
+++ b/crypto/sha256_direct.c
@@ -0,0 +1,238 @@
+/*
+ * Cryptographic API.
+ *
+ * SHA-256, as specified in
+ * http://csrc.nist.gov/groups/STM/cavp/documents/shs/sha256-384-512.pdf
+ *
+ * SHA-256 code by Jean-Luc Cooke .
+ *
+ * Copyright (c) Jean-Luc Cooke 
+ * Copyright (c) Andrew McDonald 
+ * Copyright (c) 2002 James Morris 
+ * SHA224 Support Copyright 2007 Intel Corporation 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option) 
+ * any later version.
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline u32 Ch(u32 x, u32 y, u32 z)
+{
+   return z ^ (x & (y ^ z));
+}
+
+static inline u32 Maj(u32 x, u32 y, u32 z)
+{
+   return (x & y) | (z & (x | y));
+}
+
+#define e0(x)   (ror32(x, 2) ^ ror32(x,13) ^ ror32(x,22))
+#define e1(x)   (ror32(x, 6) ^ ror32(x,11) ^ ror32(x,25))
+#define s0(x)   (ror32(x, 7) ^ ror32(x,18) ^ (x >> 3))
+#define s1(x)   (ror32(x,17) ^ ror32(x,19) ^ (x >> 10))
+
+static inline void LOAD_OP(int I, u32 *W, const u8 *input)
+{
+   W[I] = get_unaligned_be32((__u32 *)input + I);
+}
+
+static inline void BLEND_OP(int I, u32 *W)
+{
+   W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
+}
+
+static void sha256_transform(u32 *state, const u8 *input)
+{
+   u32 a, b, c, d, e, f, g, h, t1, t2;
+   u32 W[64];
+   int i;
+
+   /* load the input */
+   for (i = 0; i < 16; i++)
+   LOAD_OP(i, W, input);
+
+   /* now blend */
+   for (i = 16; i < 64; i++)
+   BLEND_OP(i, W);
+
+   /* load the state into our registers */
+   a=state[0];  b=state[1];  c=state[2];  d=state[3];
+   e=state[4];  f=state[5];  g=state[6];  h=state[7];
+
+   /* now iterate */
+   t1 = h + e1(e) + Ch(e,f,g) + 0x428a2f98 + W[ 0];
+   t2 = e0(a) + Maj(a,b,c);d+=t1;h=t1+t2;
+   t1 = g + e1(d) + Ch(d,e,f) + 0x71374491 + W[ 1];
+   t2 = e0(h) + Maj(h,a,b);c+=t1;g=t1+t2;
+   t1 = f + e1(c) + Ch(c,d,e) + 0xb5c0fbcf + W[ 2];
+   t2 = e0(g) + Maj(g,h,a);b+=t1;f=t1+t2;
+   t1 = e + e1(b) + Ch(b,c,d) + 0xe9b5dba5 + W[ 3];
+   t2 = e0(f) + Maj(f,g,h);a+=t1;e=t1+t2;
+   t1 = d + e1(a) + Ch(a,b,c) + 0x3956c25b + W[ 4];
+   t2 = e0(e) + Maj(e,f,g);h+=t1;d=t1+t2;
+   t1 = c + e1(h) + Ch(h,a,b) + 0x59f111f1 + W[ 5];
+   t2 = e0(d) + Maj(d,e,f);g+=t1;c=t1+t2;
+   t1 = b + e1(g) + Ch(g,h,a) + 0x923f82a4 + W[ 6];
+   t2 = e0(c) + Maj(c,d,e);f+=t1;b=t1+t2;
+   t1 = a + e1(f) + Ch(f,g,h) + 0xab1c5ed5 + W[ 7];
+   t2 = e0(b) + Maj(b,c,d);e+=t1;a=t1+t2;
+
+   t1 = h + e1(e) + Ch(e,f,g) + 0xd807aa98 + W[ 8];
+   t2 = e0(a) + Maj(a,b,c);d+=t1;h=t1+t2;
+   t1 = g + e1(d) + Ch(d,e,f) + 0x12835b01 + W[ 9];
+   t2 = e0(h) + Maj(h,a,b);c+=t1;g=t1+t2;
+   t1 = f + e1(c) + Ch(c,d,e) + 

[PATCH v2 8/8] crypto/testmgr: Allocate only the required output size for hash tests

2017-01-10 Thread Andy Lutomirski
There are some hashes (e.g. sha224) that have some internal trickery
to make sure that only the correct number of output bytes are
generated.  If something goes wrong, they could potentially overrun
the output buffer.

Make the test more robust by allocating only enough space for the
correct output size so that memory debugging will catch the error if
the output is overrun.

Tested by intentionally breaking sha224 to output all 256
internally-generated bits while running on KASAN.

Cc: Ard Biesheuvel 
Cc: Herbert Xu 
Signed-off-by: Andy Lutomirski 
---
 crypto/testmgr.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index f616ad74cce7..575fc28a9ab2 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -265,6 +265,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct 
hash_testvec *template,
   const int align_offset)
 {
const char *algo = crypto_tfm_alg_driver_name(crypto_ahash_tfm(tfm));
+   size_t digest_size = crypto_ahash_digestsize(tfm);
unsigned int i, j, k, temp;
struct scatterlist sg[8];
char *result;
@@ -275,7 +276,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct 
hash_testvec *template,
char *xbuf[XBUFSIZE];
int ret = -ENOMEM;
 
-   result = kmalloc(MAX_DIGEST_SIZE, GFP_KERNEL);
+   result = kmalloc(digest_size, GFP_KERNEL);
if (!result)
return ret;
key = kmalloc(MAX_KEYLEN, GFP_KERNEL);
@@ -305,7 +306,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct 
hash_testvec *template,
goto out;
 
j++;
-   memset(result, 0, MAX_DIGEST_SIZE);
+   memset(result, 0, digest_size);
 
hash_buff = xbuf[0];
hash_buff += align_offset;
@@ -380,7 +381,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct 
hash_testvec *template,
continue;
 
j++;
-   memset(result, 0, MAX_DIGEST_SIZE);
+   memset(result, 0, digest_size);
 
temp = 0;
sg_init_table(sg, template[i].np);
@@ -458,7 +459,7 @@ static int __test_hash(struct crypto_ahash *tfm, struct 
hash_testvec *template,
continue;
 
j++;
-   memset(result, 0, MAX_DIGEST_SIZE);
+   memset(result, 0, digest_size);
 
ret = -EINVAL;
hash_buff = xbuf[0];
-- 
2.9.3



[PATCH v2 2/8] crypto/sha256: Export a sha256_{init,update,final}_direct() API

2017-01-10 Thread Andy Lutomirski
This provides a very simple interface for kernel code to use to do
synchronous, unaccelerated, virtual-address-based SHA256 hashing
without needing to create a crypto context.

Subsequent patches will make this work without building the crypto
core and will use it to avoid making BPF-based tracing depend on
crypto.
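
A minimal caller-side sketch of the resulting interface (illustration
only; kernel context and error handling omitted):

	struct sha256_state sctx;
	u8 digest[SHA256_DIGEST_SIZE];

	sha256_init_direct(&sctx);
	sha256_update_direct(&sctx, data, len);
	sha256_final_direct(&sctx, digest);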

Cc: Ard Biesheuvel 
Cc: Herbert Xu 
Signed-off-by: Andy Lutomirski 
---
 crypto/sha256_generic.c  | 31 ++-
 include/crypto/sha.h | 24 
 include/crypto/sha256_base.h | 13 -
 3 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/crypto/sha256_generic.c b/crypto/sha256_generic.c
index 8f9c47e1a96e..573e114382f9 100644
--- a/crypto/sha256_generic.c
+++ b/crypto/sha256_generic.c
@@ -240,24 +240,45 @@ static void sha256_generic_block_fn(struct sha256_state 
*sst, u8 const *src,
}
 }
 
+void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
+ unsigned int len)
+{
+   __sha256_base_do_update(sctx, data, len, sha256_generic_block_fn);
+}
+EXPORT_SYMBOL(sha256_update_direct);
+
 int crypto_sha256_update(struct shash_desc *desc, const u8 *data,
  unsigned int len)
 {
-   return sha256_base_do_update(desc, data, len, sha256_generic_block_fn);
+   sha256_update_direct(shash_desc_ctx(desc), data, len);
+   return 0;
 }
 EXPORT_SYMBOL(crypto_sha256_update);
 
 static int sha256_final(struct shash_desc *desc, u8 *out)
 {
-   sha256_base_do_finalize(desc, sha256_generic_block_fn);
-   return sha256_base_finish(desc, out);
+   __sha256_final_direct(shash_desc_ctx(desc),
+ crypto_shash_digestsize(desc->tfm), out);
+   return 0;
 }
 
+void __sha256_final_direct(struct sha256_state *sctx, unsigned int digest_size,
+  u8 *out)
+{
+   sha256_do_finalize_direct(sctx, sha256_generic_block_fn);
+   __sha256_base_finish(sctx, digest_size, out);
+}
+EXPORT_SYMBOL(sha256_final_direct);
+
 int crypto_sha256_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *hash)
 {
-   sha256_base_do_update(desc, data, len, sha256_generic_block_fn);
-   return sha256_final(desc, hash);
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
+
+   sha256_update_direct(sctx, data, len);
+   __sha256_final_direct(sctx, digest_size, hash);
+   return 0;
 }
 EXPORT_SYMBOL(crypto_sha256_finup);
 
diff --git a/include/crypto/sha.h b/include/crypto/sha.h
index c94d3eb1cefd..0df1a0e42c95 100644
--- a/include/crypto/sha.h
+++ b/include/crypto/sha.h
@@ -88,6 +88,30 @@ struct sha512_state {
u8 buf[SHA512_BLOCK_SIZE];
 };
 
+static inline void sha256_init_direct(struct sha256_state *sctx)
+{
+   sctx->state[0] = SHA256_H0;
+   sctx->state[1] = SHA256_H1;
+   sctx->state[2] = SHA256_H2;
+   sctx->state[3] = SHA256_H3;
+   sctx->state[4] = SHA256_H4;
+   sctx->state[5] = SHA256_H5;
+   sctx->state[6] = SHA256_H6;
+   sctx->state[7] = SHA256_H7;
+   sctx->count = 0;
+}
+
+extern void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
+unsigned int len);
+
+extern void __sha256_final_direct(struct sha256_state *sctx,
+ unsigned int digest_size, u8 *out);
+
+static inline void sha256_final_direct(struct sha256_state *sctx, u8 *out)
+{
+   __sha256_final_direct(sctx, SHA256_DIGEST_SIZE, out);
+}
+
 struct shash_desc;
 
 extern int crypto_sha1_update(struct shash_desc *desc, const u8 *data,
diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index fc77b8e099a7..9bbe73ce458f 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -37,19 +37,6 @@ static inline int sha224_base_init(struct shash_desc *desc)
return 0;
 }
 
-static inline void sha256_init_direct(struct sha256_state *sctx)
-{
-   sctx->state[0] = SHA256_H0;
-   sctx->state[1] = SHA256_H1;
-   sctx->state[2] = SHA256_H2;
-   sctx->state[3] = SHA256_H3;
-   sctx->state[4] = SHA256_H4;
-   sctx->state[5] = SHA256_H5;
-   sctx->state[6] = SHA256_H6;
-   sctx->state[7] = SHA256_H7;
-   sctx->count = 0;
-}
-
 static inline int sha256_base_init(struct shash_desc *desc)
 {
sha256_init_direct(shash_desc_ctx(desc));
-- 
2.9.3



[PATCH v2 0/8] Switch BPF's digest to SHA256

2017-01-10 Thread Andy Lutomirski
I can imagine future uses for the new-in-4.10 BPF digest feature that
would be problematic if malicious users could produce collisions, and
SHA-1 is no longer considered to be collision-free.  Even without
needing collision resistance, SHA-1 is no longer recommended for new
applications.  Switch the BPF digest to SHA-256 instead.

The actual switchover is trivial.  Most of this series consists of
cleanups to the SHA256 code to make it usable as a standalone library
(since BPF should not depend on crypto).

The cleaned up library is much more user-friendly than the SHA-1 code,
so this also significantly tidies up the BPF digest code.

This is intended for 4.10.  If this series misses 4.10 and nothing
takes its place, then we'll have an unpleasant ABI stability
situation.

NB: I would be happy to make parallel changes to the SHA-512 code if
the crypto folks would like for me to do so.  I haven't yet because I
wanted to minimize churn.  Also, the changes will be essentially
identical to the SHA-256 changes and I want to get the latter
reviewed first.

Andy Lutomirski (8):
  crypto/sha256: Factor out the parts of base API that don't use
shash_desc
  crypto/sha256: Export a sha256_{init,update,final}_direct() API
  crypto/sha256: Build the SHA256 core separately from the crypto module
  bpf: Use SHA256 instead of SHA1 for bpf digests
  bpf: Avoid copying the entire BPF program when hashing it
  bpf: Rename fdinfo's prog_digest to prog_sha256
  net: Rename TCA*BPF_DIGEST to ..._SHA256
  crypto/testmgr: Allocate only the required output size for hash tests

 crypto/Kconfig |   8 ++
 crypto/Makefile|   1 +
 crypto/sha256_direct.c | 238 +
 crypto/sha256_generic.c| 215 ++---
 crypto/testmgr.c   |   9 +-
 include/crypto/sha.h   |  24 
 include/crypto/sha256_base.h   |  58 +
 include/linux/filter.h |  11 +-
 include/uapi/linux/pkt_cls.h   |   2 +-
 include/uapi/linux/tc_act/tc_bpf.h |   2 +-
 init/Kconfig   |   1 +
 kernel/bpf/core.c  |  63 +++---
 kernel/bpf/syscall.c   |   2 +-
 net/sched/act_bpf.c|   2 +-
 net/sched/cls_bpf.c|   2 +-
 15 files changed, 343 insertions(+), 295 deletions(-)
 create mode 100644 crypto/sha256_direct.c

-- 
2.9.3



[PATCH v2 1/8] crypto/sha256: Factor out the parts of base API that don't use shash_desc

2017-01-10 Thread Andy Lutomirski
I want to expose a minimal SHA256 API that can be used without
depending on the crypto core.  To prepare for this, factor out the
meat of the sha256_base_*() helpers.

Cc: Ard Biesheuvel 
Cc: Herbert Xu 
Signed-off-by: Andy Lutomirski 
---
 include/crypto/sha256_base.h | 53 ++--
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index d1f2195bb7de..fc77b8e099a7 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -18,10 +18,8 @@
 typedef void (sha256_block_fn)(struct sha256_state *sst, u8 const *src,
   int blocks);
 
-static inline int sha224_base_init(struct shash_desc *desc)
+static inline void sha224_init_direct(struct sha256_state *sctx)
 {
-   struct sha256_state *sctx = shash_desc_ctx(desc);
-
sctx->state[0] = SHA224_H0;
sctx->state[1] = SHA224_H1;
sctx->state[2] = SHA224_H2;
@@ -31,14 +29,16 @@ static inline int sha224_base_init(struct shash_desc *desc)
sctx->state[6] = SHA224_H6;
sctx->state[7] = SHA224_H7;
sctx->count = 0;
+}
 
+static inline int sha224_base_init(struct shash_desc *desc)
+{
+   sha224_init_direct(shash_desc_ctx(desc));
return 0;
 }
 
-static inline int sha256_base_init(struct shash_desc *desc)
+static inline void sha256_init_direct(struct sha256_state *sctx)
 {
-   struct sha256_state *sctx = shash_desc_ctx(desc);
-
sctx->state[0] = SHA256_H0;
sctx->state[1] = SHA256_H1;
sctx->state[2] = SHA256_H2;
@@ -48,16 +48,19 @@ static inline int sha256_base_init(struct shash_desc *desc)
sctx->state[6] = SHA256_H6;
sctx->state[7] = SHA256_H7;
sctx->count = 0;
+}
 
+static inline int sha256_base_init(struct shash_desc *desc)
+{
+   sha256_init_direct(shash_desc_ctx(desc));
return 0;
 }
 
-static inline int sha256_base_do_update(struct shash_desc *desc,
-   const u8 *data,
-   unsigned int len,
-   sha256_block_fn *block_fn)
+static inline void __sha256_base_do_update(struct sha256_state *sctx,
+  const u8 *data,
+  unsigned int len,
+  sha256_block_fn *block_fn)
 {
-   struct sha256_state *sctx = shash_desc_ctx(desc);
unsigned int partial = sctx->count % SHA256_BLOCK_SIZE;
 
sctx->count += len;
@@ -86,15 +89,21 @@ static inline int sha256_base_do_update(struct shash_desc 
*desc,
}
if (len)
memcpy(sctx->buf + partial, data, len);
+}
 
+static inline int sha256_base_do_update(struct shash_desc *desc,
+   const u8 *data,
+   unsigned int len,
+   sha256_block_fn *block_fn)
+{
+   __sha256_base_do_update(shash_desc_ctx(desc), data, len, block_fn);
return 0;
 }
 
-static inline int sha256_base_do_finalize(struct shash_desc *desc,
- sha256_block_fn *block_fn)
+static inline void sha256_do_finalize_direct(struct sha256_state *sctx,
+sha256_block_fn *block_fn)
 {
const int bit_offset = SHA256_BLOCK_SIZE - sizeof(__be64);
-   struct sha256_state *sctx = shash_desc_ctx(desc);
__be64 *bits = (__be64 *)(sctx->buf + bit_offset);
unsigned int partial = sctx->count % SHA256_BLOCK_SIZE;
 
@@ -109,14 +118,18 @@ static inline int sha256_base_do_finalize(struct 
shash_desc *desc,
memset(sctx->buf + partial, 0x0, bit_offset - partial);
*bits = cpu_to_be64(sctx->count << 3);
block_fn(sctx, sctx->buf, 1);
+}
 
+static inline int sha256_base_do_finalize(struct shash_desc *desc,
+ sha256_block_fn *block_fn)
+{
+   sha256_do_finalize_direct(shash_desc_ctx(desc), block_fn);
return 0;
 }
 
-static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
+static inline void __sha256_base_finish(struct sha256_state *sctx,
+   unsigned int digest_size, u8 *out)
 {
-   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
-   struct sha256_state *sctx = shash_desc_ctx(desc);
__be32 *digest = (__be32 *)out;
int i;
 
@@ -124,5 +137,11 @@ static inline int sha256_base_finish(struct shash_desc 
*desc, u8 *out)
put_unaligned_be32(sctx->state[i], digest++);
 
*sctx = (struct sha256_state){};
+}
+
+static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
+{
+   __sha256_base_finish(shash_desc_ctx(desc),
+crypto_shash_digestsize(desc->t

Re: x86-64: Maintain 16-byte stack alignment

2017-01-10 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
 wrote:
> On 10 January 2017 at 14:33, Herbert Xu  wrote:
>> I recently applied the patch
>>
>> https://patchwork.kernel.org/patch/9468391/
>>
>> and ended up with a boot crash when it tried to run the x86 chacha20
>> code.  It turned out that the patch changed a manually aligned
>> stack buffer to one that is aligned by gcc.  What was happening was
>> that gcc can stack align to any value on x86-64 except 16.  The
>> reason is that gcc assumes that the stack is always 16-byte aligned,
>> which is not actually the case in the kernel.
>>
>
> Apologies for introducing this breakage. It seemed like an obvious and
> simple cleanup, so I didn't even bother to mention it in the commit
> log, but if the kernel does not guarantee 16 byte alignment, I guess
> we should revert to the old method. If SSE instructions are the only
> ones that require this alignment, then I suppose not having an ABI
> conforming stack pointer should not be an issue in general.

Here's what I think is really going on.  This is partially from
memory, so I could be off base.  The kernel is up against
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
on some GCC versions (like the bad one and maybe even current ones),
things compiled without -mno-sse can't have the stack alignment set
properly.  IMO we should fix this in the affected code, not the entry
code.  In fact, I think that fixing it in the entry code won't even
fully fix it because modern GCC will compile the rest of the kernel
with 8-byte alignment and the stack will get randomly unaligned (GCC
4.8 and newer).

Can we just add __attribute__((force_align_arg_pointer)) to the
affected functions?  Maybe have:

#define __USES_SSE __attribute__((force_align_arg_pointer))

on affected gcc versions?

***HOWEVER***

I think this is missing the tree for the supposed forest.  The actual
affected code appears to be:

static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
 struct scatterlist *src, unsigned int nbytes)
{
u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];

...

state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);

gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
and optimizes out the roundup.  How about just declaring an actual
__aligned(16) buffer, marking the function
__attribute__((force_align_arg_pointer)), and being done with it?
After all, we need that forcible alignment on *all* gcc versions.
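
Concretely, something like this -- a sketch of the idea, not a tested
patch, and the __USES_SSE name is made up:

#define __USES_SSE __attribute__((force_align_arg_pointer))

__USES_SSE
static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist *dst,
			 struct scatterlist *src, unsigned int nbytes)
{
	u32 state[16] __aligned(CHACHA20_STATE_ALIGN);

	/* ... rest of the body unchanged, minus the manual roundup() ... */
}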

--Andy


Re: x86-64: Maintain 16-byte stack alignment

2017-01-10 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 11:16 AM, Ard Biesheuvel
 wrote:
> On 10 January 2017 at 19:00, Andy Lutomirski  wrote:
>> On Tue, Jan 10, 2017 at 9:30 AM, Ard Biesheuvel
>>  wrote:
>>> On 10 January 2017 at 14:33, Herbert Xu  wrote:
>>>> I recently applied the patch
>>>>
>>>> https://patchwork.kernel.org/patch/9468391/
>>>>
>>>> and ended up with a boot crash when it tried to run the x86 chacha20
>>>> code.  It turned out that the patch changed a manually aligned
>>>> stack buffer to one that is aligned by gcc.  What was happening was
>>>> that gcc can stack align to any value on x86-64 except 16.  The
>>>> reason is that gcc assumes that the stack is always 16-byte aligned,
>>>> which is not actually the case in the kernel.
>>>>
>>>
>>> Apologies for introducing this breakage. It seemed like an obvious and
>>> simple cleanup, so I didn't even bother to mention it in the commit
>>> log, but if the kernel does not guarantee 16 byte alignment, I guess
>>> we should revert to the old method. If SSE instructions are the only
>>> ones that require this alignment, then I suppose not having an ABI
>>> conforming stack pointer should not be an issue in general.
>>
>> Here's what I think is really going on.  This is partially from
>> memory, so I could be off base.  The kernel is up against
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53383, which means that,
>> on some GCC versions (like the bad one and maybe even current ones),
>> things compiled without -mno-sse can't have the stack alignment set
>> properly.  IMO we should fix this in the affected code, not the entry
>> code.  In fact, I think that fixing it in the entry code won't even
>> fully fix it because modern GCC will compile the rest of the kernel
>> with 8-byte alignment and the stack will get randomly unaligned (GCC
>> 4.8 and newer).
>>
>> Can we just add __attribute__((force_align_arg_pointer)) to the
>> affected functions?  Maybe have:
>>
>> #define __USES_SSE __attribute__((force_align_arg_pointer))
>>
>> on affected gcc versions?
>>
>> ***HOWEVER***
>>
>> I think this is missing the tree for the supposed forest.  The actual
>> affected code appears to be:
>>
>> static int chacha20_simd(struct blkcipher_desc *desc, struct scatterlist 
>> *dst,
>>  struct scatterlist *src, unsigned int nbytes)
>> {
>> u32 *state, state_buf[16 + (CHACHA20_STATE_ALIGN / sizeof(u32)) - 1];
>>
>> ...
>>
>> state = (u32 *)roundup((uintptr_t)state_buf, CHACHA20_STATE_ALIGN);
>>
>> gcc presumably infers (incorrectly) that state_buf is 16-byte aligned
>> and optimizes out the roundup.  How about just declaring an actual
>> __aligned(16) buffer, marking the function
>> __attribute__((force_align_arg_pointer)), and being done with it?
>> After all, we need that forcible alignment on *all* gcc versions.
>>
>
> Actually, the breakage is introduced by the patch Herbert refers to
>
> https://patchwork.kernel.org/patch/9468391/
>
> where the state is replaced by a simple
>
> u32 state[16] __aligned(CHACHA20_STATE_ALIGN);
>
> which seemed harmless enough to me. So the code above works fine.

So how about just the one-line patch of adding the
force_align_arg_pointer?  Would that solve the problem?


Re: x86-64: Maintain 16-byte stack alignment

2017-01-10 Thread Andy Lutomirski
On Tue, Jan 10, 2017 at 9:05 AM, Linus Torvalds
 wrote:
> On Tue, Jan 10, 2017 at 6:39 AM, Herbert Xu  
> wrote:
>>
>> BTW this is with Debian gcc 4.7.2 which does not allow an 8-byte
>> stack alignment as attempted by the Makefile:
>
> I'm pretty sure we have random asm code that may not maintain a
> 16-byte stack alignment when it calls other code (including, in some
> cases, calling C code).

I suspect so.

If we change this, changing pt_regs might make sense but is kind of
weird.  It also needs to be tested with and without frame pointers.

>
> So I'm not at all convinced that this is a good idea. We shouldn't
> expect 16-byte alignment to be something trustworthy.
>
>  Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-27 Thread Andy Lutomirski
On Tue, Dec 27, 2016 at 6:16 AM, Daniel Borkmann  wrote:
> On 12/27/2016 10:58 AM, Herbert Xu wrote:
>>
>> On Mon, Dec 26, 2016 at 10:08:48AM -0800, Andy Lutomirski wrote:
>>>
>>>
>>> According to Daniel, the networking folks want to let embedded systems
>>> include BPF without requiring the crypto core.
>>
>>
>> Last I checked the IPv4 stack depended on the crypto API so this
>> sounds bogus.
>
>
> I think there's a bit of a mixup here with what I said. To clarify,
> requirement back then from tracing folks was that bpf engine and
> therefore bpf syscall can be build w/o networking enabled for small
> devices, so dependencies preferably need to be kept on a absolute
> minimum, same counts for either making it suddenly a depend on
> CRYPTO or a select CRYPTO for just those few lines that can be
> pulled in from lib/ code instead.

Somehow I had that in my head as "networking" not "tracing", probably
because of the TCA stuff.  Whoops.

Anyway, I'm rewriting the crypto part of the patch completely based on
Ard's feedback.


Re: [RFC PATCH 4.10 3/6] bpf: Use SHA256 instead of SHA1 for bpf digests

2016-12-26 Thread Andy Lutomirski
On Mon, Dec 26, 2016 at 5:36 PM, Alexei Starovoitov
 wrote:
> On Sat, Dec 24, 2016 at 08:59:53PM +0100, Daniel Borkmann wrote:
>> On 12/24/2016 03:22 AM, Andy Lutomirski wrote:
>> >BPF digests are intended to be used to avoid reloading programs that
>> >are already loaded.  For use cases (CRIU?) where untrusted programs
>> >are involved, intentional hash collisions could cause the wrong BPF
>> >program to execute.  Additionally, if BPF digests are ever used
>> >in-kernel to skip verification, a hash collision could give privilege
>> >escalation directly.
>>
>> Just for the record, digests will never ever be used to skip the
>> verification step, so I don't know why this idea even comes up
>> here (?) or is part of the changelog? As this will never be done
>> anyway, rather drop that part so we can avoid confusion on this?
>
> +1 to what Daniel said above.
>
> For the others let me explain what this patch set is actually
> trying to accomplish.

The patch:

a) cleans up the code and

b) uses a cryptographic hash that is actually believed to satisfy the
definition of a cryptographic hash.

There's no excuse for not doing b.

> and I have an obvious NACK for bpf related patches 3,4,5,6.

Did you *read* the ones that were pure cleanups?

>
> sha1 is 20 bytes which is already a bit long to print and copy paste by 
> humans.
> whereas 4 byte jhash is a bit too short, since collisions are not that rare
> and may lead to incorrect assumptions from the users that develop the 
> programs.
> I would prefer something in 6-10 byte range that prevents collisions most of
> the time and short to print as hex, but I don't know of anything like this
> in the existing kernel and inventing bpf specific hash is not great.
> Another requirement for debugging (and prog_digest) that user space
> should be able to produce the same hash without asking kernel, so
> sha1 fits that as well, since it's well known and easy to put into library.

Then truncate them in user space.


Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-26 Thread Andy Lutomirski
On Mon, Dec 26, 2016 at 9:51 AM, Ard Biesheuvel
 wrote:
> On 26 December 2016 at 07:57, Herbert Xu  wrote:
>> On Sat, Dec 24, 2016 at 09:57:53AM -0800, Andy Lutomirski wrote:
>>>
>>> I actually do use incremental hashing later on.   BPF currently
>>> vmallocs() a big temporary buffer just so it can fill it and hash it.
>>> I change it to hash as it goes.
>>
>> How much data is this supposed to hash on average? If it's a large
>> amount then perhaps using the existing crypto API would be a better
>> option than adding this.
>>
>
> This is a good point actually: you didn't explain *why* BPF shouldn't
> depend on the crypto API.

According to Daniel, the networking folks want to let embedded systems
include BPF without requiring the crypto core.

At some point, I'd also like to use modern hash functions for module
verification.  If doing so would require the crypto core to be
available when modules are loaded, then the crypto core couldn't be
modular.  (Although it occurs to me that my patches get that wrong --
if this change happens, I need to split the code so that the library
functions can be built in even if CRYPTO=m.)

Daniel, would you be okay with BPF selecting CRYPTO and CRYPTO_HASH?

Also, as a bikeshed thought: I could call the functions
sha256_init_direct(), etc.  Then there wouldn't be namespace
collisions and the fact that they bypass accelerated drivers would be
more obvious.
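
For concreteness, the declarations would be something like (naming sketch
only, not final API):

void sha256_init_direct(struct sha256_state *sctx);
void sha256_update_direct(struct sha256_state *sctx, const u8 *data,
			  unsigned int len);
void sha256_final_direct(struct sha256_state *sctx,
			 u8 out[SHA256_DIGEST_SIZE]);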

--Andy


Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-24 Thread Andy Lutomirski
On Sat, Dec 24, 2016 at 2:33 AM, Ard Biesheuvel
 wrote:
> Hi Andy,
>
> On 24 December 2016 at 02:22, Andy Lutomirski  wrote:
>> There are some pieces of kernel code that want to compute SHA256
>> directly without going through the crypto core.  Adjust the exported
>> API to decouple it from the crypto core.
>>
>
> There are a bunch of things happening at the same time in this patch,
> i.e., unnecessary renames of functions with static linkage, return
> type changes to the base prototypes (int (*)(...) to void (*)(...))
> and the change for the base functions to take a struct sha256_state
> ctx rather than a shash_desc. I suppose you are mainly after the
> latter, so could we please drop the other changes?
>
> For the name clashes, could we simply use the crypto_ prefix for the
> globally visible functions rather than using names that are already in
> use? (and having to go around clean up the conflicts)
> As for the return type changes, the base functions intentionally
> return int to allow tail calls from the functions exposed by the
> crypto API (whose prototypes cannot be changed). Unlikely to matter in
> the grand scheme of things (especially now that the base layer
> consists of static inline functions primarily), but it is equally
> pointless to go around and change them to return void IMO.
>
> So what remains is the struct shash_desc to struct sha256_state
> change, which makes sense given that you are after a sha256_digest()
> function that does not require the crypto API. But it seems your use
> case does not rely on incremental hashing, and so there is no reason
> for the state to be exposed outside of the implementation, and we
> could simply expose a crypto_sha256_digest() routine from the
> sha256_generic.c implementation instead.

I actually do use incremental hashing later on.   BPF currently
vmallocs() a big temporary buffer just so it can fill it and hash it.
I change it to hash as it goes.

I painted the bike shed the other way because I thought that crypto_
names should indicate that they're the versions compatible with the
crypto API, but I take your point about churn.  Part of the reason I
didn't want to use crypto_sha256_update is because that function is
currently like this:

int crypto_sha256_update(struct shash_desc *desc, const u8 *data,
  unsigned int len)

and I wanted to avoid churn.  The sha256_update() functions scattered
all over were static, so I didn't worry about them.

I'm going to give this another try as a split-up series that tries to
avoid making any changes beyond simple function renames to the
drivers.

>
> Also, I strongly feel that crypto and other security related patches
> should be tested before being posted, even if they are only RFC,
> especially when they are posted by high profile kernel devs like
> yourself. (Your code incorrectly calls crypto_sha2_final() in a couple
> of places, resulting in the finalization being performed twice, once
> with the accelerated block transform and again with the generic
> transform)
>

I tested it, albeit poorly.  I wanted feedback on the API (thanks!)
and I figured I could more carefully check the implementation once the
API survives a bit of review.  Since it looks like I have to rework
this, I'd need to re-test anyway.

>> I suspect this will very slightly speed up the SHA256 shash operations
>> as well by reducing the amount of indirection involved.
>>
>
> I think you have a valid point when it comes to the complexity of the
> crypto API in general. But the struct sha256_state is embedded in the
> shash_desc rather than referred to via a pointer, so the level of
> indirection does not appear to change. And given how 99.9% of the
> SHA256 execution time is spent in the block transform routine anyway,
> I expect the performance delta to be in the noise tbh.

s/very slightly/negligibly?  There's an extra speedup from avoiding a
variable-length stack allocation, but that probably doesn't matter
much either.

>
> Finally, another thing to keep in mind is that the base layers of
> SHA-1, SHA-256 and SHA-512 are intentionally structured in the same
> way. If there is a need for a digest() entry point, I'd prefer to add
> them for all flavours.

I want to get sha256 right first.  Once it's in good shape, making the
same changes to the other variants should be easy.

>
> Whether this still belongs under crypto or under lib/sha256.c as a
> library function (allowing archs to override it) is open for debate.
> If hashing BPF programs becomes a hot spot, we probably have bigger
> problems.
>
> Regards,
> Ard.
>
> P.S. I do take your point regarding the arch_sha256_block_transform()
> proposed in your follow up email, but there are some details 

Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-23 Thread Andy Lutomirski
On Fri, Dec 23, 2016 at 6:22 PM, Andy Lutomirski  wrote:
> There are some pieces of kernel code that want to compute SHA256
> directly without going through the crypto core.  Adjust the exported
> API to decouple it from the crypto core.
>
> I suspect this will very slightly speed up the SHA256 shash operations
> as well by reducing the amount of indirection involved.
>

I should also mention: there's a nice potential cleanup that's
possible on top of this.  Currently, most of the accelerated SHA256
implementations just swap out the block function.  Another approach to
enabling this would be to restructure sha256_update along the lines
of:

sha256_block_fn *fn = arch_sha256_block_fn(len);
sha256_base_do_update(sctx, data, len, fn);

The idea being that arch code can decide whether to use an accelerated
block function based on context (x86, for example, can't always use
xmm regs) and length (on x86, using the accelerated versions for short
digests is very slow due to the state save/restore that happens) and
then the core code can just use it.
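
As a rough sketch of the x86 side (the helper names and the length cutoff
here are made up, not existing API):

static sha256_block_fn *arch_sha256_block_fn(unsigned int len)
{
	/*
	 * Fall back to the generic transform when the FPU isn't usable
	 * (e.g. in an interrupt) or when the input is short enough that
	 * the state save/restore would dominate the cost.
	 */
	if (!irq_fpu_usable() || len < 4 * SHA256_BLOCK_SIZE)
		return sha256_generic_block_fn;

	return sha256_accel_block_fn;
}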

This would allow a lot of the boilerplate that this patch was forced
to modify to be deleted outright.

--Andy


[RFC PATCH 4.10 3/6] bpf: Use SHA256 instead of SHA1 for bpf digests

2016-12-23 Thread Andy Lutomirski
BPF digests are intended to be used to avoid reloading programs that
are already loaded.  For use cases (CRIU?) where untrusted programs
are involved, intentional hash collisions could cause the wrong BPF
program to execute.  Additionally, if BPF digests are ever used
in-kernel to skip verification, a hash collision could give privilege
escalation directly.

SHA1 is no longer considered adequately collision-resistant (see, for
example, all the major browsers dropping support for SHA1
certificates).  Use SHA256 instead.

I moved the digest field to keep all of the bpf program metadata in
the same cache line.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 include/linux/filter.h | 11 +++
 init/Kconfig   |  1 +
 kernel/bpf/core.c  | 41 +++--
 3 files changed, 11 insertions(+), 42 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 702314253797..23df2574e30c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -14,7 +14,8 @@
 #include 
 #include 
 #include 
-#include 
+
+#include 
 
 #include 
 
@@ -408,11 +409,11 @@ struct bpf_prog {
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
-   u32 digest[SHA_DIGEST_WORDS]; /* Program digest */
struct bpf_prog_aux *aux;   /* Auxiliary fields */
struct sock_fprog_kern  *orig_prog; /* Original BPF program */
unsigned int(*bpf_func)(const void *ctx,
const struct bpf_insn *insn);
+   u8  digest[SHA256_DIGEST_SIZE]; /* Program digest */
/* Instructions for interpreter */
union {
struct sock_filter  insns[0];
@@ -519,12 +520,6 @@ static inline u32 bpf_prog_insn_size(const struct bpf_prog 
*prog)
return prog->len * sizeof(struct bpf_insn);
 }
 
-static inline u32 bpf_prog_digest_scratch_size(const struct bpf_prog *prog)
-{
-   return round_up(bpf_prog_insn_size(prog) +
-   sizeof(__be64) + 1, SHA_MESSAGE_BYTES);
-}
-
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
return max(sizeof(struct bpf_prog),
diff --git a/init/Kconfig b/init/Kconfig
index 223b734abccd..5a4e2d99cc38 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1634,6 +1634,7 @@ config BPF_SYSCALL
bool "Enable bpf() system call"
select ANON_INODES
select BPF
+   select CRYPTO_SHA256_LIB
default n
help
  Enable the bpf() system call that allows to manipulate eBPF
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1eb4f1303756..911993863799 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -148,22 +148,18 @@ void __bpf_prog_free(struct bpf_prog *fp)
 
 int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
-   const u32 bits_offset = SHA_MESSAGE_BYTES - sizeof(__be64);
-   u32 raw_size = bpf_prog_digest_scratch_size(fp);
-   u32 ws[SHA_WORKSPACE_WORDS];
-   u32 i, bsize, psize, blocks;
+   struct sha256_state sha;
+   u32 i, psize;
struct bpf_insn *dst;
bool was_ld_map;
-   u8 *raw, *todo;
-   __be32 *result;
-   __be64 *bits;
+   u8 *raw;
 
-   raw = vmalloc(raw_size);
+   psize = bpf_prog_insn_size(fp);
+   raw = vmalloc(psize);
if (!raw)
return -ENOMEM;
 
-   sha_init(fp->digest);
-   memset(ws, 0, sizeof(ws));
+   sha256_init(&sha);
 
/* We need to take out the map fd for the digest calculation
 * since they are unstable from user space side.
@@ -188,30 +184,7 @@ int bpf_prog_calc_digest(struct bpf_prog *fp)
}
}
 
-   psize = bpf_prog_insn_size(fp);
-   memset(&raw[psize], 0, raw_size - psize);
-   raw[psize++] = 0x80;
-
-   bsize  = round_up(psize, SHA_MESSAGE_BYTES);
-   blocks = bsize / SHA_MESSAGE_BYTES;
-   todo   = raw;
-   if (bsize - psize >= sizeof(__be64)) {
-   bits = (__be64 *)(todo + bsize - sizeof(__be64));
-   } else {
-   bits = (__be64 *)(todo + bsize + bits_offset);
-   blocks++;
-   }
-   *bits = cpu_to_be64((psize - 1) << 3);
-
-   while (blocks--) {
-   sha_transform(fp->digest, todo, ws);
-   todo += SHA_MESSAGE_BYTES;
-   }
-
-   result = (__force __be32 *)fp->digest;
-   for (i = 0; i < SHA_DIGEST_WORDS; i++)
-   result[i] = cpu_to_be32(fp->digest[i]);
-
+   sha256_finup(&sha, raw, psize, fp->digest);
vfree(raw);
return 0;
 }
-- 
2.9.3



[RFC PATCH 4.10 2/6] crypto/sha256: Make the sha256 library functions selectable

2016-12-23 Thread Andy Lutomirski
This will let other kernel code call into sha256_init(), etc. without
pulling in the core crypto code.

Signed-off-by: Andy Lutomirski 
---
 crypto/Kconfig  | 8 
 crypto/Makefile | 2 +-
 crypto/sha256_generic.c | 4 
 include/crypto/sha.h| 4 
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 160f08e721cc..85a2b3440c2b 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -10,6 +10,13 @@ config XOR_BLOCKS
 source "crypto/async_tx/Kconfig"
 
 #
+# Cryptographic algorithms that are usable without the Crypto API.
+# None of these should have visible config options.
+#
+config CRYPTO_SHA256_LIB
+   bool
+
+#
 # Cryptographic API Configuration
 #
 menuconfig CRYPTO
@@ -763,6 +770,7 @@ config CRYPTO_SHA512_MB
 
 config CRYPTO_SHA256
tristate "SHA224 and SHA256 digest algorithm"
+   select CRYPTO_SHA256_LIB
select CRYPTO_HASH
help
  SHA256 secure hash standard (DFIPS 180-2).
diff --git a/crypto/Makefile b/crypto/Makefile
index b8f0e3eb0791..d147d4c911f5 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -71,7 +71,7 @@ obj-$(CONFIG_CRYPTO_RMD160) += rmd160.o
 obj-$(CONFIG_CRYPTO_RMD256) += rmd256.o
 obj-$(CONFIG_CRYPTO_RMD320) += rmd320.o
 obj-$(CONFIG_CRYPTO_SHA1) += sha1_generic.o
-obj-$(CONFIG_CRYPTO_SHA256) += sha256_generic.o
+obj-$(CONFIG_CRYPTO_SHA256_LIB) += sha256_generic.o
 obj-$(CONFIG_CRYPTO_SHA512) += sha512_generic.o
 obj-$(CONFIG_CRYPTO_SHA3) += sha3_generic.o
 obj-$(CONFIG_CRYPTO_WP512) += wp512.o
diff --git a/crypto/sha256_generic.c b/crypto/sha256_generic.c
index f2747893402c..9df71ac66dc4 100644
--- a/crypto/sha256_generic.c
+++ b/crypto/sha256_generic.c
@@ -261,6 +261,8 @@ void sha256_final(struct sha256_state *sctx, u8 *out)
 }
 EXPORT_SYMBOL(sha256_final);
 
+#ifdef CONFIG_CRYPTO_HASH
+
 static int crypto_sha256_update(struct shash_desc *desc, const u8 *data,
unsigned int len)
 {
@@ -328,6 +330,8 @@ static void __exit sha256_generic_mod_fini(void)
 module_init(sha256_generic_mod_init);
 module_exit(sha256_generic_mod_fini);
 
+#endif /* CONFIG_CRYPTO_HASH */
+
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("SHA-224 and SHA-256 Secure Hash Algorithm");
 
diff --git a/include/crypto/sha.h b/include/crypto/sha.h
index 2b6978471605..381ba7fa5e3f 100644
--- a/include/crypto/sha.h
+++ b/include/crypto/sha.h
@@ -96,6 +96,8 @@ extern int crypto_sha1_update(struct shash_desc *desc, const 
u8 *data,
 extern int crypto_sha1_finup(struct shash_desc *desc, const u8 *data,
 unsigned int len, u8 *hash);
 
+#ifdef CONFIG_CRYPTO_SHA256_LIB
+
 static inline void sha256_init(struct sha256_state *sctx)
 {
sctx->state[0] = SHA256_H0;
@@ -121,6 +123,8 @@ static inline void sha256_finup(struct sha256_state *sctx, 
const u8 *data,
sha256_final(sctx, hash);
 }
 
+#endif /* CONFIG_CRYPTO_SHA256_LIB */
+
 extern int crypto_sha512_update(struct shash_desc *desc, const u8 *data,
  unsigned int len);
 
-- 
2.9.3



[RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-23 Thread Andy Lutomirski
There are some pieces of kernel code that want to compute SHA256
directly without going through the crypto core.  Adjust the exported
API to decouple it from the crypto core.

I suspect this will very slightly speed up the SHA256 shash operations
as well by reducing the amount of indirection involved.

Cc: Ard Biesheuvel 
Cc: Herbert Xu 
Signed-off-by: Andy Lutomirski 
---
 arch/arm/crypto/sha2-ce-glue.c  | 10 ---
 arch/arm/crypto/sha256_glue.c   | 23 ++-
 arch/arm/crypto/sha256_neon_glue.c  | 34 +++--
 arch/arm64/crypto/sha2-ce-glue.c| 13 
 arch/arm64/crypto/sha256-glue.c | 59 +
 arch/x86/crypto/sha256_ssse3_glue.c | 46 +
 arch/x86/purgatory/purgatory.c  |  2 +-
 arch/x86/purgatory/sha256.c | 25 ++--
 arch/x86/purgatory/sha256.h | 22 --
 crypto/sha256_generic.c | 50 +++
 include/crypto/sha.h| 29 ++
 include/crypto/sha256_base.h| 40 -
 12 files changed, 184 insertions(+), 169 deletions(-)
 delete mode 100644 arch/x86/purgatory/sha256.h

diff --git a/arch/arm/crypto/sha2-ce-glue.c b/arch/arm/crypto/sha2-ce-glue.c
index 0755b2d657f3..8832c2f85591 100644
--- a/arch/arm/crypto/sha2-ce-glue.c
+++ b/arch/arm/crypto/sha2-ce-glue.c
@@ -38,7 +38,7 @@ static int sha2_ce_update(struct shash_desc *desc, const u8 
*data,
return crypto_sha256_arm_update(desc, data, len);
 
kernel_neon_begin();
-   sha256_base_do_update(desc, data, len,
+   sha256_base_do_update(sctx, data, len,
  (sha256_block_fn *)sha2_ce_transform);
kernel_neon_end();
 
@@ -48,17 +48,19 @@ static int sha2_ce_update(struct shash_desc *desc, const u8 
*data,
 static int sha2_ce_finup(struct shash_desc *desc, const u8 *data,
 unsigned int len, u8 *out)
 {
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
if (!may_use_simd())
return crypto_sha256_arm_finup(desc, data, len, out);
 
kernel_neon_begin();
if (len)
-   sha256_base_do_update(desc, data, len,
+   sha256_base_do_update(sctx, data, len,
  (sha256_block_fn *)sha2_ce_transform);
-   sha256_base_do_finalize(desc, (sha256_block_fn *)sha2_ce_transform);
+   sha256_base_do_finalize(sctx, (sha256_block_fn *)sha2_ce_transform);
kernel_neon_end();
 
-   return sha256_base_finish(desc, out);
+   return crypto_sha2_final(desc, out);
 }
 
 static int sha2_ce_final(struct shash_desc *desc, u8 *out)
diff --git a/arch/arm/crypto/sha256_glue.c b/arch/arm/crypto/sha256_glue.c
index a84e869ef900..405a29a9a9d3 100644
--- a/arch/arm/crypto/sha256_glue.c
+++ b/arch/arm/crypto/sha256_glue.c
@@ -36,27 +36,34 @@ asmlinkage void sha256_block_data_order(u32 *digest, const 
void *data,
 int crypto_sha256_arm_update(struct shash_desc *desc, const u8 *data,
 unsigned int len)
 {
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
/* make sure casting to sha256_block_fn() is safe */
BUILD_BUG_ON(offsetof(struct sha256_state, state) != 0);
 
-   return sha256_base_do_update(desc, data, len,
+   sha256_base_do_update(sctx, data, len,
(sha256_block_fn *)sha256_block_data_order);
+   return 0;
 }
 EXPORT_SYMBOL(crypto_sha256_arm_update);
 
-static int sha256_final(struct shash_desc *desc, u8 *out)
+static int sha256_arm_final(struct shash_desc *desc, u8 *out)
 {
-   sha256_base_do_finalize(desc,
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
+   sha256_base_do_finalize(sctx,
(sha256_block_fn *)sha256_block_data_order);
-   return sha256_base_finish(desc, out);
+   return crypto_sha2_final(desc, out);
 }
 
 int crypto_sha256_arm_finup(struct shash_desc *desc, const u8 *data,
unsigned int len, u8 *out)
 {
-   sha256_base_do_update(desc, data, len,
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
+   sha256_base_do_update(sctx, data, len,
  (sha256_block_fn *)sha256_block_data_order);
-   return sha256_final(desc, out);
+   return crypto_sha2_final(desc, out);
 }
 EXPORT_SYMBOL(crypto_sha256_arm_finup);
 
@@ -64,7 +71,7 @@ static struct shash_alg algs[] = { {
.digestsize =   SHA256_DIGEST_SIZE,
.init   =   sha256_base_init,
.update =   crypto_sha256_arm_update,
-   .final  =   sha256_final,
+   .final  =   sha256_arm_final,
.finup  =   crypto_sha256_arm_finup,
.descsize   =   sizeof(struct sha256_state),
.base   =   {
@@ -79,7 +86,7 @@ static struct shash_alg algs

[RFC PATCH 4.10 4/6] bpf: Avoid copying the entire BPF program when hashing it

2016-12-23 Thread Andy Lutomirski
The sha256 helpers can consume a message incrementally, so there's no need
to allocate a buffer to store the whole blob to be hashed.

This may be a slight slowdown for very long messages because gcc can't
inline the sha256_update() calls.  For reasonably-sized programs,
however, this should be a considerable speedup as vmalloc() is quite
slow.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 kernel/bpf/core.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 911993863799..1c2931f505af 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -149,43 +149,37 @@ void __bpf_prog_free(struct bpf_prog *fp)
 int bpf_prog_calc_digest(struct bpf_prog *fp)
 {
struct sha256_state sha;
-   u32 i, psize;
-   struct bpf_insn *dst;
+   u32 i;
bool was_ld_map;
-   u8 *raw;
-
-   psize = bpf_prog_insn_size(fp);
-   raw = vmalloc(psize);
-   if (!raw)
-   return -ENOMEM;
 
sha256_init(&sha);
 
/* We need to take out the map fd for the digest calculation
 * since they are unstable from user space side.
 */
-   dst = (void *)raw;
for (i = 0, was_ld_map = false; i < fp->len; i++) {
-   dst[i] = fp->insnsi[i];
+   struct bpf_insn insn = fp->insnsi[i];
+
if (!was_ld_map &&
-   dst[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
-   dst[i].src_reg == BPF_PSEUDO_MAP_FD) {
+   insn.code == (BPF_LD | BPF_IMM | BPF_DW) &&
+   insn.src_reg == BPF_PSEUDO_MAP_FD) {
was_ld_map = true;
-   dst[i].imm = 0;
+   insn.imm = 0;
} else if (was_ld_map &&
-  dst[i].code == 0 &&
-  dst[i].dst_reg == 0 &&
-  dst[i].src_reg == 0 &&
-  dst[i].off == 0) {
+  insn.code == 0 &&
+  insn.dst_reg == 0 &&
+  insn.src_reg == 0 &&
+  insn.off == 0) {
was_ld_map = false;
-   dst[i].imm = 0;
+   insn.imm = 0;
} else {
was_ld_map = false;
}
+
+   sha256_update(&sha, (const u8 *)&insn, sizeof(insn));
}
 
-   sha256_finup(&sha, raw, psize, fp->digest);
-   vfree(raw);
+   sha256_final(&sha, fp->digest);
return 0;
 }
 
-- 
2.9.3



[RFC PATCH 4.10 5/6] bpf: Rename fdinfo's prog_digest to prog_sha256

2016-12-23 Thread Andy Lutomirski
This makes it easier to add another digest algorithm down the road
if needed.  It also serves to force any programs that might have
been written against a kernel that had 'prog_digest' to be updated.

This shouldn't violate any stable API policies, as no released
kernel has ever had 'prog_digest'.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 kernel/bpf/syscall.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e89acea22ecf..956370b80296 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -694,7 +694,7 @@ static void bpf_prog_show_fdinfo(struct seq_file *m, struct 
file *filp)
seq_printf(m,
   "prog_type:\t%u\n"
   "prog_jited:\t%u\n"
-  "prog_digest:\t%s\n"
+  "prog_sha256:\t%s\n"
   "memlock:\t%llu\n",
   prog->type,
   prog->jited,
-- 
2.9.3



[RFC PATCH 4.10 6/6] net: Rename TCA*BPF_DIGEST to ..._SHA256

2016-12-23 Thread Andy Lutomirski
This makes it easier to add another digest algorithm down the road if
needed.  It also serves to force any programs that might have been
written against a kernel that had the old field name to notice the
change and make any necessary changes.

This shouldn't violate any stable API policies, as no released kernel
has ever had TCA*BPF_DIGEST.

Cc: Daniel Borkmann 
Cc: Alexei Starovoitov 
Signed-off-by: Andy Lutomirski 
---
 include/uapi/linux/pkt_cls.h   | 2 +-
 include/uapi/linux/tc_act/tc_bpf.h | 2 +-
 net/sched/act_bpf.c| 2 +-
 net/sched/cls_bpf.c| 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h
index cb4bcdc58543..ac6b300c1550 100644
--- a/include/uapi/linux/pkt_cls.h
+++ b/include/uapi/linux/pkt_cls.h
@@ -397,7 +397,7 @@ enum {
TCA_BPF_NAME,
TCA_BPF_FLAGS,
TCA_BPF_FLAGS_GEN,
-   TCA_BPF_DIGEST,
+   TCA_BPF_SHA256,
__TCA_BPF_MAX,
 };
 
diff --git a/include/uapi/linux/tc_act/tc_bpf.h 
b/include/uapi/linux/tc_act/tc_bpf.h
index a6b88a6f7f71..eae18a7430eb 100644
--- a/include/uapi/linux/tc_act/tc_bpf.h
+++ b/include/uapi/linux/tc_act/tc_bpf.h
@@ -27,7 +27,7 @@ enum {
TCA_ACT_BPF_FD,
TCA_ACT_BPF_NAME,
TCA_ACT_BPF_PAD,
-   TCA_ACT_BPF_DIGEST,
+   TCA_ACT_BPF_SHA256,
__TCA_ACT_BPF_MAX,
 };
 #define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1)
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index 1c60317f0121..3868a66d0b24 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -123,7 +123,7 @@ static int tcf_bpf_dump_ebpf_info(const struct tcf_bpf 
*prog,
nla_put_string(skb, TCA_ACT_BPF_NAME, prog->bpf_name))
return -EMSGSIZE;
 
-   nla = nla_reserve(skb, TCA_ACT_BPF_DIGEST,
+   nla = nla_reserve(skb, TCA_ACT_BPF_SHA256,
  sizeof(prog->filter->digest));
if (nla == NULL)
return -EMSGSIZE;
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index adc776048d1a..6fc110321621 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -555,7 +555,7 @@ static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog 
*prog,
nla_put_string(skb, TCA_BPF_NAME, prog->bpf_name))
return -EMSGSIZE;
 
-   nla = nla_reserve(skb, TCA_BPF_DIGEST, sizeof(prog->filter->digest));
+   nla = nla_reserve(skb, TCA_BPF_SHA256, sizeof(prog->filter->digest));
if (nla == NULL)
return -EMSGSIZE;
 
-- 
2.9.3



[RFC PATCH 4.10 0/6] Switch BPF's digest to SHA256

2016-12-23 Thread Andy Lutomirski
Since there are plenty of uses for the new-in-4.10 BPF digest feature
that would be problematic if malicious users could produce collisions,
the BPF digest should be collision-resistant.  SHA-1 is no longer
considered collision-resistant, so switch it to SHA-256.

The actual switchover is trivial.  Most of this series consists of
cleanups to the SHA256 code to make it usable as a standalone library
(since BPF should not depend on crypto).

The cleaned up library is much more user-friendly than the SHA-1 code,
so this also significantly tidies up the BPF digest code.

This is intended for 4.10.  If this series misses 4.10 and nothing
takes its place, then we'll have an unpleasant ABI stability
situation.

Andy Lutomirski (6):
  crypto/sha256: Refactor the API so it can be used without shash
  crypto/sha256: Make the sha256 library functions selectable
  bpf: Use SHA256 instead of SHA1 for bpf digests
  bpf: Avoid copying the entire BPF program when hashing it
  bpf: Rename fdinfo's prog_digest to prog_sha256
  net: Rename TCA*BPF_DIGEST to ..._SHA256

 arch/arm/crypto/sha2-ce-glue.c  | 10 +++---
 arch/arm/crypto/sha256_glue.c   | 23 +-
 arch/arm/crypto/sha256_neon_glue.c  | 34 ++--
 arch/arm64/crypto/sha2-ce-glue.c| 13 
 arch/arm64/crypto/sha256-glue.c | 59 +++---
 arch/x86/crypto/sha256_ssse3_glue.c | 46 ---
 arch/x86/purgatory/purgatory.c  |  2 +-
 arch/x86/purgatory/sha256.c | 25 ++-
 arch/x86/purgatory/sha256.h | 22 -
 crypto/Kconfig  |  8 +
 crypto/Makefile |  2 +-
 crypto/sha256_generic.c | 54 +++
 include/crypto/sha.h| 33 ---
 include/crypto/sha256_base.h| 40 +++
 include/linux/filter.h  | 11 ++-
 include/uapi/linux/pkt_cls.h|  2 +-
 include/uapi/linux/tc_act/tc_bpf.h  |  2 +-
 init/Kconfig|  1 +
 kernel/bpf/core.c   | 63 +
 kernel/bpf/syscall.c|  2 +-
 net/sched/act_bpf.c |  2 +-
 net/sched/cls_bpf.c |  2 +-
 22 files changed, 225 insertions(+), 231 deletions(-)
 delete mode 100644 arch/x86/purgatory/sha256.h

-- 
2.9.3



Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)

2016-12-23 Thread Andy Lutomirski
On Fri, Dec 23, 2016 at 8:23 AM, Andy Lutomirski  wrote:
> On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann  wrote:
>> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>>
>>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>>
>>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>>
>>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>>
>> [...]
>>
>>>>> The hashing is not a proper sha1 either, unfortunately. I think that
>>>>> is why it will have a custom implementation in iproute2?
>>>>
>>>>
>>>> Still trying to catch up on this admittedly bit confusing thread. I
>>>> did run automated tests over couple of days comparing the data I got
>>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>>> cases varying from min to max possible program sizes. In the process
>>>> of testing, as you might have seen on netdev, I found couple of other
>>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>>> do you or Andy or anyone participating in claiming this have any
>>>> concrete data or test cases that suggests something different? If yes,
>>>> I'm very curious to hear about it and willing to fix it up, of course.
>>>> When I'm back from pto I'll prep and cook up my test suite to be
>>>> included into the selftests/bpf/, should have done this initially,
>>>> sorry about that. I'll also post something to expose the alg, that
>>>> sounds fine to me.
>>>
>>>
>>> Looking into your code closer, I noticed that you indeed seem to do the
>>> finalization of sha-1 by hand by aligning and padding the buffer
>>> accordingly and also patching in the necessary payload length.
>>>
>>> Apologies for my side for claiming that this is not correct sha1
>>> output, I was only looking at sha_transform and its implementation and
>>> couldn't see the padding and finalization round with embedding the data
>>> length in there and hadn't thought of it being done manually.
>>>
>>> Anyway, is it difficult to get the sha finalization into some common
>>> code library? It is not very bpf specific and crypto code reviewers
>>> won't find it there at all.
>>
>>
>> Yes, sure, I'll rework it that way (early next year when I'm back if
>> that's fine with you).
>
> Can we make it SHA-256 before 4.10 comes out, though?  This really
> looks like it will be used in situations where collisions matter and
> it will be exposed to malicious programs, and SHA-1 should not be used
> for new designs for this purpose because it simply isn't long enough.
>
> Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
> misleading.  That should be u8 or, at the very least, __be32.
>
> I realize that there isn't a sha-256 implementation in lib, but would
> it really be so bad to make the bpf digest only work (for now) when
> crypto is enabled?  I would *love* to see the crypto core learn how to
> export simple primitives for direct use without needing the whole
> crypto core, and this doesn't seem particularly hard to do, but I
> don't think that's 4.10 material.

I'm going to try to send out RFC patches for all of this today or
tomorrow.  It doesn't look bad at all.

--Andy


Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)

2016-12-23 Thread Andy Lutomirski
On Fri, Dec 23, 2016 at 3:59 AM, Daniel Borkmann  wrote:
> On 12/23/2016 11:59 AM, Hannes Frederic Sowa wrote:
>>
>> On Fri, 2016-12-23 at 11:04 +0100, Daniel Borkmann wrote:
>>>
>>> On 12/22/2016 05:59 PM, Hannes Frederic Sowa wrote:
>>>>
>>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>
> [...]
>
>>>> The hashing is not a proper sha1 either, unfortunately. I think that
>>>> is why it will have a custom implementation in iproute2?
>>>
>>>
>>> Still trying to catch up on this admittedly bit confusing thread. I
>>> did run automated tests over couple of days comparing the data I got
>>> from fdinfo with the one from af_alg and found no mismatch on the test
>>> cases varying from min to max possible program sizes. In the process
>>> of testing, as you might have seen on netdev, I found couple of other
>>> bugs in bpf code along the way and fixed them up as well. So my question,
>>> do you or Andy or anyone participating in claiming this have any
>>> concrete data or test cases that suggests something different? If yes,
>>> I'm very curious to hear about it and willing to fix it up, of course.
>>> When I'm back from pto I'll prep and cook up my test suite to be
>>> included into the selftests/bpf/, should have done this initially,
>>> sorry about that. I'll also post something to expose the alg, that
>>> sounds fine to me.
>>
>>
>> Looking into your code closer, I noticed that you indeed seem to do the
>> finalization of sha-1 by hand by aligning and padding the buffer
>> accordingly and also patching in the necessary payload length.
>>
>> Apologies for my side for claiming that this is not correct sha1
>> output, I was only looking at sha_transform and its implementation and
>> couldn't see the padding and finalization round with embedding the data
>> length in there and hadn't thought of it being done manually.
>>
>> Anyway, is it difficult to get the sha finalization into some common
>> code library? It is not very bpf specific and crypto code reviewers
>> won't find it there at all.
>
>
> Yes, sure, I'll rework it that way (early next year when I'm back if
> that's fine with you).

Can we make it SHA-256 before 4.10 comes out, though?  This really
looks like it will be used in situations where collisions matter and
it will be exposed to malicious programs, and SHA-1 should not be used
for new designs for this purpose because it simply isn't long enough.

Also, a SHA-1 digest isn't a pile of u32s, so u32 digest[...] is very
misleading.  That should be u8 or, at the very least, __be32.
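
Something like this instead (illustrative only):

	u8 digest[SHA1_DIGEST_SIZE];	/* raw big-endian digest bytes */

rather than the current

	u32 digest[SHA_DIGEST_WORDS];	/* reads as host-endian words */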

I realize that there isn't a sha-256 implementation in lib, but would
it really be so bad to make the bpf digest only work (for now) when
crypto is enabled?  I would *love* to see the crypto core learn how to
export simple primitives for direct use without needing the whole
crypto core, and this doesn't seem particularly hard to do, but I
don't think that's 4.10 material.

--Andy


Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)

2016-12-22 Thread Andy Lutomirski
On Thu, Dec 22, 2016 at 11:34 AM, Alexei Starovoitov
 wrote:
> On Thu, Dec 22, 2016 at 9:25 AM, Andy Lutomirski  wrote:
>> On Thu, Dec 22, 2016 at 8:59 AM, Hannes Frederic Sowa
>>  wrote:
>>> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>>>
>>> We don't prevent ebpf programs being loaded based on the digest but
>>> just to uniquely identify loaded programs from user space and match up
>>> with their source.
>>
>> The commit log talks about using the hash to see if the program has
>> already been compiled and JITted.  If that's done, then a collision
>> will directly cause the kernel to malfunction.
>
> Andy, please read the code.
> we could have used jhash there just as well.
> Collisions are fine.

There's nothing relevant in the code to read yet AFAICS.  The code exports it
via fdinfo, and userspace is expected to do something with it.  The
commit message says:

When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded.

I assume this means that a userspace component is expected to compare
the digest of a loaded program to a digest of a program it wants to
load and to use the result of the comparison to decide whether the
programs are the same.  If that's indeed the case (and it sure sounds
like it, and I fully expect CRIU to do very similar things when
support is added), then malicious collisions do matter.

It's also not quite clear to me why userspace needs to be able to
calculate the digest on its own.  A bpf(BPF_CALC_PROGRAM_DIGEST)
command that takes a BPF program as input and hashes it would seem to
serve the same purpose, and that would allow the kernel to key the
digest and change the algorithm down the road without breaking things.

Regardless, adding a new hash algorithm that is almost-but-not-quite
SHA-1 and making it a stable interface to userspace is not a good
thing.

--Andy


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-22 Thread Andy Lutomirski
On Thu, Dec 22, 2016 at 11:24 AM, George Spelvin
 wrote:
>> Having slept on this, I like it less.  The problem is that a
>> backtracking attacker doesn't just learn H(random seed || entropy_0 ||
>> secret || ...) -- they learn the internal state of the hash function
>> that generates that value.  This probably breaks any attempt to apply
>> security properties of the hash function.  For example, the internal
>> state could easily contain a whole bunch of prior outputs in it
>> verbatim.
>
> The problem is, anti-backtracing is in severe tension with your desire
> to use unmodified SipHash.
>
> First of all, I'd like to repeat that it isn't a design goal of the current
> generator and isn't necessary.

Agreed.

> Now, the main point:  it's not likely to be solvable.
>
> The problem with unmodified SipHash is that is has only 64 bits of
> output.  No mix-back mechanism can get around the fundamental problem
> that that's too small to prevent a brute-force guessing attack.  You need
> wider mix-back.  And getting more output from unmodified SipHash requires
> more finalization rounds, which is expensive.

It could only mix the output back in every two calls, in which case
you can backtrack up to one call but you need to do 2^128 work to
backtrack farther.  But yes, this is getting excessively complicated.

> Finally, your discomfort about an attacker learning the internal state...
> if an attacker knows the key and the input, they can construct the
> internal state.  Yes, we could discard the internal state and construct
> a fresh one on the next call to get_random_int, but what are you going
> to key it with?  What are you going to feed it?  What keeps *that*
> internal state any more secret from an attacker than one that's explicitly
> stored?

I do tend to like Ted's version in which we use batched
get_random_bytes() output.  If it's fast enough, it's simpler and lets
us get the full strength of a CSPRNG.
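
Roughly the shape I have in mind (a sketch only, not necessarily what Ted's
patch does; the buffer size and names are arbitrary):

struct batched_rand {
	u64 buf[16];
	unsigned int pos;
};
static DEFINE_PER_CPU(struct batched_rand, batched_rand);

static u64 get_random_u64_batched(void)
{
	struct batched_rand *b = &get_cpu_var(batched_rand);
	u64 ret;

	if (b->pos == 0) {
		/* Refill the per-CPU batch from the real CSPRNG. */
		get_random_bytes(b->buf, sizeof(b->buf));
		b->pos = ARRAY_SIZE(b->buf);
	}
	ret = b->buf[--b->pos];
	put_cpu_var(batched_rand);
	return ret;
}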

(Aside: some day I want to move all that code from drivers/ to lib/
and teach it to be buildable in userspace, too, so it's easy to play
with it, feed it test vectors, confirm that it's equivalent to a
reference implementation, write up the actual design and try to get
real cryptographers to analyze it, etc.)

> For example, clearly stating the concern over starting new processes
> with predictable layout, and the limits on the fresh entropy supply,
> has made me realize that there *is* a possible source: each exec()
> is passed 128 bits from get_random_bytes in the AT_RANDOM element of
> its auxv.  Since get_random_int() accesses "current" anyway, we could
> store some seed material there rather than using "pid".  While this is
> not fresh for each call to get_random_int, it *is* fresh for each new
> address space whose layout is being randomized.

Hmm, interesting.  Although, for ASLR, we could use get_random_bytes()
directly and be done with it.  It won't be a bottleneck.


Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)

2016-12-22 Thread Andy Lutomirski
On Thu, Dec 22, 2016 at 8:59 AM, Hannes Frederic Sowa
 wrote:
> On Thu, 2016-12-22 at 08:07 -0800, Andy Lutomirski wrote:
>> On Thu, Dec 22, 2016 at 7:51 AM, Hannes Frederic Sowa
>>  wrote:
>> > On Thu, 2016-12-22 at 16:41 +0100, Jason A. Donenfeld wrote:
>> > > Hi Hannes,
>> > >
>> > > On Thu, Dec 22, 2016 at 4:33 PM, Hannes Frederic Sowa
>> > >  wrote:
>> > > > IPv6 you cannot touch anymore. The hashing algorithm is part of uAPI.
>> > > > You don't want to give people new IPv6 addresses with the same stable
>> > > > secret (across reboots) after a kernel upgrade. Maybe they lose
>> > > > connectivity then and it is extra work?
>> > >
>> > > Ahh, too bad. So it goes.
>> >
>> > If no other users survive we can put it into the ipv6 module.
>> >
>> > > > The bpf hash stuff can be changed during this merge window, as it is
>> > > > not yet in a released kernel. Albeit I would probably have preferred
>> > > > something like sha256 here, which can be easily replicated by user
>> > > > space tools (minus the problem of patching out references to not
>> > > > hashable data, which must be zeroed).
>> > >
>> > > Oh, interesting, so time is of the essence then. Do you want to handle
>> > > changing the new eBPF code to something not-SHA1 before it's too late,
>> > > as part of a new patchset that can fast track itself to David? And
>> > > then I can preserve my large series for the next merge window.
>> >
>> > This algorithm should be a non-seeded algorithm, because the hashes
>> > should be stable and verifiable by user space tooling. Thus this would
>> > need a hashing algorithm that is hardened against pre-image
>> > attacks/collision resistance, which siphash is not. I would prefer some
>> > higher order SHA algorithm for that actually.
>> >
>>
>> You mean:
>>
>> commit 7bd509e311f408f7a5132fcdde2069af65fa05ae
>> Author: Daniel Borkmann 
>> Date:   Sun Dec 4 23:19:41 2016 +0100
>>
>> bpf: add prog_digest and expose it via fdinfo/netlink
>>
>>
>> Yes, please!  This actually matters for security -- imagine a
>> malicious program brute-forcing a collision so that it gets loaded
>> wrong.  And this is IMO a use case for SHA-256 or SHA-512/256
>> (preferably the latter).  Speed basically doesn't matter here and
>> Blake2 is both less stable (didn't they slightly change it recently?)
>> and much less well studied.
>
> We don't prevent ebpf programs being loaded based on the digest but
> just to uniquely identify loaded programs from user space and match up
> with their source.

The commit log talks about using the hash to see if the program has
already been compiled and JITted.  If that's done, then a collision
will directly cause the kernel to malfunction.

>> My inclination would have been to seed them with something that isn't
>> exposed to userspace for the precise reason that it would prevent user
>> code from making assumptions about what's in the hash.  But if there's
>> a use case for why user code needs to be able to calculate the hash on
>> its own, then that's fine.  But perhaps the actual fdinfo string
>> should be "sha256:abcd1234..." to give some flexibility down the road.
>>
>> Also:
>>
>> +   result = (__force __be32 *)fp->digest;
>> +   for (i = 0; i < SHA_DIGEST_WORDS; i++)
>> +   result[i] = cpu_to_be32(fp->digest[i]);
>>
>> Everyone, please, please, please don't open-code crypto primitives.
>> Is this and the code above it even correct?  It might be but on a very
>> brief glance it looks wrong to me.  If you're doing this to avoid
>> depending on crypto, then fix crypto so you can pull in the algorithm
>> without pulling in the whole crypto core.
>
> The hashing is not a proper sha1 either, unfortunately. I think that
> is why it will have a custom implementation in iproute2?

Putting on crypto hat:

NAK NAK NAK NAK NAK.  "The Linux kernel invented a new primitive in
2016 when people know better and is going to handle it by porting that
new primitive to userspace" is not a particularly good argument.

Okay, crypto hat back off.

>
> I wondered if bpf program loading should have used the module loading
> infrastructure from the beginning...

That would be way too complicated and would be nasty for the unprivileged cases.

>
>> At the very least, there sh

Re: BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)

2016-12-22 Thread Andy Lutomirski
On Thu, Dec 22, 2016 at 8:28 AM, Jason A. Donenfeld  wrote:
> Hi all,
>
> I don't know what your design requirements are for this. It looks like
> you're generating some kind of crypto digest of a program, and you
> need to avoid collisions. If you'd like to go with a PRF (keyed hash
> function) that uses some kernel secret key, then I'd strongly suggest
> using Keyed-Blake2. Alternatively, if you need for userspace to be
> able to calculate the same hash, and don't want to use some kernel
> secret, then I'd still suggest using Blake2, which will be fast and
> secure.
>
> If you can wait until January, I'll work on a commit adding the
> primitive to the tree. I've already written it and I just need to get
> things cleaned up.
>
>> Blake2 is both less stable (didn't they slightly change it recently?)
>
> No, Blake2 is very stable. It's also extremely secure and has been
> extensively studied. Not to mention it's faster than SHA2. And if you
> want to use it as a PRF, it's obvious better suited and faster to use
> Blake2's keyed PRF mode than HMAC-SHA2.
>
> If you don't care about performance, and you don't want to use a PRF,
> then just use SHA2-256. If you're particularly concerned about certain
> types of attacks, you could go with SHA2-512 truncated to 256 bits,
> but somehow I doubt you need this.

I don't think this cares about performance.  (Well, it cares about
performance, but the verifier will likely dominate the cost by such a
large margin that the hash algo doesn't matter.)  And staying
FIPS-compliant-ish is worth a little bit, so I'd advocate for
something in the SHA2 family.

> If userspace hasn't landed, can we get away with changing this code
> after 4.10? Or should we just fix it before 4.10? Or should we revert
> it before 4.10? Development-policy-things like this I have zero clue
> about, so I heed your guidance.

I think it should be fixed or reverted before 4.10.

--Andy


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-22 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 6:07 PM, Andy Lutomirski  wrote:
> On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
>  wrote:
>> As a separate message, to disentangle the threads, I'd like to
>> talk about get_random_long().
>>
>> After some thinking, I still like the "state-preserving" construct
>> that's equivalent to the current MD5 code.  Yes, we could just do
>> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
>> preserve a bit more.
>>
>> It requires library support from the SipHash code to return the full
>> SipHash state, but I hope that's a fair thing to ask for.
>
> I don't even think it needs that.  This is just adding a
> non-destructive final operation, right?
>
>>
>> Here's my current straw man design for comment.  It's very similar to
>> the current MD5-based design, but feeds all the seed material in the
>> "correct" way, as opposed to Xring directly into the MD5 state.
>>
>> * Each CPU has a (Half)SipHash state vector,
>>   "unsigned long get_random_int_hash[4]".  Unlike the current
>>   MD5 code, we take care to initialize it to an asymmetric state.
>>
>> * There's a global 256-bit random_int_secret (which we could
>>   reseed periodically).
>>
>> To generate a random number:
>> * If get_random_int_hash is all-zero, seed it with fresh a half-sized
>>   SipHash key and the appropriate XOR constants.
>> * Generate three words of random_get_entropy(), jiffies, and current->pid.
>>   (This is arbitrary seed material, copied from the current code.)
>> * Crank through that with (Half)SipHash-1-0.
>> * Crank through the random_int_secret with (Half)SipHash-1-0.
>> * Return v1 ^ v3.
>
> Just to clarify, if we replace SipHash with a black box, I think this
> effectively means, where "entropy" is random_get_entropy() || jiffies
> || current->pid:
>
> The first call returns H(random seed || entropy_0 || secret).  The
> second call returns H(random seed || entropy_0 || secret || entropy_1
> || secret).  Etc.

Having slept on this, I like it less.  The problem is that a
backtracking attacker doesn't just learn H(random seed || entropy_0 ||
secret || ...) -- they learn the internal state of the hash function
that generates that value.  This probably breaks any attempt to apply
security properties of the hash function.  For example, the internal
state could easily contain a whole bunch of prior outputs in it
verbatim.

--Andy


BPF hash algo (Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5)

2016-12-22 Thread Andy Lutomirski
On Thu, Dec 22, 2016 at 7:51 AM, Hannes Frederic Sowa
 wrote:
> On Thu, 2016-12-22 at 16:41 +0100, Jason A. Donenfeld wrote:
>> Hi Hannes,
>>
>> On Thu, Dec 22, 2016 at 4:33 PM, Hannes Frederic Sowa
>>  wrote:
>> > IPv6 you cannot touch anymore. The hashing algorithm is part of uAPI.
>> > You don't want to give people new IPv6 addresses with the same stable
>> > secret (across reboots) after a kernel upgrade. Maybe they lose
>> > connectivity then and it is extra work?
>>
>> Ahh, too bad. So it goes.
>
> If no other users survive we can put it into the ipv6 module.
>
>> > The bpf hash stuff can be changed during this merge window, as it is
>> > not yet in a released kernel. Albeit I would probably have preferred
>> > something like sha256 here, which can be easily replicated by user
>> > space tools (minus the problem of patching out references to not
>> > hashable data, which must be zeroed).
>>
>> Oh, interesting, so time is of the essence then. Do you want to handle
>> changing the new eBPF code to something not-SHA1 before it's too late,
>> as part of a new patchset that can fast track itself to David? And
>> then I can preserve my large series for the next merge window.
>
> This algorithm should be a non-seeded algorithm, because the hashes
> should be stable and verifiable by user space tooling. Thus this would
> need a hashing algorithm that is hardened against pre-image
> attacks/collision resistance, which siphash is not. I would prefer some
> higher order SHA algorithm for that actually.
>

You mean:

commit 7bd509e311f408f7a5132fcdde2069af65fa05ae
Author: Daniel Borkmann 
Date:   Sun Dec 4 23:19:41 2016 +0100

bpf: add prog_digest and expose it via fdinfo/netlink


Yes, please!  This actually matters for security -- imagine a
malicious program brute-forcing a collision so that it gets loaded
wrong.  And this is IMO a use case for SHA-256 or SHA-512/256
(preferably the latter).  Speed basically doesn't matter here and
Blake2 is both less stable (didn't they slightly change it recently?)
and much less well studied.

My inclination would have been to seed them with something that isn't
exposed to userspace for the precise reason that it would prevent user
code from making assumptions about what's in the hash.  But if there's
a use case for why user code needs to be able to calculate the hash on
its own, then that's fine.  But perhaps the actual fdinfo string
should be "sha256:abcd1234..." to give some flexibility down the road.

Also:

+   result = (__force __be32 *)fp->digest;
+   for (i = 0; i < SHA_DIGEST_WORDS; i++)
+   result[i] = cpu_to_be32(fp->digest[i]);

Everyone, please, please, please don't open-code crypto primitives.
Is this and the code above it even correct?  It might be but on a very
brief glance it looks wrong to me.  If you're doing this to avoid
depending on crypto, then fix crypto so you can pull in the algorithm
without pulling in the whole crypto core.

At the very least, there should be a separate function that calculates
the hash of a buffer and that function should explicitly run itself
against test vectors of various lengths to make sure that it's
calculating what it claims to be calculating.  And it doesn't look
like the userspace code has landed, so, if this thing isn't
calculating SHA1 correctly, it's plausible that no one has noticed.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 9:01 PM, George Spelvin
 wrote:
> Andy Lutomirski wrote:
>> I don't even think it needs that.  This is just adding a
>> non-destructive final operation, right?
>
> It is, but the problem is that SipHash is intended for *small* inputs,
> so the standard implementations aren't broken into init/update/final
> functions.
>
> There's just one big function that keeps the state variables in
> registers and never stores them anywhere.
>
> If we *had* init/update/final functions, then it would be trivial.
>
>> Just to clarify, if we replace SipHash with a black box, I think this
>> effectively means, where "entropy" is random_get_entropy() || jiffies
>> || current->pid:
>
>> The first call returns H(random seed || entropy_0 || secret).  The
>> second call returns H(random seed || entropy_0 || secret || entropy_1
>> || secret).  Etc.
>
> Basically, yes.  I was skipping the padding byte and keying the
> finalization rounds on the grounds of "can't hurt and might help",
> but we could do it a more standard way.
>
>> If not, then I have a fairly strong preference to keep whatever
>> construction we come up with consistent with something that could
>> actually happen with invocations of unmodified SipHash -- then all the
>> security analysis on SipHash goes through.
>
> Okay.  I don't think it makes a difference, but it's not a *big* waste
> of time.  If we have finalization rounds, we can reduce the secret
> to 128 bits.
>
> If we include the padding byte, we can do one of two things:
> 1) Make the secret 184 bits, to fill up the final partial word as
>    much as possible, or
> 2) Make the entropy 1 byte smaller and conceptually misalign the
>    secret.  What we'd actually do is remove the last byte of
>    the secret and include it in the entropy words, but that's
>    just a rotation of the secret between storage and hashing.
>
> Also, I assume you'd like SipHash-2-4, since you want to rely
> on a security analysis.

I haven't looked, but I assume that the analysis at least thought
about reduced rounds, so maybe other variants are okay.

>> The one thing I don't like is
>> that I don't see how to prove that you can't run it backwards if you
>> manage to acquire a memory dump.  In fact, I think that there exist, at
>> least in theory, hash functions that are secure in the random oracle
>> model but that *can* be run backwards given the full state.  From
>> memory, SHA-3 has exactly that property, and it would be a bit sad for
>> a CSPRNG to be reversible.
>
> Er...  get_random_int() is specifically *not* designed to be resistant
> to state capture, and I didn't try.  Remember, what it's used for
> is ASLR, what we're worried about is someone learning the layouts
> of still-running processes, and if you get a memory dump, you have
> the memory layout!

True, but it's called get_random_int(), and it seems like making it
stronger, especially if the performance cost is low to zero, is a good
thing.

>
> If you want anti-backtracking, though, it's easy to add.  What we
> hash is:
>
> entropy_0 || secret || output_0 || entropy_1 || secret || output_1 || ...
>
> You mix the output word right back in to the (unfinalized) state after
> generating it.  This is still equivalent to unmodified back-box SipHash,
> you're just using a (conceptually independent) SipHash invocation to
> produce some of its input.

Ah, cute.  This could probably be sped up by doing something like:

entropy_0 || secret || output_0 ^ entropy_1 || secret || ...

It's a little weak because the output is only 64 bits, so you could
plausibly backtrack it on a GPU or FPGA cluster or on an ASIC if the
old entropy is guessable.  I suspect there are sneaky ways around it
like using output_n-1 ^ output_n-2 or similar.  I'll sleep on it.
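Spelled out, your feed-the-output-back-in variant would look roughly like
the sketch below.  It assumes init/update/final-style siphash helpers
which, as you said, the standard one-shot implementations don't provide,
so read it as pseudocode:

/* Hypothetical streaming interface -- not the current one-shot siphash API. */
struct siphash_state;
void siphash_update(struct siphash_state *st, u64 word);
u64 siphash_peek(const struct siphash_state *st);  /* non-destructive "final" */

static DEFINE_PER_CPU(struct siphash_state, gri_state); /* keyed at first use */
static u64 gri_secret[2];                               /* 128-bit secret */

static u64 get_random_word_sketch(void)
{
	struct siphash_state *st = &get_cpu_var(gri_state);
	u64 out;

	/* entropy_n */
	siphash_update(st, random_get_entropy());
	siphash_update(st, jiffies);
	siphash_update(st, current->pid);
	/* secret */
	siphash_update(st, gri_secret[0]);
	siphash_update(st, gri_secret[1]);
	/* output_n, then fold it back into the state for anti-backtracking */
	out = siphash_peek(st);
	siphash_update(st, out);

	put_cpu_var(gri_state);
	return out;
}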

>
> The only remaining issues are:
> 1) How many rounds, and
> 2) May we use HalfSipHash?

I haven't looked closely enough to have a real opinion here.  I don't
know what the security margin is believed to be.

>
> I'd *like* to persuade you that skipping the padding byte wouldn't
> invalidate any security proofs, because it's true and would simplify
> the code.  But if you want 100% stock, I'm willing to cater to that.

I lean toward stock in the absence of a particularly good reason.  At
the very least I'd want to read that paper carefully.

>
> Ted, what do you think?



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 6:07 PM, Hannes Frederic Sowa
 wrote:
> On 22.12.2016 00:42, Andy Lutomirski wrote:
>> On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld  wrote:
>>>  unsigned int get_random_int(void)
>>>  {
>>> -   __u32 *hash;
>>> -   unsigned int ret;
>>> -
>>> -   if (arch_get_random_int(&ret))
>>> -   return ret;
>>> -
>>> -   hash = get_cpu_var(get_random_int_hash);
>>> -
>>> -   hash[0] += current->pid + jiffies + random_get_entropy();
>>> -   md5_transform(hash, random_int_secret);
>>> -   ret = hash[0];
>>> -   put_cpu_var(get_random_int_hash);
>>> -
>>> -   return ret;
>>> +   unsigned int arch_result;
>>> +   u64 result;
>>> +   struct random_int_secret *secret;
>>> +
>>> +   if (arch_get_random_int(&arch_result))
>>> +   return arch_result;
>>> +
>>> +   secret = get_random_int_secret();
>>> +   result = siphash_3u64(secret->chaining, jiffies,
>>> + (u64)random_get_entropy() + current->pid,
>>> + secret->secret);
>>> +   secret->chaining += result;
>>> +   put_cpu_var(secret);
>>> +   return result;
>>>  }
>>>  EXPORT_SYMBOL(get_random_int);
>>
>> Hmm.  I haven't tried to prove anything for real.  But here goes (in
>> the random oracle model):
>>
>> Suppose I'm an attacker and I don't know the secret or the chaining
>> value.  Then, regardless of what the entropy is, I can't predict the
>> numbers.
>>
>> Now suppose I do know the secret and the chaining value due to some
>> leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
>> to find a value "result" such that prev_chaining + result = chaining
>> and result = H(prev_chaining, ..., secret);.  I don't think this can
>> be done efficiently in the random oracle model regardless of what the
>> "..." is.
>>
>> But, if I know the secret and chaining value, I can predict the next
>> output assuming I can guess the entropy.  What's worse is that, even
>> if I can't guess the entropy, if I *observe* the next output then I
>> can calculate the next chaining value.
>>
>> So this is probably good enough, and making it better is hard.  Changing it 
>> to:
>>
>> u64 entropy = (u64)random_get_entropy() + current->pid;
>> result = siphash(..., entropy, ...);
>> secret->chaining += result + entropy;
>>
>> would reduce this problem by forcing an attacker to brute-force the
>> entropy on each iteration, which is probably an improvement.
>>
>> To fully fix it, something like "catastrophic reseeding" would be
>> needed, but that's hard to get right.
>
> I wonder if Ted's proposal was analyzed further in terms of performance,
> in case get_random_int() should provide CPRNG-like properties?
>
> For reference: https://lkml.org/lkml/2016/12/14/351
>
> The proposal made sense to me and would completely solve the
> above-mentioned problem at the cost of repeatedly reseeding from the crng.
>

Unless I've misunderstood it, Ted's proposal causes get_random_int()
to return bytes straight from urandom (effectively), which should make
it very strong.  And if urandom is competitively fast now, I don't see
the problem.  ChaCha20 is designed for speed, after all.


George's crazy full state idea (Re: HalfSipHash Acceptable Usage)

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 5:13 PM, George Spelvin
 wrote:
> As a separate message, to disentangle the threads, I'd like to
> talk about get_random_long().
>
> After some thinking, I still like the "state-preserving" construct
> that's equivalent to the current MD5 code.  Yes, we could just do
> siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
> preserve a bit more.
>
> It requires library support from the SipHash code to return the full
> SipHash state, but I hope that's a fair thing to ask for.

I don't even think it needs that.  This is just adding a
non-destructive final operation, right?

>
> Here's my current straw man design for comment.  It's very similar to
> the current MD5-based design, but feeds all the seed material in the
> "correct" way, as opposed to Xring directly into the MD5 state.
>
> * Each CPU has a (Half)SipHash state vector,
>   "unsigned long get_random_int_hash[4]".  Unlike the current
>   MD5 code, we take care to initialize it to an asymmetric state.
>
> * There's a global 256-bit random_int_secret (which we could
>   reseed periodically).
>
> To generate a random number:
> * If get_random_int_hash is all-zero, seed it with fresh a half-sized
>   SipHash key and the appropriate XOR constants.
> * Generate three words of random_get_entropy(), jiffies, and current->pid.
> (This is arbitrary seed material, copied from the current code.)
> * Crank through that with (Half)SipHash-1-0.
> * Crank through the random_int_secret with (Half)SipHash-1-0.
> * Return v1 ^ v3.

Just to clarify, if we replace SipHash with a black box, I think this
effectively means, where "entropy" is random_get_entropy() || jiffies
|| current->pid:

The first call returns H(random seed || entropy_0 || secret).  The
second call returns H(random seed || entropy_0 || secret || entropy_1
|| secret).  Etc.

If not, then I have a fairly strong preference to keep whatever
construction we come up with consistent with something that could
actually happen with invocations of unmodified SipHash -- then all the
security analysis on SipHash goes through.

Anyway, I have mixed thoughts about the construction.  It manages to
have a wide state at essentially no cost, which buys us quite a bit of
work factor to break it.  Even with full knowledge of the state, an
output doesn't reveal the entropy except to the extent that it can be
brute-forced (this is just whatever the appropriate extended version of
first preimage resistance gives us).  The one thing I don't like is
that I don't see how to prove that you can't run it backwards if you
manage to acquire a memory dump.  In fact, I that that there exist, at
least in theory, hash functions that are secure in the random oracle
model but that *can* be run backwards given the full state.  From
memory, SHA-3 has exactly that property, and it would be a bit sad for
a CSPRNG to be reversible.

We could also periodically mix in a big (128-bit?) chunk of fresh
urandom output to keep the bad guys guessing.

(P.S.  This kind of resembles the duplex sponge construction.  If
hardware SHA-3 ever shows up, a duplex sponge RNG might be nice indeed.)


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds
 wrote:
> On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
>  wrote:
>>
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.
>
> And I warn you already: it will _benchmark_ a hell of a lot better
> than it will work in reality. In benchmarks, you'll hit all the
> optimizations ("oh, I've already saved away all the FP registers, no
> need to do it again").
>
> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Hah, you're thinking that the x86 code works the way that Rik and I
want it to work, and you just made my day. :)  What actually happens
is that the state is saved in kernel_fpu_begin() and restored in
kernel_fpu_end(), and it'll take a few hundred cycles best case.  If
you do it a bunch of times in a loop, you *might* trigger a CPU
optimization that notices that the state being saved is the same state
that was just restored, but you're still going to pay for the full
restore path on each round trip no matter what.

The code is much clearer in 4.10 kernels now that I deleted the unused
"lazy" branches.

>
> Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
> unit and keep it powered up for a while afterwards", while in real
> life you would quite easily hit the "oh, AVX is powered down because
> we were idle, now it powers up at half speed which is another latency
> hit _and_ the AVX unit won't run full out anyway".

I *think* that was mostly fixed in Broadwell or thereabouts (in terms
of latency -- throughput and power consumption still suffer).


Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Andy Lutomirski
On Wed, Dec 21, 2016 at 3:02 PM, Jason A. Donenfeld  wrote:
>  unsigned int get_random_int(void)
>  {
> -   __u32 *hash;
> -   unsigned int ret;
> -
> -   if (arch_get_random_int(&ret))
> -   return ret;
> -
> -   hash = get_cpu_var(get_random_int_hash);
> -
> -   hash[0] += current->pid + jiffies + random_get_entropy();
> -   md5_transform(hash, random_int_secret);
> -   ret = hash[0];
> -   put_cpu_var(get_random_int_hash);
> -
> -   return ret;
> +   unsigned int arch_result;
> +   u64 result;
> +   struct random_int_secret *secret;
> +
> +   if (arch_get_random_int(&arch_result))
> +   return arch_result;
> +
> +   secret = get_random_int_secret();
> +   result = siphash_3u64(secret->chaining, jiffies,
> + (u64)random_get_entropy() + current->pid,
> + secret->secret);
> +   secret->chaining += result;
> +   put_cpu_var(secret);
> +   return result;
>  }
>  EXPORT_SYMBOL(get_random_int);

Hmm.  I haven't tried to prove anything for real.  But here goes (in
the random oracle model):

Suppose I'm an attacker and I don't know the secret or the chaining
value.  Then, regardless of what the entropy is, I can't predict the
numbers.

Now suppose I do know the secret and the chaining value due to some
leak.  If I want to deduce prior outputs, I think I'm stuck: I'd need
to find a value "result" such that prev_chaining + result = chaining
and result = H(prev_chaining, ..., secret);.  I don't think this can
be done efficiently in the random oracle model regardless of what the
"..." is.

But, if I know the secret and chaining value, I can predict the next
output assuming I can guess the entropy.  What's worse is that, even
if I can't guess the entropy, if I *observe* the next output then I
can calculate the next chaining value.

So this is probably good enough, and making it better is hard.  Changing it to:

u64 entropy = (u64)random_get_entropy() + current->pid;
result = siphash(..., entropy, ...);
secret->chaining += result + entropy;

would reduce this problem by forcing an attacker to brute-force the
entropy on each iteration, which is probably an improvement.
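For concreteness, plugging that into the siphash_3u64() call from the
patch above, the tweak is just (same three lines, sketched out):

	u64 entropy = (u64)random_get_entropy() + current->pid;

	result = siphash_3u64(secret->chaining, jiffies, entropy,
			      secret->secret);
	/* mix the entropy into the chain too, so the next output can't be
	   predicted without guessing it, even given the full state */
	secret->chaining += result + entropy;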

To fully fix it, something like "catastrophic reseeding" would be
needed, but that's hard to get right.

(An aside: on x86 at least, using two percpu variables is faster
because direct percpu access is essentially free, whereas getting
the address of a percpu variable is not free.)

