Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls

2017-12-02 Thread Peter Zijlstra
On Sat, Dec 02, 2017 at 11:15:14AM +0000, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:11, Ard Biesheuvel  wrote:

> > They consume the entire input in a single go, yes. But making it more
> > granular than that is going to hurt performance, unless we introduce
> > some kind of kernel_neon_yield(), which does an end+begin but only if
> > the task is being scheduled out.
> >
> > For example, the SHA256 keeps 256 bytes of round constants in NEON
> > registers, and reloading those from memory for each 64 byte block of
> > input is going to be noticeable. The same applies to the AES code
> > (although the numbers are slightly different)
> 
> Something like below should do the trick I think (apologies for the
> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
> few NEON registers are live, and preserve/restore the live registers
> across calls to kernel_neon_end + kernel_neon_begin. Would that work
> for RT?

Probably yes. The important point is that preempt latencies (and thus by
extension NEON regions) are bounded and preferably small.

Unbounded stuff (like anything that depends on the amount of data fed
in) is a complete no-no for RT, since then you cannot make predictions
on how long things will take.


Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls

2017-12-02 Thread Peter Zijlstra
On Sat, Dec 02, 2017 at 09:11:46AM +0000, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
> >> Note that the remaining crypto drivers simply operate on fixed buffers, so
> >> while the RT crowd may still feel the need to disable those (and the ones
> >> below as well, perhaps), they don't call back into the crypto layer like
> >> the ones updated by this series, and so there's no room for improvement
> >> there AFAICT.
> >
> > Do these other drivers process all the blocks fed to them in one go
> > under a single NEON section, or do they do a single fixed block per
> > NEON invocation?
> 
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does an end+begin but only if
> the task is being scheduled out.

A little something like this:

https://lkml.kernel.org/r/20171201113235.6tmkwtov5cg2l...@hirez.programming.kicks-ass.net

> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)

Quite. We could augment the above function with a return value that says
whether we actually did an end/begin and registers were clobbered.
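
A minimal sketch of the helper being discussed, combining the proposed
kernel_neon_yield() with the return value suggested above (illustrative
only, assuming the arm64 kernel_neon_begin/end API; not code from the
series):

#include <linux/types.h>
#include <linux/thread_info.h>
#include <asm/neon.h>

/*
 * Drop and re-take the NEON context only when a reschedule is
 * pending, and report whether the NEON registers were clobbered so
 * the caller knows to reload cached state (e.g. the SHA-256 round
 * constants mentioned above).
 */
static inline bool kernel_neon_yield(void)
{
	if (!test_thread_flag(TIF_NEED_RESCHED))
		return false;		/* registers still live */

	kernel_neon_end();		/* preemption point for RT */
	kernel_neon_begin();
	return true;			/* caller must reload state */
}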


Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls

2017-12-02 Thread Peter Zijlstra
On Fri, Dec 01, 2017 at 09:19:22PM +0000, Ard Biesheuvel wrote:
> Note that the remaining crypto drivers simply operate on fixed buffers, so
> while the RT crowd may still feel the need to disable those (and the ones
> below as well, perhaps), they don't call back into the crypto layer like
> the ones updated by this series, and so there's no room for improvement
> there AFAICT.

Do these other drivers process all the blocks fed to them in one go
under a single NEON section, or do they do a single fixed block per
NEON invocation?


Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-24 Thread Peter Zijlstra
On Fri, Mar 24, 2017 at 02:50:24PM +0100, Dmitry Vyukov wrote:
> OK, I guess I should not have referenced the llvm-linux page.
> So here are reasons on our side that I am ready to vouch for:
> 
>  - clang makes it possible to implement KMSAN (dynamic detection of
> uses of uninit memory)

How does GCC make this impossible?

>  - better code coverage for fuzzing

How so? Why can't the same be achieved using GCC?

>  - simpler and faster development (e.g. we can port our user-space
> hardening technologies -- CFI and SafeStack)

That's just because you've already implemented this in clang, right? So
less work for you. Not because it's impossible.


Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-24 Thread Peter Zijlstra
On Fri, Mar 24, 2017 at 02:47:15PM +0100, Dmitry Vyukov wrote:
> > Seriously, you should have taken the hack the first time that this
> > needs to be fixed.  Just because this is a fairly uncommon construct
> > in the kernel doesn't mean it is not in userspace.
> 
> There is a reason why it is fairly uncommon in the kernel.

So first off: it's not entirely clear that the code as it exists is
correct. From a cursory reading of it and surrounding code, there is no
actual upper limit on the variable. If I were stupid enough to make a
raid with 64 devices I'd get a huge on-stack structure.

Since you're touching it; you should check these things.

And secondly, refactor the code to not look like dog vomit. You can do
more than the absolute minimal patch to make it compile, I'm sure.


Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-17 Thread Peter Zijlstra
On Fri, Mar 17, 2017 at 08:05:16PM +0100, Dmitry Vyukov wrote:
> You can also find some reasons in the Why section of LLVM-Linux project:
> http://llvm.linuxfoundation.org/index.php/Main_Page

From that:

 - LLVM/Clang is a fast moving project with many things fixed quickly
   and features added.

So what's the deal with that 5 year old bug you want us to work around?

Also, clang doesn't support asm cc flags output and a few other
extensions last time I checked.



Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-17 Thread Peter Zijlstra
On Fri, Mar 17, 2017 at 08:26:42PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 17, 2017 at 08:05:16PM +0100, Dmitry Vyukov wrote:
> > You can also find some reasons in the Why section of LLVM-Linux project:
> > http://llvm.linuxfoundation.org/index.php/Main_Page
> 
> From that:
> 
>  - LLVM/Clang is a fast moving project with many things fixed quickly
>    and features added.
> 
> So what's the deal with that 5 year old bug you want us to work around?
> 
> Also, clang doesn't support asm cc flags output and a few other
> extensions last time I checked.
> 

Another great one:

 - BSD License (some people prefer this license to the GPL)

Seems a very weak argument to make when talking about the Linux Kernel
which is very explicitly GPLv2 (and not later).


Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-17 Thread Peter Zijlstra
On Fri, Mar 17, 2017 at 11:52:01AM -0700, Michael Davidson wrote:
> On Fri, Mar 17, 2017 at 5:44 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > Be that as it may; what you construct above is disgusting. Surely the
> > code can be refactored to not look like dog vomit?
> >
> > Also; it's not immediately obvious conf->copies is 'small' and this
> > doesn't blow up the stack; I feel that deserves a comment somewhere.
> >
> 
> I agree that the code is horrible.
> 
> It is, in fact, exactly the same solution that was used to remove
> variable length arrays in structs from several of the crypto drivers a
> few years ago - see the definition of SHASH_DESC_ON_STACK() in
> "crypto/hash.h" - I did not, however, hide the horrors in a macro
> preferring to leave the implementation visible as a warning to whoever
> might touch the code next.
> 
> I believe that the actual stack usage is exactly the same as it was 
> previously.
> 
> I can certainly wrap this up in a macro and add comments with
> appropriately dire warnings in it if you feel that is both necessary
> and sufficient.

We got away with ugly in the past, so we should get to do it again?
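
For reference, the crypto/hash.h precedent cited above hides the same
trick behind a macro; a hypothetical raid10 analogue of it (the name
R10BIO_ON_STACK is invented here purely to illustrate what "wrap this
up in a macro" could look like) might read:

/*
 * Modelled on SHASH_DESC_ON_STACK from crypto/hash.h: a char array
 * sized for the header plus 'copies' trailing elements, aligned for
 * struct r10bio, with the cast hidden in one commented place.
 */
#define R10BIO_ON_STACK(name, copies)					\
	char __##name##_storage[sizeof(struct r10bio) +			\
				(copies) * sizeof(struct r10dev)]	\
		__aligned(__alignof__(struct r10bio));			\
	struct r10bio *name = (struct r10bio *)__##name##_storage

/* usage: R10BIO_ON_STACK(r10b, conf->copies); */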


Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-17 Thread Peter Zijlstra
On Fri, Mar 17, 2017 at 01:31:23PM +0100, Alexander Potapenko wrote:
> On Fri, Mar 17, 2017 at 1:08 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Thu, Mar 16, 2017 at 05:15:19PM -0700, Michael Davidson wrote:
> >> Replace a variable length array in a struct by allocating
> >> the memory for the entire struct in a char array on the stack.
> >>
> >> Signed-off-by: Michael Davidson <m...@google.com>
> >> ---
> >>  drivers/md/raid10.c | 9 ++++-----
> >>  1 file changed, 4 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 063c43d83b72..158ebdff782c 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -4654,11 +4654,10 @@ static int handle_reshape_read_error(struct mddev *mddev,
> >>   /* Use sync reads to get the blocks from somewhere else */
> >>   int sectors = r10_bio->sectors;
> >>   struct r10conf *conf = mddev->private;
> >> - struct {
> >> - struct r10bio r10_bio;
> >> - struct r10dev devs[conf->copies];
> >> - } on_stack;
> >> - struct r10bio *r10b = &on_stack.r10_bio;
> >> + char on_stack_r10_bio[sizeof(struct r10bio) +
> >> +   conf->copies * sizeof(struct r10dev)]
> >> +   __aligned(__alignof__(struct r10bio));
> >> + struct r10bio *r10b = (struct r10bio *)on_stack_r10_bio;
> >>   int slot = 0;
> >>   int idx = 0;
> >>   struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
> >
> >
> > That's disgusting. Why not fix LLVM to support this?
> 
> IIUC there's only a handful of VLAIS instances in LLVM code, why not
> just drop them for the sake of better code portability?
> (To quote Linus, "this feature is an abomination":
> https://lkml.org/lkml/2013/9/23/500)

Be that as it may; what you construct above is disgusting. Surely the
code can be refactored to not look like dog vomit?

Also; it's not immediately obvious conf->copies is 'small' and this
doesn't blow up the stack; I feel that deserves a comment somewhere.




Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array

2017-03-17 Thread Peter Zijlstra
On Thu, Mar 16, 2017 at 05:15:19PM -0700, Michael Davidson wrote:
> Replace a variable length array in a struct by allocating
> the memory for the entire struct in a char array on the stack.
> 
> Signed-off-by: Michael Davidson 
> ---
>  drivers/md/raid10.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 063c43d83b72..158ebdff782c 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4654,11 +4654,10 @@ static int handle_reshape_read_error(struct mddev *mddev,
>   /* Use sync reads to get the blocks from somewhere else */
>   int sectors = r10_bio->sectors;
>   struct r10conf *conf = mddev->private;
> - struct {
> - struct r10bio r10_bio;
> - struct r10dev devs[conf->copies];
> - } on_stack;
> - struct r10bio *r10b = &on_stack.r10_bio;
> + char on_stack_r10_bio[sizeof(struct r10bio) +
> +   conf->copies * sizeof(struct r10dev)]
> +   __aligned(__alignof__(struct r10bio));
> + struct r10bio *r10b = (struct r10bio *)on_stack_r10_bio;
>   int slot = 0;
>   int idx = 0;
>   struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;


That's disgusting. Why not fix LLVM to support this?


Re: [PATCH 5/7] x86, boot, LLVM: Use regparm=0 for memcpy and memset

2017-03-17 Thread Peter Zijlstra
On Thu, Mar 16, 2017 at 05:15:18PM -0700, Michael Davidson wrote:
> Use the standard regparm=0 calling convention for memcpy and
> memset when building with clang.
> 
> This is a work around for a long standing clang bug
> (see https://llvm.org/bugs/show_bug.cgi?id=3997) where
> clang always uses the standard regparm=0 calling convention
> for any implicit calls to memcpy and memset that it generates
> (eg for structure assignments and initialization) even if an
> alternate calling convention such as regparm=3 has been specified.

Seriously, fix LLVM already.
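
The shape of the workaround described above, sketched with an invented
BOOT_ABI macro (illustrative; see the actual patch for the real
mechanics):

#include <stddef.h>

/*
 * clang emits implicit memcpy/memset calls with the default
 * (regparm=0, stack-args) convention even under -mregparm=3, so
 * declare the boot-code implementations with regparm(0) under clang
 * to keep caller and callee conventions in sync.
 */
#ifdef __clang__
#define BOOT_ABI __attribute__((regparm(0)))
#else
#define BOOT_ABI
#endif

void *BOOT_ABI memcpy(void *dest, const void *src, size_t n);
void *BOOT_ABI memset(void *s, int c, size_t n);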


Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-15 Thread Peter Zijlstra
On Thu, Dec 15, 2016 at 09:43:04PM +0100, Jason A. Donenfeld wrote:
> On Thu, Dec 15, 2016 at 9:31 PM, Hannes Frederic Sowa
>  wrote:
> > ARM64 and x86-64 have memory operations that are not vector operations
> > that operate on 128 bit memory.
> 
> Fair enough. imull I guess.

imull is into rdx:rax, not memory. I suspect he's talking about
cmpxchg16b.
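
For illustration (assuming cmpxchg16b is indeed what was meant): a
128-bit x86-64 memory operation that never touches the vector unit,
reachable from C via the __sync builtins when built with -mcx16:

#include <stdbool.h>

/* Compiles to lock cmpxchg16b on x86-64 when built with -mcx16. */
static bool cas128(volatile unsigned __int128 *p,
		   unsigned __int128 old, unsigned __int128 new)
{
	return __sync_bool_compare_and_swap(p, old, new);
}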



Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-15 Thread Peter Zijlstra
On Thu, Dec 15, 2016 at 07:50:36PM +0100, Jason A. Donenfeld wrote:
> There's no 32-bit platform
> that will trap on a 64-bit unaligned access because there's no such
> thing as a 64-bit access there. In short, we're fine.

ARMv7 LPAE is a 32bit architecture that has 64bit load/stores IIRC.

x86 has cmpxchg8b that can do 64bit things and very much wants the u64
aligned.

Also, IIRC we have a few platforms where u64 doesn't carry 8 byte
alignment, m68k or something like that, but yes, you likely don't care.

Just to make life interesting...
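
The kernel's usual defence against exactly these cases is the
unaligned-access helpers; a sketch of the defensive form (the siphash
series did end up accounting for unaligned input in this spirit,
though not necessarily with this exact helper):

#include <linux/types.h>
#include <asm/unaligned.h>

/* Safe on ARMv7 LPAE, m68k, and anything else with odd u64 rules. */
static inline u64 load_le64(const void *p)
{
	return get_unaligned_le64(p);
}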


Re: single_task_running() vs. preemption warnings (was Re: [PATCH] kvm: fix preemption warnings in kvm_vcpu_block)

2015-09-18 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 01:32:55PM -0700, Tim Chen wrote:
> I have no objection to changing single_task_running to use
> raw_smp_processor_id.  The worker in mcryptd is bound to
> the cpu so it has no migration/preemption issue.  So it shouldn't care
> which smp_processor_id version is being used.  Yes, please add a comment
> to alert the user of this caveat should you change single_task_running.


We actually have raw_rq() for that, and the whole if thing looks rather
superfluous. So, something like the below, except with a suitable comment
added, and tested, etc. ;-)

---
 kernel/sched/core.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..f39c0498e284 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2666,10 +2666,7 @@ unsigned long nr_running(void)
  */
 bool single_task_running(void)
 {
-   if (cpu_rq(smp_processor_id())->nr_running == 1)
-   return true;
-   else
-   return false;
+   return raw_rq()->nr_running == 1;
 }
 EXPORT_SYMBOL(single_task_running);
 


Re: randconfig build error with next-20150529, in crypto/jitterentropy.c

2015-06-08 Thread Peter Zijlstra
Adding Stephan to Cc.

On Mon, 2015-06-08 at 20:25 +0800, Herbert Xu wrote:
> On Mon, Jun 08, 2015 at 03:21:09PM +0300, Andy Shevchenko wrote:
> > On Fri, May 29, 2015 at 10:14 PM, Jim Davis <jim.ep...@gmail.com> wrote:
> > > Building with the attached random configuration file,
> > 
> > Hit the very same error against next-20150605.
> > 
> > The issue with that file is that we have no compiler optimization
> > enabled. So, guys, how would you recommend we fix it?
> 
> Stephan, can we get rid of the no optimisation requirement?

Would something like

#pragma GCC push_options
#pragma GCC optimize ("O0")
static __u64 jent_fold_time(struct rand_data *ec, __u64 time,
__u64 *folded, __u64 loop_cnt)
{
...
}
#pragma GCC pop_options

Be an option to allow the file to be compiled with regular optimizations
enabled?
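
An equivalent per-function form, if pragmas around the definition are
unwanted (GCC's optimize attribute; same caveat that either variant
only works around the no-optimization requirement rather than removing
it):

static __u64 __attribute__((optimize("O0")))
jent_fold_time(struct rand_data *ec, __u64 time,
	       __u64 *folded, __u64 loop_cnt)
{
...
}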




Re: [PATCH v5 2/7] crypto: SHA1 multibuffer crypto hash infrastructure

2014-07-30 Thread Peter Zijlstra
On Tue, Jul 29, 2014 at 03:28:25PM -0700, Tim Chen wrote:
> Peter,
> 
> Is my explanation adequate or do you still have objections to the
> implementation?  I'm trying to decide here whether to extend our batch
> processing by the crypto daemon (prolong our busy period)
> based on whether there are other tasks to run.  Pre-emption and
> resuming the crypto daemon is okay. Even if there's a pre-emption and
> context switch in between, we can still do extended processing if
> there's nothing else to run. Otherwise the cpu will go idle.

I suppose.. I still don't really like it but what the heck.

That single_task_running thing is hard to use and a weird interface, so
let's hope people don't find more 'creative' uses for it.






Re: [PATCH v5 2/7] crypto: SHA1 multibuffer crypto hash infrastructure

2014-07-25 Thread Peter Zijlstra
On Tue, Jul 22, 2014 at 03:09:32PM -0700, Tim Chen wrote:
> +/* Called in workqueue context, do one real cryption work (via
> + * req->complete) and reschedule itself if there are more work to
> + * do. */

You seem to manage the 'normal' comment style in other places, this one
'special' for a reason?

> +static void mcryptd_queue_worker(struct work_struct *work)
> +{
> +	struct mcryptd_cpu_queue *cpu_queue;
> +	struct crypto_async_request *req, *backlog;
> +	int i;
> +
> +	/*
> +	 * Need to loop through more than once for multi-buffer to
> +	 * be effective.
> +	 */
> +
> +	cpu_queue = container_of(work, struct mcryptd_cpu_queue, work);
> +	i = 0;
> +	while (i < MCRYPTD_BATCH || single_task_running()) {
> +		/*
> +		 * preempt_disable/enable is used to prevent
> +		 * being preempted by mcryptd_enqueue_request()
> +		 */
> +		local_bh_disable();
> +		preempt_disable();
> +		backlog = crypto_get_backlog(&cpu_queue->queue);
> +		req = crypto_dequeue_request(&cpu_queue->queue);
> +		preempt_enable();
> +		local_bh_enable();
> +
> +		if (!req)
> +			return;
> +
> +		if (backlog)
> +			backlog->complete(backlog, -EINPROGRESS);
> +		req->complete(req, 0);
> +		if (!cpu_queue->queue.qlen)
> +			return;
> +		++i;
> +	}
> +	if (cpu_queue->queue.qlen)
> +		queue_work(kcrypto_wq, &cpu_queue->work);
> +}

Right, so I don't think you need that single_task_running() thing for
this. Also it's a rather inconclusive test: a task can come in while
processing the request, preempt the processing, complete, and by the time
we're back to check if we want to continue the loop it's only us again.

So it's actually misleading...

You could do that, but then you have to keep preemption disabled across
the entire thing and break out on need_resched(). That would entirely
obviate the need for single_task_running(), but would increase the
preemption latency by however long the req processing takes, so that's
bad too.

But you can get similar 'fuzzy' semantics as above by using something
like:

static inline bool did_context_switch(unsigned long *nivcs)
{
	/* nivcsw: the task's involuntary context switch count */
	if (*nivcs != current->nivcsw) {
		*nivcs = current->nivcsw;
		return true;
	}
	return false;
}

static void mcryptd_queue_worker()
{
	...
	unsigned long nivcs = current->nivcsw;

	...

	while (!did_context_switch(&nivcs) || i < MCRYPTD_BATCH)

	...
}

It'll terminate earlier I suppose; it's the inverse test. But I think you
want to terminate early if there's any activity.


Re: [PATCH v5 3/7] crypto: SHA1 multibuffer crypto opportunistic flush

2014-07-25 Thread Peter Zijlstra
On Tue, Jul 22, 2014 at 03:09:35PM -0700, Tim Chen wrote:
> The crypto daemon can take advantage of available cpu
> cycles to flush any unfinished jobs if it is the
> only task running on the cpu, and there are no more crypto
> jobs to process.

You conveniently forgot to mention energy efficiency, and the problem
with short idle durations. Although the latter issue seems overshadowed
by the not so very robust idle detection.

In any case, same as the other patch, you can get similarly inconclusive
semantics without single_task_running().

> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
> ---
>  crypto/mcryptd.c | 39 ++++++++++++++++++++++++++++++++++++---
>  1 file changed, 36 insertions(+), 3 deletions(-)

> +/*
> + * Called in workqueue context, do one real cryption work (via
>   * req->complete) and reschedule itself if there are more work to
> - * do. */
> + * do.
> + */

See, this should've been in the previous patch :-)


Re: [PATCH v5 2/7] crypto: SHA1 multibuffer crypto hash infrastructure

2014-07-25 Thread Peter Zijlstra
On Tue, Jul 22, 2014 at 03:09:32PM -0700, Tim Chen wrote:
> This patch introduces the multi-buffer crypto daemon which is responsible
> for submitting crypto jobs in a work queue to the responsible multi-buffer
> crypto algorithm.  The idea of the multi-buffer algorithm is to put
> data streams from multiple jobs in a wide (AVX2) register and then
> take advantage of SIMD instructions to do crypto computation on several
> buffers simultaneously.
> 
> The multi-buffer crypto daemon is also responsible for flushing the
> remaining buffers to complete the computation if no new buffers arrive
> for a while.

How is any of this SHA1 specific as the Subject says it is?
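
To make the quoted description concrete (an illustrative layout, not
the patchset's actual structures): with 256-bit AVX2 registers and
32-bit SHA-1 words, eight jobs ride side by side, one per lane:

#include <linux/types.h>

#define MB_LANES 8			/* 256-bit register / 32-bit words */

/*
 * One row holds state word i for all eight lanes, so a single AVX2
 * operation advances eight independent SHA-1 jobs by one round.
 */
struct sha1_mb_lanes {
	u32 h[5][MB_LANES];		/* h[i][j]: state word i of lane j */
	const u8 *block[MB_LANES];	/* per-lane input block stream */
};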


Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-15 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 12:50:50PM -0700, Tim Chen wrote:

> There is a generic multi-buffer infrastructure portion that manages
> pulling and queuing jobs on the crypto workqueue, and it is separated out
> in patch 1 of the patchset.

There's one very weird multi-line comment in that patch.

> The other portions are algorithm specific: they define the
> algorithm-specific data structures and do the crypto computation
> for a particular algorithm, mostly in
> assembly and C glue code.  The infrastructure code is
> meant to be reused for other similar
> multi-buffer algorithms.

The flushing part that uses the sched thing is sha1 specific, even
though it strikes me as not being so. Flushing buffers on idle seems
like a 'generic' thing.

> We use nr_running_cpu to check whether there are other tasks running on
> the *current* cpu (not another cpu),

And yet, the function allows you to do exactly that...

> to decide if we should flush
> and compute the crypto jobs accumulated.  If there's nobody else running,
> we can take advantage of available cpu cycles on the cpu we are running
> on to do computation on the existing jobs in a SIMD manner.
> Waiting a bit longer may accumulate more jobs to process in parallel
> in a single SIMD instruction, but will have more delay.

So you already have an idle notifier (which is x86 only, we should fix
that I suppose), and you then double check there really isn't anything
else running.

How much, if anything, does that second check buy you? There's just not
a single word on that.

Also, there is not a word on the latency vs throughput tradeoff you
make. I can imagine that for very short idle durations you lose, not
win, with this thing.

So for now I still see no reason for doing this.

Also, I wonder about SMT: the point of this is to make best use of the
SIMD pipelines, so does it still make sense to use siblings at the same
time even though you're running hand-crafted ASM to stuff the pipelines
to the brim? Should this thing be SMT aware and not gather queues for
both siblings?




Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-15 Thread Peter Zijlstra
On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> So you already have an idle notifier (which is x86 only, we should fix
> that I suppose), and you then double check there really isn't anything
> else running.

Note that we've already paid a large part of the expense of going idle
by the time we call that idle notifier -- specifically, we've
reprogrammed the clock to stop the tick.

It's really wasteful to then generate work again, which means we have to
again reprogram the clock etc.




Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-15 Thread Peter Zijlstra
On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
 
>  3.0.101-default   3.753363 usecs/loop -- avg 3.770737 530.4 KHz   1.000
>  3.14.10-default   4.145348 usecs/loop -- avg 4.139987 483.1 KHz    .910  1.000
>  3.15.4-default    4.355594 usecs/loop -- avg 4.351961 459.6 KHz    .866   .951  1.000
>  3.16.0-default    4.537279 usecs/loop -- avg 4.543532 440.2 KHz    .829   .911   .957
> 
>  3.0.101-smp       3.692377 usecs/loop -- avg 3.690774 541.9 KHz   1.000
>  3.14.10-smp       4.010009 usecs/loop -- avg 4.009019 498.9 KHz    .920
>  3.15.4-smp        3.882398 usecs/loop -- avg 3.884095 514.9 KHz    .950
>  3.16.0-master     4.061003 usecs/loop -- avg 4.058244 492.8 KHz    .909

Urgh,.. I need to go fix that :/




Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 09:10:14AM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> > > This function will help a thread decide if it wants to do work
> > > that can be delayed, to accumulate more tasks for more efficient
> > > batch processing later.
> > > 
> > > However, if no other tasks are running on the cpu, it can take
> > > advantage of the available cpu cycles to complete the tasks
> > > for immediate processing to minimize delay, otherwise it will yield.
> > 
> > Ugh.. and ignore topology and everything else.
> > 
> > Yet another scheduler on top of the scheduler.
> > 
> > We have the padata muck, also only ever used by crypto.
> > We have the workqueue nonsense, used all over the place.
> > And we have btrfs doing their own padata-like muck.
> > And I'm sure there's at least one more out there, just because.
> > 
> > Why do we want yet another thing?
> > 
> > I'm inclined to go NAK and get people to reduce the amount of async
> > queueing and processing crap.
> 
> The multi-buffer class of crypto algorithms is by nature
> asynchronous.  The algorithm gathers several crypto jobs, and
> puts the buffer from each job in a data lane of the SIMD register.
> This allows for parallel processing and increases throughput.
> The gathering of the crypto jobs is an async process and
> queuing is necessary for this class of algorithm.

How is that related to me saying we've got too much of this crap
already?




Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 10:05:34AM -0700, Tim Chen wrote:
 I was trying to explain why the algorithm is implemented this way
 because of its batching nature.
 
 There is a whole class of async algorithm that can provide
 substantial speedup by doing batch processing and uses workqueue.
 The multi-buffer sha1 version has 2.2x speedup over existing
 AVX2 version, and can have even more speedup when AVX3 
 comes round.  Workqueue is a natural way to implement 
 this.  I don't think a throughput speedup of 2.2x is crap.
 
 We are not inventing anything new, but ask for a 
 very simple helper function to know if there's something else
 running on our cpu to help us make a better decision 
 of whether we should flush the batched jobs immediately.
 
 And also asynchronous crypto interface is already used substantially
 in crypto and has a well established infrastructure.  

The crap I was talking about is that there's a metric ton of 'async'
interfaces, all different.

Your multi-buffer thing isn't generic either, it seems limited to sha1.
It does not reuse padata, it does not extend workqueues, it does not
remove the btrfs nonsense; it adds yet another thing.




Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 12:08:28PM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
> 
> > Your multi-buffer thing isn't generic either, it seems limited to sha1.
> 
> We actually have many other multi-buffer crypto algorithms already
> published for encryption and other IPSec usages.  So
> multi-buffer algorithms are not just limited to SHA1.
> We hope to port those to the kernel crypto library eventually.
> http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
  

That's all nice and such; but the code as I've seen in these patches is
very much sha1 specific. The mb part isn't separated out.

> > It does not reuse padata,
> 
> padata tries to speed things up by parallelizing jobs across *multiple*
> cpus, whereas multi-buffer tries to speed things up by using
> multiple data lanes of a SIMD register on a *single* cpu.
> These two usages are complementary but not the same.

And if it's single cpu, wth do you need that nr_running thing for another
cpu for?

Also, this difference wasn't clear to me.

I still loathe all the async work, because it makes a mockery of
accounting etc.. but that's a story for another day I suppose :-(






Re: [PATCH] random: add blocking facility to urandom

2011-09-13 Thread Peter Zijlstra
On Mon, 2011-09-12 at 09:56 -0400, Jarod Wilson wrote:
> Thomas Gleixner wrote:
> 
> > Well, there is enough proof out there that the hardware you're using
> > is a perfect random number generator by itself.
> > 
> > So stop complaining about not having access to TPM chips if you can
> > create an entropy source just by (ab)using the inherent randomness of
> > modern CPU architectures to refill your entropy pool on the fly when
> > the need arises w/o imposing completely unintuitive thresholds and
> > user visible API changes.
> 
> We started out going down that path:
> 
> http://www.mail-archive.com/linux-crypto@vger.kernel.org/msg05778.html
> 
> We hit a bit of a roadblock with it though.

Have you guys seen this work:

  http://lwn.net/images/conf/rtlws11/random-hardware.pdf




Re: [PATCH 1/2] mm: convert k{un}map_atomic(p, KM_type) to k{un}map_atomic(p)

2011-08-29 Thread Peter Zijlstra
On Fri, 2011-08-26 at 12:42 -0700, Andrew Morton wrote:
> Perhaps you could dust off your old patch and we'll bring it up to date?

most of it would be doing what mlin just did, so I took his patch and
went from there, the resulting delta is something like the below.

Completely untested...  crypto much improved since I last looked at it.

---
Whoever wrote aesni-intel_glue.c should be flogged, it did 3
kmap_atomic()s on the same km_type.. it works with the stacked kmap but
still we should not nest 3 kmaps, that's just wrong.

Crypto: mostly cleanup of the out/in_softirq() km_type selection and
cleanup of some of the scatterwalk stuff that had superfluous arguments
etc..

drbd: cleaned up more, could probably be sanitized even further but will
leave that to the maintainers.

mmc: looks like mlin made a boo-boo there, but it also looks like
tmio_mmc_kmap_atomic and co can be somewhat improved..

cassini: removal of useless cas_page_map/unmap wrappers

fc: removed some pointless km_type passing around

rtl8192u: again, someone should be flogged, private (slightly
'improved') copy of scatterwalk.[ch].

zram: removed put/get_ptr_atomic foo

bio: git grep __bio_kmap_atomic/kunmap_atomic turned up empty

---
 arch/x86/crypto/aesni-intel_glue.c   |   24 +-
 crypto/ahash.c   |4 -
 crypto/blkcipher.c   |8 +--
 crypto/ccm.c |4 -
 crypto/scatterwalk.c |9 +--
 crypto/shash.c   |8 +--
 drivers/block/drbd/drbd_bitmap.c |   40 ++---
 drivers/mmc/host/tmio_mmc_dma.c  |4 -
 drivers/mmc/host/tmio_mmc_pio.c  |8 +--
 drivers/net/cassini.c|   26 +--
 drivers/scsi/libfc/fc_fcp.c  |4 -
 drivers/scsi/libfc/fc_libfc.c|2 
 drivers/scsi/libfc/fc_lport.c|2 
 drivers/staging/rtl8192u/ieee80211/cipher.c  |4 -
 drivers/staging/rtl8192u/ieee80211/digest.c  |8 +--
 drivers/staging/rtl8192u/ieee80211/internal.h|   18 ---
 drivers/staging/rtl8192u/ieee80211/scatterwalk.c |   10 ++--
 drivers/staging/rtl8192u/ieee80211/scatterwalk.h |2 
 drivers/staging/zram/xvmalloc.c  |   54 ++-
 include/crypto/scatterwalk.h |   29 
 include/linux/bio.h  |   12 -
 21 files changed, 94 insertions(+), 186 deletions(-)

Index: linux-2.6/arch/x86/crypto/aesni-intel_glue.c
===================================================================
--- linux-2.6.orig/arch/x86/crypto/aesni-intel_glue.c
+++ linux-2.6/arch/x86/crypto/aesni-intel_glue.c
@@ -1106,12 +1106,12 @@ static int __driver_rfc4106_encrypt(stru
one_entry_in_sg = 1;
scatterwalk_start(&src_sg_walk, req->src);
scatterwalk_start(&assoc_sg_walk, req->assoc);
-   src = scatterwalk_map(&src_sg_walk, 0);
-   assoc = scatterwalk_map(&assoc_sg_walk, 0);
+   src = scatterwalk_map(&src_sg_walk);
+   assoc = scatterwalk_map(&assoc_sg_walk);
dst = src;
if (unlikely(req->src != req->dst)) {
scatterwalk_start(&dst_sg_walk, req->dst);
-   dst = scatterwalk_map(&dst_sg_walk, 0);
+   dst = scatterwalk_map(&dst_sg_walk);
}
 
} else {
@@ -1135,11 +1135,11 @@ static int __driver_rfc4106_encrypt(stru
 * back to the packet. */
if (one_entry_in_sg) {
if (unlikely(req->src != req->dst)) {
-   scatterwalk_unmap(dst, 0);
+   kunmap_atomic(dst);
scatterwalk_done(&dst_sg_walk, 0, 0);
}
-   scatterwalk_unmap(src, 0);
-   scatterwalk_unmap(assoc, 0);
+   kunmap_atomic(src);
+   kunmap_atomic(assoc);
scatterwalk_done(&src_sg_walk, 0, 0);
scatterwalk_done(&assoc_sg_walk, 0, 0);
} else {
@@ -1188,12 +1188,12 @@ static int __driver_rfc4106_decrypt(stru
one_entry_in_sg = 1;
scatterwalk_start(&src_sg_walk, req->src);
scatterwalk_start(&assoc_sg_walk, req->assoc);
-   src = scatterwalk_map(&src_sg_walk, 0);
-   assoc = scatterwalk_map(&assoc_sg_walk, 0);
+   src = scatterwalk_map(&src_sg_walk);
+   assoc = scatterwalk_map(&assoc_sg_walk);
dst = src;
if (unlikely(req->src != req->dst)) {
scatterwalk_start(&dst_sg_walk, req->dst);
-   dst = scatterwalk_map(&dst_sg_walk, 0);
+   dst = scatterwalk_map(&dst_sg_walk);
}
 
} else {
@@ -1218,11 +1218,11 @@ static int