Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On Sat, Dec 02, 2017 at 11:15:14AM +, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:11, Ard Biesheuvel wrote:
> > They consume the entire input in a single go, yes. But making it more
> > granular than that is going to hurt performance, unless we introduce
> > some kind of kernel_neon_yield(), which does an end+begin but only if
> > the task is being scheduled out.
> >
> > For example, the SHA256 keeps 256 bytes of round constants in NEON
> > registers, and reloading those from memory for each 64 byte block of
> > input is going to be noticeable. The same applies to the AES code
> > (although the numbers are slightly different)
>
> Something like below should do the trick I think (apologies for the
> patch soup). I.e., check TIF_NEED_RESCHED at a point where only very
> few NEON registers are live, and preserve/restore the live registers
> across calls to kernel_neon_end + kernel_neon_begin. Would that work
> for RT?

Probably yes. The important point is that preempt latencies (and thus
by extension NEON regions) are bounded and preferably small. Unbounded
stuff (whose duration depends on the amount of data fed in) is a
complete no-no for RT, since then you cannot make predictions on how
long things will take.
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On Sat, Dec 02, 2017 at 09:11:46AM +, Ard Biesheuvel wrote:
> On 2 December 2017 at 09:01, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Fri, Dec 01, 2017 at 09:19:22PM +, Ard Biesheuvel wrote:
> >> Note that the remaining crypto drivers simply operate on fixed buffers, so
> >> while the RT crowd may still feel the need to disable those (and the ones
> >> below as well, perhaps), they don't call back into the crypto layer like
> >> the ones updated by this series, and so there's no room for improvement
> >> there AFAICT.
> >
> > Do these other drivers process all the blocks fed to them in one go
> > under a single NEON section, or do they do a single fixed block per
> > NEON invocation?
>
> They consume the entire input in a single go, yes. But making it more
> granular than that is going to hurt performance, unless we introduce
> some kind of kernel_neon_yield(), which does an end+begin but only if
> the task is being scheduled out.

A little something like this:

  https://lkml.kernel.org/r/20171201113235.6tmkwtov5cg2l...@hirez.programming.kicks-ass.net

> For example, the SHA256 keeps 256 bytes of round constants in NEON
> registers, and reloading those from memory for each 64 byte block of
> input is going to be noticeable. The same applies to the AES code
> (although the numbers are slightly different)

Quite. We could augment the above function with a return value that
says whether we actually did an end/begin and registers were clobbered.
Re: [PATCH 0/5] crypto: arm64 - disable NEON across scatterwalk API calls
On Fri, Dec 01, 2017 at 09:19:22PM +, Ard Biesheuvel wrote:
> Note that the remaining crypto drivers simply operate on fixed buffers, so
> while the RT crowd may still feel the need to disable those (and the ones
> below as well, perhaps), they don't call back into the crypto layer like
> the ones updated by this series, and so there's no room for improvement
> there AFAICT.

Do these other drivers process all the blocks fed to them in one go
under a single NEON section, or do they do a single fixed block per
NEON invocation?
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Fri, Mar 24, 2017 at 02:50:24PM +0100, Dmitry Vyukov wrote:
> OK, I guess I should not have referenced the llvm-linux page.
> So here are reasons on our side that I am ready to vouch for:
>
> - clang makes it possible to implement KMSAN (dynamic detection of
>   uses of uninit memory)

How does GCC make this impossible?

> - better code coverage for fuzzing

How so? Why can't the same be achieved using GCC?

> - simpler and faster development (e.g. we can port our user-space
>   hardening technologies -- CFI and SafeStack)

That's just because you've already implemented this in clang, right?
So less work for you. Not because it's impossible.
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Fri, Mar 24, 2017 at 02:47:15PM +0100, Dmitry Vyukov wrote:
> > Seriously, you should have taken the hack the first time that this
> > needs to be fixed. Just because this is a fairly uncommon construct
> > in the kernel doesn't mean it is not in userspace.
>
> There is a reason why it is fairly uncommon in the kernel.

So, first off: it's not entirely clear that the code as it exists is
correct. From a cursory reading of it and the surrounding code, there
is no actual upper limit on the variable. If I were stupid enough to
make a raid with 64 devices I'd get a huge on-stack structure. Since
you're touching it, you should check these things.

And secondly, refactor the code to not look like dog vomit. You can do
more than the absolute minimal patch to make it compile, I'm sure.
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Fri, Mar 17, 2017 at 08:05:16PM +0100, Dmitry Vyukov wrote:
> You can also find some reasons in the Why section of the LLVM-Linux project:
> http://llvm.linuxfoundation.org/index.php/Main_Page

From that:

 - LLVM/Clang is a fast moving project with many things fixed quickly
   and features added.

So what's the deal with that 5 year old bug you want us to work around?

Also, clang doesn't support asm "cc" flag outputs and a few other
extensions, last time I checked.
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Fri, Mar 17, 2017 at 08:26:42PM +0100, Peter Zijlstra wrote:
> On Fri, Mar 17, 2017 at 08:05:16PM +0100, Dmitry Vyukov wrote:
> > You can also find some reasons in the Why section of the LLVM-Linux project:
> > http://llvm.linuxfoundation.org/index.php/Main_Page
>
> From that:
>
>  - LLVM/Clang is a fast moving project with many things fixed quickly
>    and features added.
>
> So what's the deal with that 5 year old bug you want us to work around?
>
> Also, clang doesn't support asm cc flags output and a few other
> extensions last time I checked.

Another great one:

 - BSD License (some people prefer this license to the GPL)

Seems a very weak argument to make when talking about the Linux kernel,
which is very explicitly GPLv2 (and not later).
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Fri, Mar 17, 2017 at 11:52:01AM -0700, Michael Davidson wrote:
> On Fri, Mar 17, 2017 at 5:44 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > Be that as it may; what you construct above is disgusting. Surely the
> > code can be refactored to not look like dog vomit?
> >
> > Also; its not immediately obvious conf->copies is 'small' and this
> > doesn't blow up the stack; I feel that deserves a comment somewhere.
>
> I agree that the code is horrible.
>
> It is, in fact, exactly the same solution that was used to remove
> variable length arrays in structs from several of the crypto drivers a
> few years ago - see the definition of SHASH_DESC_ON_STACK() in
> "crypto/hash.h" - I did not, however, hide the horrors in a macro,
> preferring to leave the implementation visible as a warning to whoever
> might touch the code next.
>
> I believe that the actual stack usage is exactly the same as it was
> previously.
>
> I can certainly wrap this up in a macro and add comments with
> appropriately dire warnings in it if you feel that is both necessary
> and sufficient.

We got away with ugly in the past, so we should get to do it again?
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Fri, Mar 17, 2017 at 01:31:23PM +0100, Alexander Potapenko wrote:
> On Fri, Mar 17, 2017 at 1:08 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> > On Thu, Mar 16, 2017 at 05:15:19PM -0700, Michael Davidson wrote:
> >> Replace a variable length array in a struct by allocating
> >> the memory for the entire struct in a char array on the stack.
> >>
> >> Signed-off-by: Michael Davidson <m...@google.com>
> >> ---
> >>  drivers/md/raid10.c | 9 ++++-----
> >>  1 file changed, 4 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> >> index 063c43d83b72..158ebdff782c 100644
> >> --- a/drivers/md/raid10.c
> >> +++ b/drivers/md/raid10.c
> >> @@ -4654,11 +4654,10 @@ static int handle_reshape_read_error(struct mddev *mddev,
> >>  	/* Use sync reads to get the blocks from somewhere else */
> >>  	int sectors = r10_bio->sectors;
> >>  	struct r10conf *conf = mddev->private;
> >> -	struct {
> >> -		struct r10bio r10_bio;
> >> -		struct r10dev devs[conf->copies];
> >> -	} on_stack;
> >> -	struct r10bio *r10b = &on_stack.r10_bio;
> >> +	char on_stack_r10_bio[sizeof(struct r10bio) +
> >> +			      conf->copies * sizeof(struct r10dev)]
> >> +		__aligned(__alignof__(struct r10bio));
> >> +	struct r10bio *r10b = (struct r10bio *)on_stack_r10_bio;
> >>  	int slot = 0;
> >>  	int idx = 0;
> >>  	struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;
> >
> > That's disgusting. Why not fix LLVM to support this?
>
> IIUC there's only a handful of VLAIS instances in LLVM code, why not
> just drop them for the sake of better code portability?
> (To quote Linus, "this feature is an abomination":
> https://lkml.org/lkml/2013/9/23/500)

Be that as it may; what you construct above is disgusting. Surely the
code can be refactored to not look like dog vomit?

Also; its not immediately obvious conf->copies is 'small' and this
doesn't blow up the stack; I feel that deserves a comment somewhere.
Re: [PATCH 6/7] md/raid10, LLVM: get rid of variable length array
On Thu, Mar 16, 2017 at 05:15:19PM -0700, Michael Davidson wrote:
> Replace a variable length array in a struct by allocating
> the memory for the entire struct in a char array on the stack.
>
> Signed-off-by: Michael Davidson <m...@google.com>
> ---
>  drivers/md/raid10.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 063c43d83b72..158ebdff782c 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4654,11 +4654,10 @@ static int handle_reshape_read_error(struct mddev *mddev,
>  	/* Use sync reads to get the blocks from somewhere else */
>  	int sectors = r10_bio->sectors;
>  	struct r10conf *conf = mddev->private;
> -	struct {
> -		struct r10bio r10_bio;
> -		struct r10dev devs[conf->copies];
> -	} on_stack;
> -	struct r10bio *r10b = &on_stack.r10_bio;
> +	char on_stack_r10_bio[sizeof(struct r10bio) +
> +			      conf->copies * sizeof(struct r10dev)]
> +		__aligned(__alignof__(struct r10bio));
> +	struct r10bio *r10b = (struct r10bio *)on_stack_r10_bio;
>  	int slot = 0;
>  	int idx = 0;
>  	struct bio_vec *bvec = r10_bio->master_bio->bi_io_vec;

That's disgusting. Why not fix LLVM to support this?
Re: [PATCH 5/7] x86, boot, LLVM: Use regparm=0 for memcpy and memset
On Thu, Mar 16, 2017 at 05:15:18PM -0700, Michael Davidson wrote:
> Use the standard regparm=0 calling convention for memcpy and
> memset when building with clang.
>
> This is a workaround for a long-standing clang bug
> (see https://llvm.org/bugs/show_bug.cgi?id=3997) where
> clang always uses the standard regparm=0 calling convention
> for any implicit calls to memcpy and memset that it generates
> (e.g. for structure assignments and initialization) even if an
> alternate calling convention such as regparm=3 has been specified.

Seriously, fix LLVM already.
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
On Thu, Dec 15, 2016 at 09:43:04PM +0100, Jason A. Donenfeld wrote:
> On Thu, Dec 15, 2016 at 9:31 PM, Hannes Frederic Sowa wrote:
> > ARM64 and x86-64 have memory operations that are not vector operations
> > that operate on 128 bit memory.
>
> Fair enough. imull I guess.

imull is into rdx:rax, not memory. I suspect he's talking about
cmpxchg16b.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function
On Thu, Dec 15, 2016 at 07:50:36PM +0100, Jason A. Donenfeld wrote:
> There's no 32-bit platform
> that will trap on a 64-bit unaligned access because there's no such
> thing as a 64-bit access there. In short, we're fine.

ARMv7 LPAE is a 32-bit architecture that has 64-bit load/stores, IIRC.

x86 has cmpxchg8b that can do 64-bit things and very much wants the u64
aligned.

Also, IIRC we have a few platforms where u64 doesn't carry 8 byte
alignment, m68k or something like that, but yes, you likely don't care.

Just to make life interesting...
Re: single_task_running() vs. preemption warnings (was Re: [PATCH] kvm: fix preemption warnings in kvm_vcpu_block)
On Thu, Sep 17, 2015 at 01:32:55PM -0700, Tim Chen wrote:
> I have no objection to changing single_task_running to use
> raw_smp_processor_id. The worker in mcryptd is bound to the cpu so it
> has no migration/preemption issue. So it shouldn't care which
> smp_processor_id version is being used. Yes, please add a comment to
> alert the user of this caveat should you change single_task_running.

We actually have raw_rq() for that, and the whole if thing looks rather
superfluous. So something like the below, except with a suitable
comment on it and tested etc.. ;-)

---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..f39c0498e284 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2666,10 +2666,7 @@ unsigned long nr_running(void)
  */
 bool single_task_running(void)
 {
-	if (cpu_rq(smp_processor_id())->nr_running == 1)
-		return true;
-	else
-		return false;
+	return raw_rq()->nr_running == 1;
 }
 EXPORT_SYMBOL(single_task_running);
Re: randconfig build error with next-20150529, in crypto/jitterentropy.c
Adding Stephan to Cc.

On Mon, 2015-06-08 at 20:25 +0800, Herbert Xu wrote:
> On Mon, Jun 08, 2015 at 03:21:09PM +0300, Andy Shevchenko wrote:
> > On Fri, May 29, 2015 at 10:14 PM, Jim Davis <jim.ep...@gmail.com> wrote:
> > > Building with the attached random configuration file,
> >
> > Hit the very same error against next-20150605. The issue with that
> > file is we have no compiler optimization enabled. So, guys, how would
> > you recommend to fix it?
>
> Stephan, can we get rid of the no optimisation requirement?

Would something like

	#pragma GCC push_options
	#pragma GCC optimize ("-O0")
	static __u64 jent_fold_time(struct rand_data *ec, __u64 time,
				    __u64 *folded, __u64 loop_cnt)
	{
		...
	}
	#pragma GCC pop_options

be an option to allow the file to be compiled with regular
optimizations enabled?
Re: [PATCH v5 2/7] crypto: SHA1 multibuffer crypto hash infrastructure
On Tue, Jul 29, 2014 at 03:28:25PM -0700, Tim Chen wrote:
> Peter,
>
> Is my explanation adequate, or do you still have objections to the
> implementation? I'm trying to decide here whether to extend our batch
> processing by the crypto daemon (prolong our busy period) based on
> whether there are other tasks to run. Pre-emption and resuming the
> crypto daemon is okay. Even if there's a pre-emption and context switch
> in between, we can still do extended processing if there's nothing else
> to run. Otherwise the cpu will go idle.

I suppose.. I still don't really like it, but what the heck. That
single_task_running thing is hard to use and a weird interface, so
let's hope people don't find more 'creative' uses for it.
Re: [PATCH v5 2/7] crypto: SHA1 multibuffer crypto hash infrastructure
On Tue, Jul 22, 2014 at 03:09:32PM -0700, Tim Chen wrote:

> +/* Called in workqueue context, do one real cryption work (via
> + * req->complete) and reschedule itself if there are more work to
> + * do. */

You seem to manage the 'normal' comment style in other places; is this
one 'special' for a reason?

> +static void mcryptd_queue_worker(struct work_struct *work)
> +{
> +	struct mcryptd_cpu_queue *cpu_queue;
> +	struct crypto_async_request *req, *backlog;
> +	int i;
> +
> +	/*
> +	 * Need to loop through more than once for multi-buffer to
> +	 * be effective.
> +	 */
> +
> +	cpu_queue = container_of(work, struct mcryptd_cpu_queue, work);
> +	i = 0;
> +	while (i < MCRYPTD_BATCH || single_task_running()) {
> +		/*
> +		 * preempt_disable/enable is used to prevent
> +		 * being preempted by mcryptd_enqueue_request()
> +		 */
> +		local_bh_disable();
> +		preempt_disable();
> +		backlog = crypto_get_backlog(&cpu_queue->queue);
> +		req = crypto_dequeue_request(&cpu_queue->queue);
> +		preempt_enable();
> +		local_bh_enable();
> +
> +		if (!req)
> +			return;
> +
> +		if (backlog)
> +			backlog->complete(backlog, -EINPROGRESS);
> +		req->complete(req, 0);
> +		if (!cpu_queue->queue.qlen)
> +			return;
> +		++i;
> +	}
> +	if (cpu_queue->queue.qlen)
> +		queue_work(kcrypto_wq, &cpu_queue->work);
> +}

Right, so I don't think you need that single_task_running() thing for
this. Also it's a rather inconclusive test: a task can come in while
processing the request, preempt the processing, complete, and by the
time we're back to check if we want to continue the loop it's only us
again. So it's actually misleading..

You could do that, but then you have to keep preemption disabled across
the entire thing and break out on need_resched(); that would entirely
obviate the need for single_task_running(), but would increase the
preemption latency by however long the req processing takes, so that's
bad too.
But you can get similar 'fuzzy' semantics as above by using something
like:

static inline bool did_context_switch(unsigned long *nivcs)
{
	if (*nivcs != current->nivcs) {
		*nivcs = current->nivcs;
		return true;
	}
	return false;
}

static void mcryptd_queue_worker(...)
{
	...
	unsigned long nivcs = current->nivcs;
	...
	while (!did_context_switch(&nivcs) || i < MCRYPTD_BATCH)
	...
}

It'll terminate earlier I suppose; it's the inverse test. But I think
you want to terminate early if there's any activity.
Re: [PATCH v5 3/7] crypto: SHA1 multibuffer crypto opportunistic flush
On Tue, Jul 22, 2014 at 03:09:35PM -0700, Tim Chen wrote:
> The crypto daemon can take advantage of available cpu cycles to flush
> any unfinished jobs if it is the only task running on the cpu, and
> there are no more crypto jobs to process.

You conveniently forgot to mention energy efficiency, and the problem
with short idle durations. Although the latter issue seems overshadowed
by the not so very robust idle detection.

In any case, same as the other patch: you can get similarly
inconclusive semantics without single_task_running().

> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
> ---
>  crypto/mcryptd.c | 39 ++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 36 insertions(+), 3 deletions(-)

> +/*
> + * Called in workqueue context, do one real cryption work (via
>    * req->complete) and reschedule itself if there are more work to
> - * do. */
> + * do.
> + */

See, this should've been in the previous patch :-)
Re: [PATCH v5 2/7] crypto: SHA1 multibuffer crypto hash infrastructure
On Tue, Jul 22, 2014 at 03:09:32PM -0700, Tim Chen wrote:
> This patch introduces the multi-buffer crypto daemon which is
> responsible for submitting crypto jobs in a work queue to the
> responsible multi-buffer crypto algorithm. The idea of the
> multi-buffer algorithm is to put data streams from multiple jobs in a
> wide (AVX2) register and then take advantage of SIMD instructions to
> do crypto computation on several buffers simultaneously.
>
> The multi-buffer crypto daemon is also responsible for flushing the
> remaining buffers to complete the computation if no new buffers arrive
> for a while.

How is any of this SHA1 specific, as the Subject says it is?
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 12:50:50PM -0700, Tim Chen wrote:
> There is a generic multi-buffer infrastructure portion that manages
> pulling and queuing jobs on the crypto workqueue, and it is separated
> out in patch 1 of the patchset.

There's one very weird multi-line comment in that patch.

> The other portions are algorithm specific that defines algorithm
> specific data structure and does the crypto computation for a
> particular algorithm, mostly in assemblies and C glue code. The
> infrastructure code is meant to be reused for other similar
> multi-buffer algorithms.

The flushing part that uses the sched thing is sha1 specific, even
though it strikes me as not being so. Flushing buffers on idle seems
like a 'generic' thing.

> We use nr_running_cpu to check whether there are other tasks running
> on the *current* cpu, (not for another cpu),

And yet, the function allows you to do exactly that..

> to decide if we should flush and compute crypto jobs accumulated. If
> there's nobody else running, we can take advantage of available cpu
> cycles on the cpu we are running on to do computation on the existing
> jobs in a SIMD manner. Waiting a bit longer may accumulate more jobs
> to process in parallel in a single SIMD instruction, but will have
> more delay.

So you already have an idle notifier (which is x86 only, we should fix
that I suppose), and you then double check there really isn't anything
else running. How much, if anything, does that second check buy you?
There's just not a single word on that.

Also, there is not a word on the latency vs throughput tradeoff you
make. I can imagine that for very short idle durations you lose, not
win, with this thing.

So for now I still see no reason for doing this.

Also, I wonder about SMT: the point of this is to make best use of the
SIMD pipelines, so does it still make sense to use siblings at the same
time, even though you're running hand crafted ASM to stuff the
pipelines to the brim? Should this thing be SMT aware and not gather
queues for both siblings?
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Tue, Jul 15, 2014 at 11:50:45AM +0200, Peter Zijlstra wrote:
> So you already have an idle notifier (which is x86 only, we should fix
> that I suppose), and you then double check there really isn't anything
> else running.

Note that we've already done a large part of the expense of going idle
by the time we call that idle notifier -- in specific, we've
reprogrammed the clock to stop the tick. It's really wasteful to then
generate work again, which means we have to again reprogram the clock,
etc.
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Tue, Jul 15, 2014 at 04:45:25PM +0200, Mike Galbraith wrote:
> 3.0.101-default    3.753363 usecs/loop -- avg 3.770737    530.4 KHz   1.000
> 3.14.10-default    4.145348 usecs/loop -- avg 4.139987    483.1 KHz    .910   1.000
> 3.15.4-default     4.355594 usecs/loop -- avg 4.351961    459.6 KHz    .866    .951   1.000
> 3.16.0-default     4.537279 usecs/loop -- avg 4.543532    440.2 KHz    .829    .911    .957
>
> 3.0.101-smp        3.692377 usecs/loop -- avg 3.690774    541.9 KHz   1.000
> 3.14.10-smp        4.010009 usecs/loop -- avg 4.009019    498.9 KHz    .920
> 3.15.4-smp         3.882398 usecs/loop -- avg 3.884095    514.9 KHz    .950
> 3.16.0-master      4.061003 usecs/loop -- avg 4.058244    492.8 KHz    .909

Urgh,.. I need to go fix that :/
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 09:10:14AM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 12:16 +0200, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 01:33:04PM -0700, Tim Chen wrote:
> > > This function will help a thread decide if it wants to do work that
> > > can be delayed, to accumulate more tasks for more efficient batch
> > > processing later. However, if no other tasks are running on the cpu,
> > > it can take advantage of the available cpu cycles to complete the
> > > tasks for immediate processing to minimize delay, otherwise it will
> > > yield.
> >
> > Ugh.. and ignore topology and everything else. Yet another scheduler
> > on top of the scheduler.
> >
> > We have the padata muck, also only ever used by crypto. We have the
> > workqueue nonsense, used all over the place. And we have btrfs doing
> > their own padata like muck. And I'm sure there's at least one more
> > out there, just because.
> >
> > Why do we want yet another thing? I'm inclined to go NAK and get
> > people to reduce the amount of async queueing and processing crap.
>
> The multi-buffer class of crypto algorithms is by nature asynchronous.
> The algorithm gathers several crypto jobs, and puts the buffer from
> each job in a data lane of the SIMD register. This allows for parallel
> processing and increases throughput. The gathering of the crypto jobs
> is an async process and queuing is necessary for this class of
> algorithm.

How is that related to me saying we've got too much of this crap
already?
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 10:05:34AM -0700, Tim Chen wrote:
> I was trying to explain why the algorithm is implemented this way
> because of its batching nature. There is a whole class of async
> algorithms that can provide substantial speedup by doing batch
> processing and using workqueues. The multi-buffer sha1 version has
> 2.2x speedup over the existing AVX2 version, and can have even more
> speedup when AVX3 comes round. Workqueue is a natural way to implement
> this. I don't think a throughput speedup of 2.2x is crap.
>
> We are not inventing anything new, but ask for a very simple helper
> function to know if there's something else running on our cpu to help
> us make a better decision of whether we should flush the batched jobs
> immediately.
>
> And also the asynchronous crypto interface is already used
> substantially in crypto and has a well established infrastructure.

The crap I was talking about is that there's a metric ton of 'async'
interfaces, all different. Your multi-buffer thing isn't generic
either, it seems limited to sha1.

It does not reuse padata, it does not extend workqueues, it does not
remove the btrfs nonsense; it adds yet another thing.
Re: [PATCH v4 6/7] sched: add function nr_running_cpu to expose number of tasks running on cpu
On Mon, Jul 14, 2014 at 12:08:28PM -0700, Tim Chen wrote:
> On Mon, 2014-07-14 at 20:17 +0200, Peter Zijlstra wrote:
> > Your multi-buffer thing isn't generic either, it seems limited to
> > sha1.
>
> We actually have many other multi-buffer crypto algorithms already
> published for encryption and other IPSec usages. So multi-buffer
> algorithm is not just limited to SHA1. We hope to port those to the
> kernel crypto library eventually.
> http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

That's all nice and such; but the code as I've seen it in these patches
is very much sha1 specific. The mb part isn't separated out.

> > It does not reuse padata,
>
> padata tries to speed things up by parallelizing jobs to *multiple*
> cpus. Whereas multi-buffer tries to speed things up by using multiple
> data lanes in a SIMD register on a *single* cpu. These two usages are
> complementary but not the same.

And if it's single cpu, wth do you need that nr_running thing for
another cpu for? Also, this difference wasn't clear to me.

I still loathe all the async work, because it makes a mockery of
accounting etc.. but that's a story for another day I suppose :-(
Re: [PATCH] random: add blocking facility to urandom
On Mon, 2011-09-12 at 09:56 -0400, Jarod Wilson wrote:
> Thomas Gleixner wrote:
> > Well, there is enough proof out there that the hardware you're using
> > is a perfect random number generator by itself. So stop complaining
> > about not having access to TPM chips if you can create an entropy
> > source just by (ab)using the inherent randomness of modern CPU
> > architectures to refill your entropy pool on the fly when the need
> > arises w/o imposing completely unintuitive thresholds and user
> > visible API changes.
>
> We started out going down that path:
>
> http://www.mail-archive.com/linux-crypto@vger.kernel.org/msg05778.html
>
> We hit a bit of a roadblock with it though.

Have you guys seen this work:

  http://lwn.net/images/conf/rtlws11/random-hardware.pdf
Re: [PATCH 1/2] mm: convert k{un}map_atomic(p, KM_type) to k{un}map_atomic(p)
On Fri, 2011-08-26 at 12:42 -0700, Andrew Morton wrote:
> Perhaps you could dust off your old patch and we'll bring it up to date?

Most of it would be doing what mlin just did, so I took his patch and
went from there; the resulting delta is something like the below.
Completely untested... crypto much improved since I last looked at it.

---

Whoever wrote aesni-intel_glue.c should be flogged, it did 3
kmap_atomic()s on the same km_type.. it works with the stacked kmap but
still we should not nest 3 kmaps, that's just wrong.

Crypto: mostly cleanup of the out/in_softirq() km_type selection and
cleanup of some of the scatterwalk stuff that had superfluous arguments
etc..

drbd: cleaned up more, could probably be sanitized even further but
will leave that to the maintainers.

mmc: looks like mlin made a boo-boo there, but it also looks like
tmio_mmc_kmap_atomic and co can be somewhat improved..

cassini: removal of useless cas_page_map/unmap wrappers

fc: removed some pointless km_type passing around

rtl8192u: again, someone should be flogged, private (slightly
'improved') copy of scatterwalk.[ch].

zram: removed put/get_ptr_atomic foo

bio: git grep __bio_kmap_atomic/kunmap_atomic turned up empty

---
 arch/x86/crypto/aesni-intel_glue.c               |   24 +-
 crypto/ahash.c                                   |    4 -
 crypto/blkcipher.c                               |    8 +--
 crypto/ccm.c                                     |    4 -
 crypto/scatterwalk.c                             |    9 +--
 crypto/shash.c                                   |    8 +--
 drivers/block/drbd/drbd_bitmap.c                 |   40 ++---
 drivers/mmc/host/tmio_mmc_dma.c                  |    4 -
 drivers/mmc/host/tmio_mmc_pio.c                  |    8 +--
 drivers/net/cassini.c                            |   26 +--
 drivers/scsi/libfc/fc_fcp.c                      |    4 -
 drivers/scsi/libfc/fc_libfc.c                    |    2
 drivers/scsi/libfc/fc_lport.c                    |    2
 drivers/staging/rtl8192u/ieee80211/cipher.c      |    4 -
 drivers/staging/rtl8192u/ieee80211/digest.c      |    8 +--
 drivers/staging/rtl8192u/ieee80211/internal.h    |   18 ---
 drivers/staging/rtl8192u/ieee80211/scatterwalk.c |   10 ++--
 drivers/staging/rtl8192u/ieee80211/scatterwalk.h |    2
 drivers/staging/zram/xvmalloc.c                  |   54 ++-
 include/crypto/scatterwalk.h                     |   29
 include/linux/bio.h                              |   12 -
 21 files changed, 94 insertions(+), 186 deletions(-)

Index: linux-2.6/arch/x86/crypto/aesni-intel_glue.c
===================================================================
--- linux-2.6.orig/arch/x86/crypto/aesni-intel_glue.c
+++ linux-2.6/arch/x86/crypto/aesni-intel_glue.c
@@ -1106,12 +1106,12 @@ static int __driver_rfc4106_encrypt(stru
 		one_entry_in_sg = 1;
 		scatterwalk_start(&src_sg_walk, req->src);
 		scatterwalk_start(&assoc_sg_walk, req->assoc);
-		src = scatterwalk_map(&src_sg_walk, 0);
-		assoc = scatterwalk_map(&assoc_sg_walk, 0);
+		src = scatterwalk_map(&src_sg_walk);
+		assoc = scatterwalk_map(&assoc_sg_walk);
 		dst = src;
 		if (unlikely(req->src != req->dst)) {
 			scatterwalk_start(&dst_sg_walk, req->dst);
-			dst = scatterwalk_map(&dst_sg_walk, 0);
+			dst = scatterwalk_map(&dst_sg_walk);
 		}
 	} else {
@@ -1135,11 +1135,11 @@ static int __driver_rfc4106_encrypt(stru
 	 * back to the packet. */
 	if (one_entry_in_sg) {
 		if (unlikely(req->src != req->dst)) {
-			scatterwalk_unmap(dst, 0);
+			kunmap_atomic(dst);
 			scatterwalk_done(&dst_sg_walk, 0, 0);
 		}
-		scatterwalk_unmap(src, 0);
-		scatterwalk_unmap(assoc, 0);
+		kunmap_atomic(src);
+		kunmap_atomic(assoc);
 		scatterwalk_done(&src_sg_walk, 0, 0);
 		scatterwalk_done(&assoc_sg_walk, 0, 0);
 	} else {
@@ -1188,12 +1188,12 @@ static int __driver_rfc4106_decrypt(stru
 		one_entry_in_sg = 1;
 		scatterwalk_start(&src_sg_walk, req->src);
 		scatterwalk_start(&assoc_sg_walk, req->assoc);
-		src = scatterwalk_map(&src_sg_walk, 0);
-		assoc = scatterwalk_map(&assoc_sg_walk, 0);
+		src = scatterwalk_map(&src_sg_walk);
+		assoc = scatterwalk_map(&assoc_sg_walk);
 		dst = src;
 		if (unlikely(req->src != req->dst)) {
 			scatterwalk_start(&dst_sg_walk, req->dst);
-			dst = scatterwalk_map(&dst_sg_walk, 0);
+			dst = scatterwalk_map(&dst_sg_walk);
 		}
 	} else {
@@ -1218,11 +1218,11 @@ static int