Re: [akaros] Code Review - Add MSR device to devarch

'Davide Libenzi' via Akaros Mon, 23 Nov 2015 20:17:57 -0800

On Mon, Nov 23, 2015 at 12:42 PM, Barret Rhoden <[email protected]>
wrote:


>
> > From b8425da6e7813b9f0c6cb2e1ebe0124236e9eeba Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 11:25:39 -0800
> > Subject: Added completion data structure
> >
> > diff --git a/kern/src/completion.c b/kern/src/completion.c
> > new file mode 100644
> > index 000000000000..40e922ab8120
> > --- /dev/null
> > +++ b/kern/src/completion.c
> > @@ -0,0 +1,30 @@
> > +/* Copyright (c) 2015 Google Inc
> > + * Davide Libenzi <[email protected]>
> > + * See LICENSE for details.
> > + */
> > +
> > +#include <kthread.h>
> > +#include <completion.h>
> > +
> > +void completion_init(struct completion *comp, int count)
> > +{
> > +     cv_init(&comp->cv);
>
> If you want to signal completion from IRQ context, then we'll need these
> CVs to
> be irqsave.  In that case, you'll need to use the irqsave variants for
> most of
> the CV operations (e.g. cv_lock_irqsave).  If you don't, you'll run the
> risk of
> deadlocking, and CONFIG_SPINLOCK_DEBUG will complain.
>

Sure, let's turn them irqsave.



>
> > +     comp->count = count;
> > +}
> > +
> > +void completion_complete(struct completion *comp, int how_much)
> > +{
> > +     cv_lock(&comp->cv);
> > +     comp->count -= how_much;
> > +     if (comp->count <= 0)
> > +             __cv_broadcast(&comp->cv);
>
> Would it be a bug for comp->count to ever be less than 0?  (Depends on how
> this
> is used).
>

For this use, it would be a bug. it might be better to keep it such that
the count is a threshold, more than a limit.




>
> > +     cv_unlock(&comp->cv);
> > +}
> > +
> > +void completion_wait(struct completion *comp)
> > +{
> > +     cv_lock(&comp->cv);
> > +     if (comp->count > 0)
> > +             cv_wait(&comp->cv);
>
> This is fine as is.
>

This seems racy, in general. In this case, we never issue to the local CPU,
otherwise you could do cv_lock(), get an IRQ which does cv_lock_irqsave(),
and get a deadlock.
IMO that cv_lock() should be irqsave. Like this?

void completion_wait(struct completion *comp)
{
    int8_t state;

    cv_lock_irqsave(&comp->cv, &state);
    if (comp->count > 0)
        cv_wait_and_unlock(&comp->cv);
    enable_irqsave(&state);
}



>
> As an FYI, if you want to be a little faster, you can use
> cv_wait_and_unlock,
> then immediately return (careful of IRQs).  And/or you could do a racy
> read of
> comp->count to see if it's already 0 without grabbing the lock.
>

Yes, thanks.



> Subject: Added CPU set data structure
>
> Do we want to call these cpu_sets or core_sets?  In general, we've
> referred to
> logical processors as 'cores' or 'pcores' throughout the codebase.  Though
> we
> also have been a little inconsistent with it.
>

OK, renaming to core_set




>
> > diff --git a/kern/include/cpu_set.h b/kern/include/cpu_set.h
>
> > +#define CPUSET_INT_BITS (sizeof(cpu_set_int_t) * 8)
> > +
> > +typedef uint64_t cpu_set_int_t;
> > +
> > +struct cpu_set {
> > +     cpu_set_int_t cpus[(MAX_NUM_CORES + CPUSET_INT_BITS - 1) /
> CPUSET_INT_BITS];
>
> We have a helper macro for the array size calculation:
>         cpus[DIV_ROUND_UP(MAX_NUM_CORES, CPUSET_INT_BITS)];
>
> > +static inline void cpu_set_setcpu(struct cpu_set *cset, unsigned int
> cpuno)
> > +{
> > +     cset->cpus[cpuno / CPUSET_INT_BITS] |= 1 << (cpuno %
> CPUSET_INT_BITS);
> > +}
>
> This stuff all looks correct.  But we should use bitops for it instead of
> rolling our own.  We brought in many of Linux's bitops (k/i/bitops.h,
> k/a/x/bitops.h).  If there's a general-purpose bitop we're missing that you
> need, then we should add it to bitops.h
>

It is missing everything. get/set/clear
The operations in bitops seem to be more special operations, and targeting
a single unsigned long.
I added mem_get_bit, mem_set_bit, mem_clear_bit to bitops.h.




>
>
> > From 6781115d82ce87a64894ec773b0d69671ebb84ed Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 11:33:56 -0800
> > Subject: Added APIs to read and write MSR values on multiple CPUs
> >
>
> > --- a/kern/arch/x86/devarch.c
> > +++ b/kern/arch/x86/devarch.c
>
> > +uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> > +                                               size_t *nvalues)
> > +{
> > +     int cpu = core_id();
> > +     struct smp_read_values srv;
> > +
> > +     srv.values = kzmalloc(num_cores * sizeof(*srv.values),
> KMALLOC_WAIT);
> > +     if (srv.values) {
>
> With KMALLOC_WAIT, kmalloc always succeeds.
>

This has already been changed in continuation branch. Will leave it as is
here.




>
> > +             completion_init(&srv.comp, cpu_set_remote_count(cset));
> > +             for (int i = 0; i < num_cores; i++) {
> > +                     if (cpu_set_getcpu(cset, i)) {
> > +                             if (i == cpu)
> > +                                     srv.values[i] = read_msr(addr);
> > +                             else
> > +                                     send_kernel_message(i,
> cpu_read_msr, (long) &srv, addr, 0,
> > +
>      KMSG_ROUTINE);
> > +                     }
> > +             }
> > +             completion_wait(&srv.comp);
>
> This whole block should be done with smp_do_in_cpus.
>

Ditto.


> +     completion_init(&comp, cpu_set_remote_count(cset));
> > +     for (int i = 0; i < num_cores; i++) {
> > +             if (cpu_set_getcpu(cset, i)) {
> > +                     if (i == cpu)
> > +                             write_msr(addr, value);
> > +                     else
> > +                             send_kernel_message(i, cpu_write_msr,
> (long) &comp, addr, value,
> > +
>  KMSG_ROUTINE);
> > +             }
> > +     }
> > +     completion_wait(&comp);
>
> This whole block should be done with smp_do_in_cpus.
>

Ditto.




>
> > From 008c4762727a39ad8e5c5dd2265317445ddb86bd Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 16:44:41 -0800
> > Subject: Cleaned up devarch.c code to remove unused code
>
> > -static char *devname(void)
> > -{
> > -     return archdevtab.name;
> > -}
>
> Not a big deal, but the reason I had this devname() helper is that many
> devices
> borrow code from each other.  devarch only used it once, but others use
> devname
> a bunch of times.  When that code gets reused for a new device, we can
> leave
> devname as is.  e.g. copying error messages around like:
>         set_error(ENOSYS, "Can't tap #%s file type %d", devname(), type);
>

There wasn't a single user of that, this is why I dropped it.



> -     Qmem,
> > +     Qrealmem,
>
> So in the last cleanup patch you removed realmem and make it read up to
> max_paddr bytes.  In this patch, you undid that.  Ideally, the parts of
> this patch that made that change would be squashed into the previous
> patch.
>
> You can do this with git rebase -i.  For this commit, choose "e" (edit),
> then:
> $ git reset HEAD^ # unstage this entire commit's changes
> $ git add -p # selectively pick what should be squashed
> $ git commit --amend # push those changes into the previous commit
>         or
>         git commit -m WIP-squash # make a separate commit, fix it up later
> $ git add -p # add everything for *this* commit
> $ git commit -c ca14627 # commit with the original message
> $ git rebase --continue
>
> >  uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> >                                                 size_t *nvalues)
> >  {
> > -     int cpu = core_id();
> >       struct smp_read_values srv;
> >
> > -     srv.values = kzmalloc(num_cores * sizeof(*srv.values),
> KMALLOC_WAIT);
> > +     srv.addr = addr;
> > +     srv.values = kzmalloc(num_cores * sizeof(*srv.values), 0);
> >       if (srv.values) {
> > -             completion_init(&srv.comp, cpu_set_remote_count(cset));
> > -             for (int i = 0; i < num_cores; i++) {
> > -                     if (cpu_set_getcpu(cset, i)) {
> > -                             if (i == cpu)
> > -                                     srv.values[i] = read_msr(addr);
> > -                             else
> > -                                     send_kernel_message(i,
> cpu_read_msr, (long) &srv, addr, 0,
> > -
>      KMSG_ROUTINE);
> > -                     }
> > -             }
> > -             completion_wait(&srv.comp);
> > +             smp_do_in_cpus(cset, cpu_read_msr, &srv);
>
> This is exactly what I said to do for the earlier patch.  I'm glad you
> did it, but I would be even happier if you fixed up the git history so
> that it was like that originally.
>
> It may seem trivial, but not only does it keep the git history clean (so
> that we can easily see how the code evolved), but it saves me time.  I
> was making comments on code that had already been fixed.
>
> So now it looks like there's two parts of this commit that should be in
> other commits.  That's something that can be fixed with multiple runs of
> git rebase -i, as described above.
>

Will try. /shivering ...



>
> > @@ -397,10 +386,27 @@ static long archread(struct chan *c, void *a, long
> n, int64_t offset)
> >                       return n;
> >               case Qioalloc:
> >                       break;
> > -             case Qmem:
> > -                     printk("readmem %p %p %p %p %p\n", offset, a, n,
> KADDR(0),
> > -                                max_paddr);
> > -                     return readmem(offset, a, n, KADDR(0), max_paddr);
> > +             case Qrealmem:
> > +                     return readmem(offset, a, n, KADDR(0),
> REAL_MEM_SIZE);
>
> This change (realmen) is right next to Qmsr.  When you try to fix this
> with a rebase and do a git add -p, git won't be able to figure out that
> these are separate.  You'll have to specify 'e' and manually edit the
> hunk when you git add.  It should be pretty straightforward.
>

I will try. I might fsck it up 😀



> +             case Qmsr:
> > +                     cpu_set_init(&cset);
> > +                     cpu_set_fill_available(&cset);
> > +                     values = cpuset_read_msr(&cset, (uint32_t) offset,
> &nvalues);
> > +                     if (values) {
> > +                             if (n >= nvalues * sizeof(uint64_t)) {
> > +                                     if (memcpy_to_user_errno(current,
> a, values,
> > +
>               nvalues * sizeof(uint64_t)))
> > +                                             n = -1;
> > +                                     else
> > +                                             n = nvalues *
> sizeof(uint64_t);
> > +                             } else {
> > +                                     n = 0;
>
> So if the user doesn't ask for a large enough n, we tell them EOF?  Is
> this a bug/error on the user's part?
>

I am not sure this is a bug on the user part.
I changed to return ERANGE now.




>
> > From 0407977d44cc3cbcc8c24de1dca67676b6b6189f Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 20:11:28 -0800
> > Subject: Added test for devarch MSR file
>
> > --- /dev/null
> > +++ b/user/utest/devarch_file_test.c
> > @@ -0,0 +1,69 @@
> > +
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <unistd.h>
> > +#include <fcntl.h>
> > +#include <utest/utest.h>
> > +
> > +TEST_SUITE("DEVARCH");
> > +
> > +/* <--- Begin definition of test cases ---> */
> > +
> > +static bool test_msr(void)
> > +{
> > +     static const int max_cpus = 256;
>
> 256 should be MAX_NUM_CORES, or MAX_NUM_CORES + 1 if you want a buffer.
> To get the symbol, #include <ros/arch/arch.h>.
>

OK.




>
> > +     int fd = open("#arch/msr", O_RDWR);
> > +     uint64_t *values = malloc(max_cpus * sizeof(uint64_t));
> > +     uint64_t tmp = 0;
> > +     ssize_t n;
> > +
> > +     UT_ASSERT_M("Failed to open MSR device file (#arch/msr)", fd >= 0);
> > +     UT_ASSERT_M("Failed to allocated MSR values memory", values);
> > +
> > +     n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x10);
> > +     UT_ASSERT_M("Failed to read MSR values from 0x10", n > 0);
> > +
> > +     n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x309);
> > +     UT_ASSERT_M("Failed to read MSR values from 0x309", n > 0);
> > +
> > +     n = pwrite(fd, &tmp, sizeof(uint64_t), 0x309);
> > +     UT_ASSERT_M("Failed to write MSR values to 0x309",
> > +                             n == sizeof(uint64_t));
>
> Will these tests always be passing?  What are those MSRs, and why do we
> allow
> the user to read / write them?
>

The 0x309 is a counter value register. Well, if we need to test write(), we
need to write somewhere.
Here we write back the same value, but yes, it could be racy if other code
does that.



> +/* White list entries needs to be ordered by start address, and never
> overlap.
> > + */
> > +static const struct address_range msr_rd_wlist[] = {
> > +     ADDRESS_RANGE(0x00000000, 0xffffffff),
> > +};
>
> I like the use of a whitelist, but we need to actually pick what we want
> userspace to be able to read.  They could give us a garbage MSR and
> probably
> trigger a GPF or something nasty.
>

Basically, MSR varies from CPU family to CPU family.
In Linux, if you have permissions, you can read/write whatever you want.
Linux read msr code uses exception tables to catch errors.
Not sure what to do here.



>
> > +static const struct address_range msr_wr_wlist[] = {
> > +     ADDRESS_RANGE(0x000000c1, 0x000000c8),
> > +     ADDRESS_RANGE(0x00000186, 0x00000189),
> > +     ADDRESS_RANGE(0x00000199, 0x00000199),
> > +     ADDRESS_RANGE(0x00000309, 0x0000030b),
> > +     ADDRESS_RANGE(0x0000038d, 0x0000038d),
> > +     ADDRESS_RANGE(0x0000038f, 0x00000391),
> > +};
>
> What MSRs are these?  Given the sensitivity of this stuff, it'd be nice to
> use
> some sort of #define / name for MSRs instead of magic numbers.  I like the
> address range, but we lose out on explicitly marking what MSRs we are
> allowing.
> I think using real names (e.g. MSR_FOOBAR) and also using the range should
> be
> okay, so long as we only make ranges of similar, contiguous MSRs.
>

Ditto.
The ranges also, depends on the number of cores.
We could have the ranges empty, and populate them with boot cmdline options.
Keep in mind that I am planning not to use the MSR device for userspace
perf anymore.
Needs discussion on the separate email I sent.


Rebased branch according to comments so far. Missing some pending ...

-- 
You received this message because you are subscribed to the Google Groups 
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [akaros] Code Review - Add MSR device to devarch

Reply via email to