Looking at how linux does it, it's
/dev/cpu/#/msr, and you do reads/writes at offset for the msr. Further,
the offset is interpreted: a true byte offset into the "MSR address space"
would shift the msr # left 3 bits, since each MSR is 8 bytes, but they just
do the same thing we all do, and just use the offset as an offset into "msr
space", e.g. reading msr 58 it does a pread at offset 58, and you get eight
bytes.
So to read all the msrs you have a pread per core :-(. But if we had an iov
call that let us specify offset per iov then we could have a gather of all
the msrs in the kernel that we want for all cores in one vector read, and
we could even do crazy stuff like read multiple msrs per core for some set
of cores into an array ... I have a strange attraction for scatter/gather
IO. I.e. you could do this:
d0 = open("/dev/cpu/0/msr" ...)
d1 = open("/dev/cpu/1/msr" ...)
vread({d0, &buf[0], 16, 3a}, {d1, &buf[16], 16, 3a}) to read 3a and 3b in
cores 0 and 1
ron
On Mon, Nov 23, 2015 at 12:43 PM Barret Rhoden <[email protected]> wrote:
> Hi -
>
> On 2015-11-09 at 07:01 "'Davide Libenzi' via Akaros"
> <[email protected]> wrote:
> > This adds the "msr" device to devarch, allowing userspace software
> > (ie, CPU counters programming machinery) to interact with MSRs.
> > I also cleaned up that file a bit, dropping a few things that were
> > unused anywhere (there are some more in there).
> >
> >
> > https://github.com/brho/akaros/compare/master...dlibenzi:devarch_msr
> >
> >
> > The following changes since commit
> > 1165c2bda44b7f1fb3b776c0dc5b0fb4dd499961:
> >
> > Add networking unit tests (2015-11-03 12:00:38 -0500)
> >
> > are available in the git repository at:
> >
> > [email protected]:dlibenzi/akaros devarch_msr
> >
> > for you to fetch changes up to
> > bd69792de3d53f08b7b6e15de684b22744c52f62:
> >
> > Added test for devarch MSR file (2015-11-09 06:49:04 -0800)
>
> Comments below:
>
> > From b8425da6e7813b9f0c6cb2e1ebe0124236e9eeba Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 11:25:39 -0800
> > Subject: Added completion data structure
> >
> > diff --git a/kern/src/completion.c b/kern/src/completion.c
> > new file mode 100644
> > index 000000000000..40e922ab8120
> > --- /dev/null
> > +++ b/kern/src/completion.c
> > @@ -0,0 +1,30 @@
> > +/* Copyright (c) 2015 Google Inc
> > + * Davide Libenzi <[email protected]>
> > + * See LICENSE for details.
> > + */
> > +
> > +#include <kthread.h>
> > +#include <completion.h>
> > +
> > +void completion_init(struct completion *comp, int count)
> > +{
> > + cv_init(&comp->cv);
>
> If you want to signal completion from IRQ context, then we'll need these
> CVs to
> be irqsave. In that case, you'll need to use the irqsave variants for
> most of
> the CV operations (e.g. cv_lock_irqsave). If you don't, you'll run the
> risk of
> deadlocking, and CONFIG_SPINLOCK_DEBUG will complain.
>
> > + comp->count = count;
> > +}
> > +
> > +void completion_complete(struct completion *comp, int how_much)
> > +{
> > + cv_lock(&comp->cv);
> > + comp->count -= how_much;
> > + if (comp->count <= 0)
> > + __cv_broadcast(&comp->cv);
>
> Would it be a bug for comp->count to ever be less than 0? (Depends on how
> this
> is used).
>
> > + cv_unlock(&comp->cv);
> > +}
> > +
> > +void completion_wait(struct completion *comp)
> > +{
> > + cv_lock(&comp->cv);
> > + if (comp->count > 0)
> > + cv_wait(&comp->cv);
>
> This is fine as is.
>
> As an FYI, if you want to be a little faster, you can use
> cv_wait_and_unlock,
> then immediately return (careful of IRQs). And/or you could do a racy
> read of
> comp->count to see if it's already 0 without grabbing the lock.
>
> > + cv_unlock(&comp->cv);
> > +}
> > --
> > 2.6.0.rc2.230.g3dd15c0
> >
> > From 80128f764bdc236540355df797bf3fc550b93fb7 Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 11:27:01 -0800
> > Subject: Added CPU set data structure
>
> Do we want to call these cpu_sets or core_sets? In general, we've
> referred to
> logical processors as 'cores' or 'pcores' throughout the codebase. Though
> we
> also have been a little inconsistent with it.
>
> > diff --git a/kern/include/cpu_set.h b/kern/include/cpu_set.h
>
> > +#define CPUSET_INT_BITS (sizeof(cpu_set_int_t) * 8)
> > +
> > +typedef uint64_t cpu_set_int_t;
> > +
> > +struct cpu_set {
> > + cpu_set_int_t cpus[(MAX_NUM_CORES + CPUSET_INT_BITS - 1) /
> CPUSET_INT_BITS];
>
> We have a helper macro for the array size calculation:
> cpus[DIV_ROUND_UP(MAX_NUM_CORES, CPUSET_INT_BITS)];
>
> > +static inline void cpu_set_setcpu(struct cpu_set *cset, unsigned int
> cpuno)
> > +{
> > + cset->cpus[cpuno / CPUSET_INT_BITS] |= 1 << (cpuno %
> CPUSET_INT_BITS);
> > +}
>
> This stuff all looks correct. But we should use bitops for it instead of
> rolling our own. We brought in many of Linux's bitops (k/i/bitops.h,
> k/a/x/bitops.h). If there's a general-purpose bitop we're missing that you
> need, then we should add it to bitops.h
>
>
> > From 6781115d82ce87a64894ec773b0d69671ebb84ed Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 11:33:56 -0800
> > Subject: Added APIs to read and write MSR values on multiple CPUs
> >
>
> > --- a/kern/arch/x86/devarch.c
> > +++ b/kern/arch/x86/devarch.c
>
> > +uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> > + size_t *nvalues)
> > +{
> > + int cpu = core_id();
> > + struct smp_read_values srv;
> > +
> > + srv.values = kzmalloc(num_cores * sizeof(*srv.values),
> KMALLOC_WAIT);
> > + if (srv.values) {
>
> With KMALLOC_WAIT, kmalloc always succeeds.
>
> > + completion_init(&srv.comp, cpu_set_remote_count(cset));
> > + for (int i = 0; i < num_cores; i++) {
> > + if (cpu_set_getcpu(cset, i)) {
> > + if (i == cpu)
> > + srv.values[i] = read_msr(addr);
> > + else
> > + send_kernel_message(i,
> cpu_read_msr, (long) &srv, addr, 0,
> > +
> KMSG_ROUTINE);
> > + }
> > + }
> > + completion_wait(&srv.comp);
>
> This whole block should be done with smp_do_in_cpus.
>
> > + *nvalues = num_cores;
> > + }
> > +
> > + return srv.values;
> > +}
> > +
> > +static void cpu_write_msr(uint32_t srcid, long a0, long a1, long a2)
> > +{
> > + struct completion *comp = (struct completion *) a0;
> > +
> > + write_msr((uint32_t) a1, (uint64_t) a2);
> > + completion_complete(comp, 1);
> > +}
> > +
> > +void cpuset_write_msr(const struct cpu_set *cset, uint32_t addr,
> uint64_t value)
> > +{
> > + int cpu = core_id();
> > + struct completion comp;
> > +
> > + completion_init(&comp, cpu_set_remote_count(cset));
> > + for (int i = 0; i < num_cores; i++) {
> > + if (cpu_set_getcpu(cset, i)) {
> > + if (i == cpu)
> > + write_msr(addr, value);
> > + else
> > + send_kernel_message(i, cpu_write_msr,
> (long) &comp, addr, value,
> > +
> KMSG_ROUTINE);
> > + }
> > + }
> > + completion_wait(&comp);
>
> This whole block should be done with smp_do_in_cpus.
>
> > From 008c4762727a39ad8e5c5dd2265317445ddb86bd Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 16:44:41 -0800
> > Subject: Cleaned up devarch.c code to remove unused code
>
> > -static char *devname(void)
> > -{
> > - return archdevtab.name;
> > -}
>
> Not a big deal, but the reason I had this devname() helper is that many
> devices
> borrow code from each other. devarch only used it once, but others use
> devname
> a bunch of times. When that code gets reused for a new device, we can
> leave
> devname as is. e.g. copying error messages around like:
> set_error(ENOSYS, "Can't tap #%s file type %d", devname(), type);
>
>
> > From ca1462749dbdd1c4d8fb34973d42da5a3972aebe Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 17:49:16 -0800
> > Subject: Plugged MSR read and write APIs into devarch MSR file
>
> > --- a/kern/arch/x86/devarch.c
> > +++ b/kern/arch/x86/devarch.c
> > @@ -19,12 +19,15 @@
> > #include <error.h>
> > #include <cpio.h>
> > #include <pmap.h>
> > +#include <umem.h>
> > #include <smp.h>
> > #include <ip.h>
> > #include <time.h>
> > #include <cpu_set.h>
> > #include <completion.h>
> >
> > +#define REAL_MEM_SIZE (1024 * 1024)
> > +
> > struct io_map {
> > struct io_map *next;
> > int reserved;
> > @@ -48,7 +51,7 @@ enum {
> > Qiow,
> > Qiol,
> > Qgdb,
> > - Qmem,
> > + Qrealmem,
>
> So in the last cleanup patch you removed realmem and make it read up to
> max_paddr bytes. In this patch, you undid that. Ideally, the parts of
> this patch that made that change would be squashed into the previous
> patch.
>
> You can do this with git rebase -i. For this commit, choose "e" (edit),
> then:
> $ git reset HEAD^ # unstage this entire commit's changes
> $ git add -p # selectively pick what should be squashed
> $ git commit --amend # push those changes into the previous commit
> or
> git commit -m WIP-squash # make a separate commit, fix it up later
> $ git add -p # add everything for *this* commit
> $ git commit -c ca14627 # commit with the original message
> $ git rebase --continue
>
> > uint64_t *cpuset_read_msr(const struct cpu_set *cset, uint32_t addr,
> > size_t *nvalues)
> > {
> > - int cpu = core_id();
> > struct smp_read_values srv;
> >
> > - srv.values = kzmalloc(num_cores * sizeof(*srv.values),
> KMALLOC_WAIT);
> > + srv.addr = addr;
> > + srv.values = kzmalloc(num_cores * sizeof(*srv.values), 0);
> > if (srv.values) {
> > - completion_init(&srv.comp, cpu_set_remote_count(cset));
> > - for (int i = 0; i < num_cores; i++) {
> > - if (cpu_set_getcpu(cset, i)) {
> > - if (i == cpu)
> > - srv.values[i] = read_msr(addr);
> > - else
> > - send_kernel_message(i,
> cpu_read_msr, (long) &srv, addr, 0,
> > -
> KMSG_ROUTINE);
> > - }
> > - }
> > - completion_wait(&srv.comp);
> > + smp_do_in_cpus(cset, cpu_read_msr, &srv);
>
> This is exactly what I said to do for the earlier patch. I'm glad you
> did it, but I would be even happier if you fixed up the git history so
> that it was like that originally.
>
> It may seem trivial, but not only does it keep the git history clean (so
> that we can easily see how the code evolved), but it saves me time. I
> was making comments on code that had already been fixed.
>
> So now it looks like there's two parts of this commit that should be in
> other commits. That's something that can be fixed with multiple runs of
> git rebase -i, as described above.
>
> > @@ -397,10 +386,27 @@ static long archread(struct chan *c, void *a, long
> n, int64_t offset)
> > return n;
> > case Qioalloc:
> > break;
> > - case Qmem:
> > - printk("readmem %p %p %p %p %p\n", offset, a, n,
> KADDR(0),
> > - max_paddr);
> > - return readmem(offset, a, n, KADDR(0), max_paddr);
> > + case Qrealmem:
> > + return readmem(offset, a, n, KADDR(0),
> REAL_MEM_SIZE);
>
> This change (realmen) is right next to Qmsr. When you try to fix this
> with a rebase and do a git add -p, git won't be able to figure out that
> these are separate. You'll have to specify 'e' and manually edit the
> hunk when you git add. It should be pretty straightforward.
>
> > + case Qmsr:
> > + cpu_set_init(&cset);
> > + cpu_set_fill_available(&cset);
> > + values = cpuset_read_msr(&cset, (uint32_t) offset,
> &nvalues);
> > + if (values) {
> > + if (n >= nvalues * sizeof(uint64_t)) {
> > + if (memcpy_to_user_errno(current,
> a, values,
> > +
> nvalues * sizeof(uint64_t)))
> > + n = -1;
> > + else
> > + n = nvalues *
> sizeof(uint64_t);
> > + } else {
> > + n = 0;
>
> So if the user doesn't ask for a large enough n, we tell them EOF? Is
> this a bug/error on the user's part?
>
> > From 0407977d44cc3cbcc8c24de1dca67676b6b6189f Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Sun, 8 Nov 2015 20:11:28 -0800
> > Subject: Added test for devarch MSR file
>
> > --- /dev/null
> > +++ b/user/utest/devarch_file_test.c
> > @@ -0,0 +1,69 @@
> > +
> > +#include <sys/stat.h>
> > +#include <sys/types.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <unistd.h>
> > +#include <fcntl.h>
> > +#include <utest/utest.h>
> > +
> > +TEST_SUITE("DEVARCH");
> > +
> > +/* <--- Begin definition of test cases ---> */
> > +
> > +static bool test_msr(void)
> > +{
> > + static const int max_cpus = 256;
>
> 256 should be MAX_NUM_CORES, or MAX_NUM_CORES + 1 if you want a buffer.
> To get the symbol, #include <ros/arch/arch.h>.
>
> > + int fd = open("#arch/msr", O_RDWR);
> > + uint64_t *values = malloc(max_cpus * sizeof(uint64_t));
> > + uint64_t tmp = 0;
> > + ssize_t n;
> > +
> > + UT_ASSERT_M("Failed to open MSR device file (#arch/msr)", fd >= 0);
> > + UT_ASSERT_M("Failed to allocated MSR values memory", values);
> > +
> > + n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x10);
> > + UT_ASSERT_M("Failed to read MSR values from 0x10", n > 0);
> > +
> > + n = pread(fd, values, max_cpus * sizeof(uint64_t), 0x309);
> > + UT_ASSERT_M("Failed to read MSR values from 0x309", n > 0);
> > +
> > + n = pwrite(fd, &tmp, sizeof(uint64_t), 0x309);
> > + UT_ASSERT_M("Failed to write MSR values to 0x309",
> > + n == sizeof(uint64_t));
>
> Will these tests always be passing? What are those MSRs, and why do we
> allow
> the user to read / write them?
>
>
> > From e8cb42302b1ab90d6457356e3cc1063a499d3665 Mon Sep 17 00:00:00 2001
> > From: Davide Libenzi <[email protected]>
> > Date: Mon, 9 Nov 2015 16:33:48 -0800
> > Subject: Added whitelisting to MSR read/write code
>
> > --- a/kern/arch/x86/devarch.c
> > +++ b/kern/arch/x86/devarch.c
> > @@ -82,7 +80,19 @@ static struct dirtab archdir[Qmax] = {
> > {"realmem", {Qrealmem, 0}, 0, 0444},
> > {"msr", {Qmsr, 0}, 0, 0666},
> > };
> > -
> > +/* White list entries needs to be ordered by start address, and never
> overlap.
> > + */
> > +static const struct address_range msr_rd_wlist[] = {
> > + ADDRESS_RANGE(0x00000000, 0xffffffff),
> > +};
>
> I like the use of a whitelist, but we need to actually pick what we want
> userspace to be able to read. They could give us a garbage MSR and
> probably
> trigger a GPF or something nasty.
>
> > +static const struct address_range msr_wr_wlist[] = {
> > + ADDRESS_RANGE(0x000000c1, 0x000000c8),
> > + ADDRESS_RANGE(0x00000186, 0x00000189),
> > + ADDRESS_RANGE(0x00000199, 0x00000199),
> > + ADDRESS_RANGE(0x00000309, 0x0000030b),
> > + ADDRESS_RANGE(0x0000038d, 0x0000038d),
> > + ADDRESS_RANGE(0x0000038f, 0x00000391),
> > +};
>
> What MSRs are these? Given the sensitivity of this stuff, it'd be nice to
> use
> some sort of #define / name for MSRs instead of magic numbers. I like the
> address range, but we lose out on explicitly marking what MSRs we are
> allowing.
> I think using real names (e.g. MSR_FOOBAR) and also using the range should
> be
> okay, so long as we only make ranges of similar, contiguous MSRs.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "Akaros" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
--
You received this message because you are subscribed to the Google Groups
"Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
For more options, visit https://groups.google.com/d/optout.