from:"\"Richard W.M. Jones\""

Re: INFO: task hung in nbd_ioctl

2019-10-17 Thread Richard W.M. Jones

On Thu, Oct 17, 2019 at 10:47:59AM -0500, Mike Christie wrote:
> On 10/17/2019 09:03 AM, Richard W.M. Jones wrote:
> > On Tue, Oct 01, 2019 at 04:19:25PM -0500, Mike Christie wrote:
> >> Hey Josef and nbd list,
> >>
> >> I had a question about if there are any socket family restrictions for nbd?
> > 
> > In normal circumstances, in userspace, the NBD protocol would only be
> > used over AF_UNIX or AF_INET/AF_INET6.
> > 
> > There's a bit of confusion because netlink is used by nbd-client to
> > configure the NBD device, setting things like block size and timeouts
> > (instead of ioctl which is deprecated).  I think you don't mean this
> > use of netlink?
> 
> I didn't. It looks like it is just a bad test.
> 
> For the automated test in this thread the test created a AF_NETLINK
> socket and passed it into the NBD_SET_SOCK ioctl. That is what got used
> for the NBD_DO_IT ioctl.
> 
> I was not sure if the test creator picked any old socket and it just
> happened to pick one nbd never supported, or it was trying to simulate
> sockets that did not support the shutdown method.
> 
> I attached the automated test that got run (test.c).

I'd say it sounds like a bad test, but I'm not familiar with syzkaller
nor how / from where it generates these tests.  Did someone report a
bug and then syzkaller wrote this test?

Rich.

> > 
> >> The bug here is that some socket familys do not support the
> >> sock->ops->shutdown callout, and when nbd calls kernel_sock_shutdown
> >> their callout returns -EOPNOTSUPP. That then leaves recv_work stuck in
> >> nbd_read_stat -> sock_xmit -> sock_recvmsg. My patch added a
> >> flush_workqueue call, so for socket familys like AF_NETLINK in this bug
> >> we hang like we see below.
> >>
> >> I can just remove the flush_workqueue call in that code path since it's
> >> not needed there, but it leaves the original bug my patch was hitting
> >> where we leave the recv_work running which can then result in leaked
> >> resources, or possible use after free crashes and you still get the hang
> >> if you remove the module.
> >>
> >> It looks like we have used kernel_sock_shutdown for a while so I thought
> >> we might never have supported sockets that did not support the callout.
> >> Is that correct? If so then I can just add a check for this in
> >> nbd_add_socket and fix that bug too.
> > 
> > Rich.
> > 
> >> On 09/30/2019 05:39 PM, syzbot wrote:
> >>> Hello,
> >>>
> >>> syzbot found the following crash on:
> >>>
> >>> HEAD commit:bb2aee77 Add linux-next specific files for 20190926
> >>> git tree:   linux-next
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=13385ca360
> >>> kernel config:  https://syzkaller.appspot.com/x/.config?x=e60af4ac5a01e964
> >>> dashboard link:
> >>> https://syzkaller.appspot.com/bug?extid=24c12fa8d218ed26011a
> >>> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> >>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12abc2a360
> >>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11712c0560
> >>>
> >>> The bug was bisected to:
> >>>
> >>> commit e9e006f5fcf2bab59149cb38a48a4817c1b538b4
> >>> Author: Mike Christie 
> >>> Date:   Sun Aug 4 19:10:06 2019 +
> >>>
> >>> nbd: fix max number of supported devs
> >>>
> >>> bisection log:  
> >>> https://syzkaller.appspot.com/x/bisect.txt?x=1226f3c560
> >>> final crash:
> >>> https://syzkaller.appspot.com/x/report.txt?x=1126f3c560
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=1626f3c560
> >>>
> >>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >>> Reported-by: syzbot+24c12fa8d218ed260...@syzkaller.appspotmail.com
> >>> Fixes: e9e006f5fcf2 ("nbd: fix max number of supported devs")
> >>>
> >>> INFO: task syz-executor390:8778 can't die for more than 143 seconds.
> >>> syz-executor390 D27432  8778   8777 0x4004
> >>> Call Trace:
> >>>  context_switch kernel/sched/core.c:3384 [inline]
> >>>  __schedule+0x828/0x1c20 kernel/sched/core.c:4065
> >>>  schedule+0xd9/0x260 kernel/sched/core.c:4132
> >>>  schedule_timeout+0x717/0xc50 kernel/time/timer.c:1871
> >>>  do_wai

Re: nbd, nbdkit, loopback mounts and memory management

2019-02-17 Thread Richard W.M. Jones

On Mon, Feb 18, 2019 at 12:15:14AM +0100, Pavel Machek wrote:
> But Shaun reported it happened somehow often for them, so he might
> have a practical test case... better than my theories :-).

Yes, certainly not saying this isn't a problem.

I think the good news is the fix seems quite easy, ie. to add mlockall
and adjust the OOM killer score, as is done currently in the client:

https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L867-L885
https://github.com/NetworkBlockDevice/nbd/blob/3969c3f81a11a483f267a55ed6665d260dc9e1d2/nbd-client.c#L1219

For now I have added a note in the TODO file to follow up in case we
get a test case or reports of a problem:

https://github.com/libguestfs/nbdkit/commit/72e0afe2e280d895f68941677fafa559ddc3bb0d

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: nbd, nbdkit, loopback mounts and memory management

2019-02-17 Thread Richard W.M. Jones

So not to dispute that this could be a bug, but I couldn't cause a
deadlock.  I wonder if you can see something wrong with my method?

*** Set up ***

- kernel 5.0.0-0.rc3.git0.1.fc30.x86_64
- nbd-client 3.19-1.fc30
- nbdkit 1.11.5 (git commit ef9d1978ce28)

Baremetal machine was booted with mem=2G to artificially limit the
RAM.  The machine has 64G of swap.

# free -m
  totalusedfree  shared  buff/cache   available
Mem:   1806 3291383   0  931357
Swap: 65535 179   65356

*** Method ***

I started nbdkit as a 4G RAM disk:

  ./nbdkit memory size=4G

This is implemented as a sparse array with a 2 level page table, and
should allocate (using malloc) every time a new area of the disk is
written to.  Exact implementation is here:
https://github.com/libguestfs/nbdkit/tree/master/common/sparse

I started nbd-client using the -swap option which uses
mlockall(MCL_FUTURE) to lock the client into RAM.

  nbd-client -b 512 -swap localhost /dev/nbd0

I then created a filesystem on the RAM disk, mounted it, and copied a
3G file into it.  I tried this various ways, but the variation I was
eventually happy with was:

  mke2fs /dev/nbd0
  mount /dev/nbd0 /tmp/mnt

  dd if=/dev/zero of=/tmp/big bs=1M count=3000
  cp /tmp/big /tmp/mnt/big

I couldn't get any kind of deadlock or failure in this test.

(Note that if you repeat the test several times, in theory you could
delete the file and fstrim the filesystem, but when I was testing it
to be sure I unmounted everything and killed and restarted nbdkit
between each test.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Re: nbd, nbdkit, loopback mounts and memory management

2019-02-15 Thread Richard W.M. Jones

On Fri, Feb 15, 2019 at 10:41:26PM +, Richard W.M. Jones wrote:
> On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> > Hi!
> > 
> > I watched fosdem talk about
> > nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> > of warning: I'm not sure using it read-write on localhost is safe.
> > 
> > In particular, user application could create a lot of dirty data
> > quickly. If there's not enough memory for nbdkit (or nbd-client or
> > nbd-server), you might get a deadlock.
> 
> Thanks for the kind words about the talk.  I've added Wouter Verhelst
> & the NBD mailing list to CC.  Although I did the talk because the
> subject is interesting, how I actually use nbdkit / NBD is to talk to
> qemu and that's where I have most experience and where we (Red Hat)
> use it in production systems.
> 
> However in January I spent a lot of time exercising the NBD loop-mount
> + nbdkit case using fio in order to find contention / bottlenecks in
> our use of threads and locks.  I didn't notice any particular problems
> then, but it's possible my testing wasn't thorough enough.  Or that
> fio only creates small numbers of dirty pages (because of locality in
> its access patterns I guess?)
> 
> When you say it's not safe, what could happen?  What would we observe
> if it was going wrong?

Reading more carefully I see you said we'd observe a deadlock.  I
didn't see that, but again my testing of this wouldn't have been very
thorough.  When I have some time I'll try creating / spooling huge
files into an NBD loop mount to see if I can cause a deadlock.

Thanks, Rich.

> > Also note that nbd.txt in Documentation/blockdev/ points to
> > sourceforge; it should probably point to
> > https://github.com/NetworkBlockDevice/nbd ?
> 
> Wouter should be able to say what the correct link should be.
> 
> Rich.
> 
> -- 
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines.  Supports shell scripting,
> bindings from many languages.  http://libguestfs.org

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: nbd, nbdkit, loopback mounts and memory management

2019-02-15 Thread Richard W.M. Jones

On Fri, Feb 15, 2019 at 08:19:54PM +0100, Pavel Machek wrote:
> Hi!
> 
> I watched fosdem talk about
> nbdkit... https://www.youtube.com/watch?v=9E5A608xJG0 . Nice. But word
> of warning: I'm not sure using it read-write on localhost is safe.
> 
> In particular, user application could create a lot of dirty data
> quickly. If there's not enough memory for nbdkit (or nbd-client or
> nbd-server), you might get a deadlock.

Thanks for the kind words about the talk.  I've added Wouter Verhelst
& the NBD mailing list to CC.  Although I did the talk because the
subject is interesting, how I actually use nbdkit / NBD is to talk to
qemu and that's where I have most experience and where we (Red Hat)
use it in production systems.

However in January I spent a lot of time exercising the NBD loop-mount
+ nbdkit case using fio in order to find contention / bottlenecks in
our use of threads and locks.  I didn't notice any particular problems
then, but it's possible my testing wasn't thorough enough.  Or that
fio only creates small numbers of dirty pages (because of locality in
its access patterns I guess?)

When you say it's not safe, what could happen?  What would we observe
if it was going wrong?

> Also note that nbd.txt in Documentation/blockdev/ points to
> sourceforge; it should probably point to
> https://github.com/NetworkBlockDevice/nbd ?

Wouter should be able to say what the correct link should be.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: [PATCH v2] libata: Drop SanDisk SD7UB3Q*G1001 NOLPM quirk

2018-05-31 Thread Richard W.M. Jones

On Thu, May 31, 2018 at 01:21:07PM +0200, Hans de Goede wrote:
> Commit 184add2ca23c ("libata: Apply NOLPM quirk for SanDisk
> SD7UB3Q*G1001 SSDs") disabled LPM for SanDisk SD7UB3Q*G1001 SSDs.
> 
> This has lead to several reports of users of that SSD where LPM
> was working fine and who know have a significantly increased idle
> power consumption on their laptops.
> 
> Likely there is another problem on the T450s from the original
> reporter which gets exposed by the uncore reaching deeper sleep
> states (higher PC-states) due to LPM being enabled. The problem as
> reported, a hardfreeze about once a day, already did not sound like
> it would be caused by LPM and the reports of the SSD working fine
> confirm this. The original reporter is ok with dropping the quirk.
> 
> A X250 user has reported the same hard freeze problem and for him
> the problem went away after unrelated updates, I suspect some GPU
> driver stack changes fixed things.
> 
> TL;DR: The original reporters problem were triggered by LPM but not
> an LPM issue, so drop the quirk for the SSD in question.

As the reporter of the original issue, I agree with Hans's
analysis above, so ACK from me.

Rich.

> BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1583207
> Cc: sta...@vger.kernel.org
> Cc: Richard W.M. Jones 
> Cc: Lorenzo Dalrio 
> Reported-by: Lorenzo Dalrio 
> Signed-off-by: Hans de Goede 
> ---
> Changes in v2:
> -Rebase on 4.17-rc7
> ---
>  drivers/ata/libata-core.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 346b163f6e89..9bfd2f7e4542 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -4557,9 +4557,6 @@ static const struct ata_blacklist_entry 
> ata_device_blacklist [] = {
>   { "SAMSUNG MZMPC128HBFU-000MV", "CXM14M1Q", ATA_HORKAGE_NOLPM, },
>   { "SAMSUNG SSD PM830 mSATA *",  "CXM13D1Q", ATA_HORKAGE_NOLPM, },
>  
> - /* Sandisk devices which are known to not handle LPM well */
> - { "SanDisk SD7UB3Q*G1001",  NULL,   ATA_HORKAGE_NOLPM, },
> -
>   /* devices that don't properly handle queued TRIM commands */
>   { "Micron_M500IT_*","MU01", ATA_HORKAGE_NO_NCQ_TRIM |
>   ATA_HORKAGE_ZERO_AFTER_TRIM, },
> -- 
> 2.17.0

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-10 Thread Richard W.M. Jones

On Fri, Aug 11, 2017 at 12:41:47AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 10, 2017 at 10:35:11PM +0100, Richard W.M. Jones wrote:
> > On Fri, Aug 11, 2017 at 12:31:44AM +0300, Michael S. Tsirkin wrote:
> > > Then we probably should fail probe if vq size is too small.
> > 
> > What does this mean?
> > 
> > Rich.
> 
> We must prevent driver from submitting s/g lists > vq size to device.

Fair point.  I would have thought (not knowing anything about how this
works in the kernel) that the driver should be able to split up
scatter-gather lists if they are larger than what the driver supports?

Anyway the commit message is derived from what Paolo told me on IRC,
so let's see what he says.

Rich.

> Either tell linux to avoid s/g lists that are too long, or
> simply fail request if this happens, or refuse to attach driver to device.
> 
> Later option would look something like this within probe:
> 
> for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
>   if (vqs[i]->num < MAX_SG_USED_BY_LINUX)
>   goto err;
> 
> 
> I don't know what's MAX_SG_USED_BY_LINUX though.
> 
> > -- 
> > Richard Jones, Virtualization Group, Red Hat 
> > http://people.redhat.com/~rjones
> > Read my programming and virtualization blog: http://rwmj.wordpress.com
> > virt-builder quickly builds VMs from scratch
> > http://libguestfs.org/virt-builder.1.html

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-10 Thread Richard W.M. Jones

On Fri, Aug 11, 2017 at 12:31:44AM +0300, Michael S. Tsirkin wrote:
> Then we probably should fail probe if vq size is too small.

What does this mean?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: [PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-10 Thread Richard W.M. Jones

On Fri, Aug 11, 2017 at 12:21:16AM +0300, Michael S. Tsirkin wrote:
> On Thu, Aug 10, 2017 at 05:40:34PM +0100, Richard W.M. Jones wrote:
> > If using indirect descriptors, you can make the total_sg as large as
> > you want.
> 
> That would be a spec violation though, even if it happens to
> work on current QEMU.
> 
> The spec says:
>   A driver MUST NOT create a descriptor chain longer than the Queue Size 
> of the device.
> 
> What prompted this patch?
> Do we ever encounter this situation?

This patch is needed because the following (2/2) patch will trigger
that BUG_ON if I set virtqueue_size=64 or any smaller value.

The precise backtrace is below.

Rich.

[4.029510] [ cut here ]
[4.030127] kernel BUG at drivers/virtio/virtio_ring.c:299!
[4.030834] invalid opcode:  [#1] SMP
[4.031340] Modules linked in: libcrc32c crc8 crc7 crc4 crc_itu_t virtio_pci 
virtio_mmio virtio_input virtio_balloon virtio_scsi nd_pmem nd_btt virtio_net 
virtio_crypto crypto_engine virtio_console virtio_rng virtio_blk virtio_ring 
virtio nfit crc32_generic crct10dif_pclmul crc32c_intel crc32_pclmul
[4.034606] CPU: 0 PID: 1 Comm: init Not tainted 4.13.0-rc4+ #100
[4.035354] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
[4.036770] task: 9a859e243300 task.stack: a731c00cc000
[4.037506] RIP: 0010:virtqueue_add_sgs+0x23d/0x460 [virtio_ring]
[4.038250] RSP: :a731c00cf6e0 EFLAGS: 00010097
[4.038898] RAX:  RBX: 0011 RCX: dd0680646c40
[4.039762] RDX:  RSI: a731c00cf7b8 RDI: 9a85945c4d68
[4.040634] RBP: a731c00cf748 R08: 9a85945c4a78 R09: 01080020
[4.041508] R10: a731c00cf788 R11: 9a859b3d3120 R12: a731c00cf7d0
[4.042382] R13: a731c00cf7d0 R14: 0001 R15: 9a859b3f8200
[4.043248] FS:  () GS:9a859e60() 
knlGS:
[4.044232] CS:  0010 DS:  ES:  CR0: 80050033
[4.044942] CR2: 7fcff02e931c CR3: 1d23b000 CR4: 003406f0
[4.045815] DR0:  DR1:  DR2: 
[4.046684] DR3:  DR6: fffe0ff0 DR7: 0400
[4.047559] Call Trace:
[4.047876]  virtscsi_add_cmd+0x1c9/0x280 [virtio_scsi]
[4.048528]  virtscsi_kick_cmd+0x38/0x90 [virtio_scsi]
[4.049161]  virtscsi_queuecommand+0x104/0x280 [virtio_scsi]
[4.049875]  virtscsi_queuecommand_single+0x38/0x40 [virtio_scsi]
[4.050628]  scsi_dispatch_cmd+0xf9/0x390
[4.051128]  scsi_queue_rq+0x5e5/0x6f0
[4.051602]  blk_mq_dispatch_rq_list+0x1ff/0x3c0
[4.052175]  blk_mq_sched_dispatch_requests+0x199/0x1f0
[4.052824]  __blk_mq_run_hw_queue+0x11d/0x1b0
[4.053382]  __blk_mq_delay_run_hw_queue+0x8d/0xa0
[4.053972]  blk_mq_run_hw_queue+0x14/0x20
[4.054485]  blk_mq_sched_insert_requests+0x96/0x120
[4.055095]  blk_mq_flush_plug_list+0x19d/0x410
[4.055661]  blk_flush_plug_list+0xf9/0x2b0
[4.056182]  blk_finish_plug+0x2c/0x40
[4.056655]  __do_page_cache_readahead+0x32e/0x400
[4.057261]  filemap_fault+0x2fb/0x890
[4.057731]  ? filemap_fault+0x2fb/0x890
[4.058220]  ? find_held_lock+0x3c/0xb0
[4.058714]  ext4_filemap_fault+0x34/0x50
[4.059212]  __do_fault+0x1e/0x110
[4.059644]  __handle_mm_fault+0x6b2/0x1080
[4.060167]  handle_mm_fault+0x178/0x350
[4.060662]  __do_page_fault+0x26e/0x510
[4.061152]  trace_do_page_fault+0x9d/0x290
[4.061677]  do_async_page_fault+0x51/0xa0
[4.062189]  async_page_fault+0x28/0x30
[4.062667] RIP: 0033:0x7fcff030a24f
[4.063113] RSP: 002b:7ffefc2ad078 EFLAGS: 00010206
[4.063760] RAX: 7fcff02e931c RBX: 7fcff050f660 RCX: 7fcff02e935c
[4.064648] RDX: 0664 RSI:  RDI: 7fcff02e931c
[4.065519] RBP: 7ffefc2ad320 R08: 7fcff02e931c R09: 00027000
[4.066392] R10: 7fcff02e9980 R11: 0206 R12: 7ffefc2ad0b0
[4.067263] R13: 7ffefc2ad408 R14: 0002 R15: 87f0
[4.068135] Code: af 01 c7 45 c8 01 00 00 00 45 31 ed e9 9b fe ff ff 31 db 
48 83 7d d0 00 0f 85 3f fe ff ff 0f 0b 48 8b 7d b8 e8 e5 fd 22 de eb 8b <0f> 0b 
0f 0b 44 89 6d a8 45 89 f5 48 8b 45 a0 44 8d 70 01 48 83 
[4.070506] RIP: virtqueue_add_sgs+0x23d/0x460 [virtio_ring] RSP: 
a731c00cf6e0
[4.071434] ---[ end trace 02532659840e2a64 ]---


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

[PATCH v2 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-10 Thread Richard W.M. Jones

If using indirect descriptors, you can make the total_sg as large as
you want.  If not, BUG is too serious because the function later
returns -ENOSPC.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones 
Reviewed-by: Paolo Bonzini 
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..27cbc1eab868 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
}
 #endif
 
-   BUG_ON(total_sg > vq->vring.num);
BUG_ON(total_sg == 0);
 
head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 * buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free)
desc = alloc_indirect(_vq, total_sg, gfp);
-   else
+   else {
desc = NULL;
+   WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+   }
 
if (desc) {
/* Use a single buffer which doesn't continue */
-- 
2.13.1

[PATCH v2 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.

2017-08-10 Thread Richard W.M. Jones

Since switching to blk-mq as the default in commit 5c279bd9e406
("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as
much kernel memory.

qemu currently allocates a fixed 128 entry virtqueue.  can_queue
currently is set to 1024.  But with indirect descriptors, each command
in the queue takes 1 virtqueue entry, so the number of commands which
can be queued is equal to the length of the virtqueue.

Note I intend to send a patch to qemu to allow the virtqueue size to
be configured from the qemu command line.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones 
---
 drivers/scsi/virtio_scsi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..7c28e8d4955a 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -818,7 +818,6 @@ static struct scsi_host_template 
virtscsi_host_template_single = {
.eh_timed_out = virtscsi_eh_timed_out,
.slave_alloc = virtscsi_device_alloc,
 
-   .can_queue = 1024,
.dma_boundary = UINT_MAX,
.use_clustering = ENABLE_CLUSTERING,
.target_alloc = virtscsi_target_alloc,
@@ -839,7 +838,6 @@ static struct scsi_host_template 
virtscsi_host_template_multi = {
.eh_timed_out = virtscsi_eh_timed_out,
.slave_alloc = virtscsi_device_alloc,
 
-   .can_queue = 1024,
.dma_boundary = UINT_MAX,
.use_clustering = ENABLE_CLUSTERING,
.target_alloc = virtscsi_target_alloc,
@@ -972,6 +970,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
if (err)
goto virtscsi_init_failed;
 
+   shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;
-- 
2.13.1

[PATCH v2 0/2] virtio_scsi: Set can_queue based on size of virtqueue.

2017-08-10 Thread Richard W.M. Jones

v1 was here:

  https://lkml.org/lkml/2017/8/10/689

v1 -> v2:

  Remove .can_queue field from the templates.

Rich.

[PATCH 2/2] virtio: virtio_scsi: Set can_queue to the length of the virtqueue.

2017-08-10 Thread Richard W.M. Jones

Since switching to blk-mq as the default in commit 5c279bd9e406
("scsi: default to scsi-mq"), virtio-scsi LUNs consume about 10x as
much kernel memory.

qemu currently allocates a fixed 128 entry virtqueue.  can_queue
currently is set to 1024.  But with indirect descriptors, each command
in the queue takes 1 virtqueue entry, so the number of commands which
can be queued is equal to the length of the virtqueue.

Note I intend to send a patch to qemu to allow the virtqueue size to
be configured from the qemu command line.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones 
---
 drivers/scsi/virtio_scsi.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
if (err)
goto virtscsi_init_failed;
 
+   shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;
-- 
2.13.1

[PATCH 0/2] virtio_scsi: Set can_queue based on size of virtqueue.

2017-08-10 Thread Richard W.M. Jones

Earlier discussion:
https://lkml.org/lkml/2017/8/4/601
"Increased memory usage with scsi-mq"

Downstream bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1478201

[PATCH 1/2] virtio: Reduce BUG if total_sg > virtqueue size to WARN.

2017-08-10 Thread Richard W.M. Jones

If using indirect descriptors, you can make the total_sg as large as
you want.  If not, BUG is too serious because the function later
returns -ENOSPC.

Thanks Paolo Bonzini, Christoph Hellwig.

Signed-off-by: Richard W.M. Jones 
---
 drivers/virtio/virtio_ring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..27cbc1eab868 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
}
 #endif
 
-   BUG_ON(total_sg > vq->vring.num);
BUG_ON(total_sg == 0);
 
head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 * buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free)
desc = alloc_indirect(_vq, total_sg, gfp);
-   else
+   else {
desc = NULL;
+   WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+   }
 
if (desc) {
/* Use a single buffer which doesn't continue */
-- 
2.13.1

Re: Increased memory usage with scsi-mq

2017-08-10 Thread Richard W.M. Jones

OK this is looking a bit better now.

  With scsi-mq enabled:   175 disks
  virtqueue_size=64:  318 disks *
  virtqueue_size=16:  775 disks *
  With scsi-mq disabled: 1755 disks
* = new results

I also ran the whole libguestfs test suite with virtqueue_size=16
(with no failures shown).  As this tests many different disk I/O
operations, it gives me some confidence that things generally work.

Do you have any other comments about the patches?  I'm not sure I know
enough to write an intelligent commit message for the kernel patch.

Rich.


--- kernel patch ---

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
if (err)
goto virtscsi_init_failed;
 
+   shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5e1b548828e6..2d7509da9f39 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -296,7 +296,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
}
 #endif
 
-   BUG_ON(total_sg > vq->vring.num);
BUG_ON(total_sg == 0);
 
head = vq->free_head;
@@ -305,8 +304,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 * buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free)
desc = alloc_indirect(_vq, total_sg, gfp);
-   else
+   else {
desc = NULL;
+   WARN_ON_ONCE(total_sg > vq->vring.num && !vq->indirect);
+}
 
if (desc) {
/* Use a single buffer which doesn't continue */

--- qemu patch ---

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index eb639442d1..aadd99aad1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev,
 s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
 s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
 
-s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl);
-s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt);
+s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl);
+s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt);
 for (i = 0; i < s->conf.num_queues; i++) {
-s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd);
+s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd);
 }
 }
 
@@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, 
Error **errp)
 
 static Property virtio_scsi_properties[] = {
 DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 
1),
+DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, 
parent_obj.conf.virtqueue_size, 128),
 DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors,
   0x),
 DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index de6ae5a9f6..e30a92d3e7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 
 struct VirtIOSCSIConf {
 uint32_t num_queues;
+uint32_t virtqueue_size;
 uint32_t max_sectors;
 uint32_t cmd_per_lun;
 #ifdef CONFIG_VHOST_SCSI



-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

Re: Increased memory usage with scsi-mq

2017-08-10 Thread Richard W.M. Jones

On Thu, Aug 10, 2017 at 02:53:58PM +0200, Paolo Bonzini wrote:
> can_queue and cmd_per_lun are different.  can_queue should be set to the
> value of vq->vring.num where vq is the command virtqueue (the first one
> is okay if there's >1).
> 
> If you want to change it, you'll have to do so in QEMU.

Here are a couple more patches I came up with, the first to Linux,
the second to qemu.

There are a few problems ...

(1) Setting virtqueue_size to < 128 causes a BUG_ON failure in
virtio_ring.c:virtqueue_add at:

BUG_ON(total_sg > vq->vring.num);

I initially thought that I should also set cmd_per_lun to the same
value, but that doesn't help.  Is there some relationship between
virtqueue_size and other settings?

(2) Can/should the ctrl, event and cmd queue sizes be set to the same
values?  Or should ctrl & event be left at 128?

(3) It seems like it might be a problem that virtqueue_size is not
passed through the virtio_scsi_conf struct (which is ABI between the
kernel and qemu AFAICT and so not easily modifiable).  However I think
this is not a problem because virtqueue size is stored in the
Virtqueue Descriptor table, which is how the kernel gets the value.
Is that right?

Rich.


--- kernel patch ---

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..d6b4ff634c0d 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -972,6 +972,8 @@ static int virtscsi_probe(struct virtio_device *vdev)
if (err)
goto virtscsi_init_failed;
 
+   shost->can_queue = virtqueue_get_vring_size(vscsi->req_vqs[0].vq);
+
cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;


--- qemu patch ---

diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c
index eb639442d1..aadd99aad1 100644
--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ -867,10 +867,10 @@ void virtio_scsi_common_realize(DeviceState *dev,
 s->sense_size = VIRTIO_SCSI_SENSE_DEFAULT_SIZE;
 s->cdb_size = VIRTIO_SCSI_CDB_DEFAULT_SIZE;
 
-s->ctrl_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, ctrl);
-s->event_vq = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, evt);
+s->ctrl_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, ctrl);
+s->event_vq = virtio_add_queue(vdev, s->conf.virtqueue_size, evt);
 for (i = 0; i < s->conf.num_queues; i++) {
-s->cmd_vqs[i] = virtio_add_queue(vdev, VIRTIO_SCSI_VQ_SIZE, cmd);
+s->cmd_vqs[i] = virtio_add_queue(vdev, s->conf.virtqueue_size, cmd);
 }
 }
 
@@ -917,6 +917,7 @@ static void virtio_scsi_device_unrealize(DeviceState *dev, 
Error **errp)
 
 static Property virtio_scsi_properties[] = {
 DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 
1),
+DEFINE_PROP_UINT32("virtqueue_size", VirtIOSCSI, 
parent_obj.conf.virtqueue_size, 128),
 DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors,
   0x),
 DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h
index de6ae5a9f6..e30a92d3e7 100644
--- a/include/hw/virtio/virtio-scsi.h
+++ b/include/hw/virtio/virtio-scsi.h
@@ -48,6 +48,7 @@ typedef struct virtio_scsi_config VirtIOSCSIConfig;
 
 struct VirtIOSCSIConf {
 uint32_t num_queues;
+uint32_t virtqueue_size;
 uint32_t max_sectors;
 uint32_t cmd_per_lun;
 #ifdef CONFIG_VHOST_SCSI


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: Increased memory usage with scsi-mq

2017-08-10 Thread Richard W.M. Jones

On Wed, Aug 09, 2017 at 06:50:10PM +0200, Paolo Bonzini wrote:
> On 09/08/2017 18:01, Christoph Hellwig wrote:
> > On Mon, Aug 07, 2017 at 03:07:48PM +0200, Paolo Bonzini wrote:
> >> can_queue should depend on the virtqueue size, which unfortunately can
> >> vary for each virtio-scsi device in theory.  The virtqueue size is
> >> retrieved by drivers/virtio prior to calling vring_create_virtqueue, and
> >> in QEMU it is the second argument to virtio_add_queue.
> > 
> > Why is that unfortunate?  We don't even have to set can_queue in
> > the host template, we can dynamically set it on per-host basis.
> 
> Ah, cool, I thought allocations based on can_queue happened already in
> scsi_host_alloc, but they happen at scsi_add_host time.

I think I've decoded all that information into the patch below.

I tested it, and it appears to work: when I set cmd_per_lun on the
qemu command line, I see that the guest can add more disks:

  With scsi-mq enabled:   175 disks
  cmd_per_lun not set:177 disks  *
  cmd_per_lun=16: 776 disks  *
  cmd_per_lun=4: 1160 disks  *
  With scsi-mq disabled: 1755 disks
 * = new result

>From my point of view, this is a good result, but you should be warned
that I don't fully understand what's going on here and I may have made
obvious or not-so-obvious mistakes.

I tested the performance impact and it's not noticable in the
libguestfs case even with very small cmd_per_lun settings, but
libguestfs is largely serial and so this result won't be applicable to
guests in general.

Also, should the guest kernel validate cmd_per_lun to make sure it's
not too small or large?  And if so, what would the limits be?

Rich.

>From e923e49846189b2f55f3f02b70a290d4be237ed5 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" 
Date: Thu, 10 Aug 2017 12:21:47 +0100
Subject: [PATCH] scsi: virtio_scsi: Set can_queue based on cmd_per_lun passed
 by hypervisor.

Signed-off-by: Richard W.M. Jones 
---
 drivers/scsi/virtio_scsi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 9be211d68b15..b22591e9b16b 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -973,7 +973,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
goto virtscsi_init_failed;
 
cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
-   shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);
+   shost->cmd_per_lun = shost->can_queue = cmd_per_lun;
shost->max_sectors = virtscsi_config_get(vdev, max_sectors) ?: 0x;
 
/* LUNs > 256 are reported with format 1, so they go in the range
-- 
2.13.1


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Re: Increased memory usage with scsi-mq

2017-08-07 Thread Richard W.M. Jones

On Mon, Aug 07, 2017 at 02:11:39PM +0200, Paolo Bonzini wrote:
> You could also add a module parameter to the driver, and set it to 64 on
> the kernel command line (there is an example in
> drivers/scsi/vmw_pvscsi.c of how to do it).

[Proviso: I've not tested the performance of difference values, nor do
I have any particular knowledge in this area]

can_queue is documented as:

 * This determines if we will use a non-interrupt driven
 * or an interrupt driven scheme.  It is set to the maximum number
 * of simultaneous commands a given host adapter will accept.

Wouldn't it be better to make it default to k * number of CPUs for
some small integer k?

I looked at the other scsi drivers and they set it to all kinds of
values.  1, small integers, large integers, configurable values.

Also I noticed this code in virtio_scsi.c:

cmd_per_lun = virtscsi_config_get(vdev, cmd_per_lun) ?: 1;
shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue);

but setting cmd_per_lun (as a qemu virtio-scsi-pci parameter) didn't
seem to make any difference to memory usage.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

Re: Increased memory usage with scsi-mq

2017-08-05 Thread Richard W.M. Jones

On Sat, Aug 05, 2017 at 03:39:54PM +0200, Christoph Hellwig wrote:
> For now can you apply this testing patch to the guest kernel?
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 9be211d68b15..0cbe2c882e1c 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -818,7 +818,7 @@ static struct scsi_host_template 
> virtscsi_host_template_single = {
>   .eh_timed_out = virtscsi_eh_timed_out,
>   .slave_alloc = virtscsi_device_alloc,
>  
> - .can_queue = 1024,
> + .can_queue = 64,
>   .dma_boundary = UINT_MAX,
>   .use_clustering = ENABLE_CLUSTERING,
>   .target_alloc = virtscsi_target_alloc,
> @@ -839,7 +839,7 @@ static struct scsi_host_template 
> virtscsi_host_template_multi = {
>   .eh_timed_out = virtscsi_eh_timed_out,
>   .slave_alloc = virtscsi_device_alloc,
>  
> - .can_queue = 1024,
> + .can_queue = 64,
>   .dma_boundary = UINT_MAX,
>   .use_clustering = ENABLE_CLUSTERING,
>   .target_alloc = virtscsi_target_alloc,
> @@ -983,7 +983,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
>   shost->max_id = num_targets;
>   shost->max_channel = 0;
>   shost->max_cmd_len = VIRTIO_SCSI_CDB_SIZE;
> - shost->nr_hw_queues = num_queues;
> + shost->nr_hw_queues = 1;
>  
>  #ifdef CONFIG_BLK_DEV_INTEGRITY
>   if (virtio_has_feature(vdev, VIRTIO_SCSI_F_T10_PI)) {

Yes, that's an improvement, although it's still a little way off the
density possible the old way:

  With scsi-mq enabled:   175 disks
* With this patch:319 disks *
  With scsi-mq disabled: 1755 disks

Also only the first two hunks are necessary.  The kernel behaves
exactly the same way with or without the third hunk (ie. num_queues
must already be 1).

Can I infer from this that qemu needs a way to specify the can_queue
setting to the virtio-scsi driver in the guest kernel?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: Increased memory usage with scsi-mq

2017-08-05 Thread Richard W.M. Jones

On Sat, Aug 05, 2017 at 10:44:36AM +0200, Christoph Hellwig wrote:
> On Fri, Aug 04, 2017 at 10:00:47PM +0100, Richard W.M. Jones wrote:
> > I read your slides about scsi-mq and it seems like a significant
> > benefit to large machines, but could the out of the box defaults be
> > made more friendly for small memory machines?
> 
> The default inumber of queues and queue depth and thus memory usage is
> set by the LLDD.
> 
> Try to reduce the can_queue value in virtio_scsi and/or make sure
> you use the single queue variant in your VM (which should be tunable
> in qemu).

Thanks, this is interesting.

Virtio-scsi seems to have a few settable parameters that might be
related to this:

  DEFINE_PROP_UINT32("num_queues", VirtIOSCSI, parent_obj.conf.num_queues, 1),
  DEFINE_PROP_UINT32("max_sectors", VirtIOSCSI, parent_obj.conf.max_sectors,
0x),
  DEFINE_PROP_UINT32("cmd_per_lun", VirtIOSCSI, parent_obj.conf.cmd_per_lun,
128),

Unfortunately (assuming I'm setting them right - see below), none of
them have any effect on the number of disks that I can add to the VM.

I am testing them by placing them in the ‘-device virtio-scsi-pci’
parameter, ie. as a property of the controller, not a property of the
LUN, eg:

-device virtio-scsi-pci,cmd_per_lun=32,id=scsi \
-drive 
file=/home/rjones/d/libguestfs/tmp/libguestfshXImTv/scratch.1,cache=unsafe,format=raw,id=hd0,if=none
 \
-device scsi-hd,drive=hd0 \

The debugging output is a bit too large to attach to this email, but I
have placed it at the link below.  It contains (if you scroll down a
bit) the full qemu command line and full kernel output.

  http://oirase.annexia.org/tmp/bz1478201-log.txt

I can add some extra debugging into the kernel if you like.  Just
point me to the right place.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Increased memory usage with scsi-mq

2017-08-04 Thread Richard W.M. Jones

https://bugzilla.redhat.com/show_bug.cgi?id=1478201

We have a libguestfs test which adds 256 virtio-scsi disks to a qemu
virtual machine.  The VM has 500 MB of RAM, 1 vCPU and no swap.

This test has been failing for a little while.  It runs out of memory
during SCSI enumeration in early boot.

Tonight I bisected the cause to:

  5c279bd9e40624f4ab6e688671026d6005b066fa is the first bad commit
  commit 5c279bd9e40624f4ab6e688671026d6005b066fa
  Author: Christoph Hellwig 
  Date:   Fri Jun 16 10:27:55 2017 +0200

scsi: default to scsi-mq

Remove the SCSI_MQ_DEFAULT config option and default to the blk-mq I/O
path now that we had plenty of testing, and have I/O schedulers for
blk-mq.  The module option to disable the blk-mq path is kept around for
now.

Signed-off-by: Christoph Hellwig 
Signed-off-by: Martin K. Petersen 

  :04 04 57ec7d5d2ba76592a695f533a69f747700c31966
  c79f6ecb070acc4fadf6fc05ca9ba32bc9c0c665 Mdrivers

I also wrote a small test to see the maximum number of virtio-scsi
disks I could add to the above VM.  The results were very surprising
(to me anyhow):

  With scsi-mq enabled:   175 disks
  With scsi-mq disabled: 1755 disks

I don't know why the ratio is almost exactly 10 times.

I read your slides about scsi-mq and it seems like a significant
benefit to large machines, but could the out of the box defaults be
made more friendly for small memory machines?

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: [REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues") causes crashes in guest

2017-03-23 Thread Richard W.M. Jones

On Thu, Mar 23, 2017 at 01:13:50PM +0800, Jason Wang wrote:
> >From 312859b596e83a2164a8430343d31fce2a5ad808 Mon Sep 17 00:00:00 2001
> From: Jason Wang 
> Date: Thu, 23 Mar 2017 13:07:16 +0800
> Subject: [PATCH] virtio_pci: fix out of bound access for msix_names
> 
> Signed-off-by: Jason Wang 

I tested this, and it does appear to fix the crashes in
vp_modern_find_vqs.  Therefore:

Tested-by: Richard W.M. Jones 

Thanks,

Rich.

>  drivers/virtio/virtio_pci_common.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_pci_common.c 
> b/drivers/virtio/virtio_pci_common.c
> index df548a6..5905349 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -147,7 +147,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, 
> unsigned nvqs,
>  {
>   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>   const char *name = dev_name(&vp_dev->vdev.dev);
> - int i, err = -ENOMEM, allocated_vectors, nvectors;
> + int i, j, err = -ENOMEM, allocated_vectors, nvectors;
>   unsigned flags = PCI_IRQ_MSIX;
>   bool shared = false;
>   u16 msix_vec;
> @@ -212,7 +212,7 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, 
> unsigned nvqs,
>   if (!vp_dev->msix_vector_map)
>   goto out_disable_config_irq;
>  
> - allocated_vectors = 1; /* vector 0 is the config interrupt */
> + allocated_vectors = j = 1; /* vector 0 is the config interrupt */
>   for (i = 0; i < nvqs; ++i) {
>   if (!names[i]) {
>   vqs[i] = NULL;
> @@ -236,18 +236,19 @@ static int vp_find_vqs_msix(struct virtio_device *vdev, 
> unsigned nvqs,
>   continue;
>   }
>  
> - snprintf(vp_dev->msix_names[i + 1],
> + snprintf(vp_dev->msix_names[j],
>sizeof(*vp_dev->msix_names), "%s-%s",
>dev_name(&vp_dev->vdev.dev), names[i]);
>   err = request_irq(pci_irq_vector(vp_dev->pci_dev, msix_vec),
> vring_interrupt, IRQF_SHARED,
> -   vp_dev->msix_names[i + 1], vqs[i]);
> +   vp_dev->msix_names[j], vqs[i]);
>   if (err) {
>   /* don't free this irq on error */
>   vp_dev->msix_vector_map[i] = VIRTIO_MSI_NO_VECTOR;
>   goto out_remove_vqs;
>   }
>   vp_dev->msix_vector_map[i] = msix_vec;
> + j++;
>  
>   /*
>* Use a different vector for each queue if they are available,
> -- 
> 2.7.4
> 


-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

2017-03-23 Thread Richard W.M. Jones

On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote:
> Does the patch from Jason in the
> 
> "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for 
> virtqueues") causes crashes in guest"
> 
> thread fix the issue for you?

In brief, yes it does.  I followed up on that thread.

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

2017-03-23 Thread Richard W.M. Jones

On Thu, Mar 23, 2017 at 03:56:22PM +0100, Christoph Hellwig wrote:
> Does the patch from Jason in the
> 
> "[REGRESSION] 07ec51480b5e ("virtio_pci: use shared interrupts for 
> virtqueues") causes crashes in guest"
> 
> thread fix the issue for you?

I didn't see this thread before.  I'll check that out for you now.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Re: Random guest crashes since 5c34d002dcc7 ("virtio_pci: use shared interrupts for virtqueues")

2017-03-23 Thread Richard W.M. Jones

On Thu, Mar 23, 2017 at 03:51:25PM +0100, Thorsten Leemhuis wrote:
> Hi Christoph! Hi Michael!
> 
> (Mail roughly based on text from
> https://bugzilla.kernel.org/show_bug.cgi?id=194911 )
> 
> I'm seeing random crashes during boot every few boot attempts when
> running Linux 4.11-rc/mainline in a Fedora 26 guest under a CentOS7 host
> (CPU: Intel(R) Pentium(R) CPU G3220) using KVM. Sometimes when the guest
> actually booted the network did not work. To get some impressions of the
> crashes I got see this gallery:
> https://plus.google.com/+ThorstenLeemhuis/posts/FjyyGjNtrrG
> 
> Richard W.M. Jones and Adam Williamson see the same problems. See above
> bug for details. It seems they ran into the problem in the past few
> days, so I assume it's still present in mainline (I'm travelling
> currently and haven't had time for proper tests since last last Friday
> (pre-rc3); but I thought it's time to get the problem to the lists).
> 
> Long story short: Richard and I did bisections and we both found that
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07ec51480b5e
> ("virtio_pci: use shared interrupts for virtqueues") is the first bad
> commit. Any idea what might be wrong? Do you need more details from us
> to fix this?

Laura Abbott posted a kernel RPM which works for me.  She has had to
revert quite a number of commits, which are detailed in this comment:

  https://bugzilla.redhat.com/show_bug.cgi?id=1430297#c7

Her reverting patch is also attached.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
>From 4d3cba0be27b20516eb765c2913bce93e73fe30e Mon Sep 17 00:00:00 2001
From: Laura Abbott 
Date: Wed, 22 Mar 2017 15:41:27 -0700
Subject: [PATCH] Revert a bunch of virtio commits

07ec51480b5e ("virtio_pci: use shared interrupts for virtqueues")
is linked to a bunch of issues. Unfortunately we can't just revert it by
itself. Revert it and dependency patches as well.

Revert "virtio: provide a method to get the IRQ affinity mask for a virtqueue"

This reverts commit bbaba479563910aaa51e59bb9027a09e396d3a3c.

Revert "virtio-console: avoid DMA from stack"

This reverts commit c4baad50297d84bde1a7ad45e50c73adae4a2192.

Revert "vhost: introduce O(1) vq metadata cache"

This reverts commit f889491380582b4ba2981cf0b0d7d6a40fb30ab7.

 Conflicts:
drivers/vhost/vhost.c

Revert "virtio_scsi: use virtio IRQ affinity"

This reverts commit 0d9f0a52c8b9f7a003fe1650b7d5fb8518efabe0.

Revert "virtio_blk: use virtio IRQ affinity"

This reverts commit ad71473d9c43725c917fc5a86d54ceb7001ee28c.

Revert "blk-mq: provide a default queue mapping for virtio device"

This reverts commit 73473427bb551686e4b68ecd99bfd27e6635286a.

Revert "virtio: allow drivers to request IRQ affinity when creating VQs"

This reverts commit fb5e31d970ce8b4941f03ed765d7dbefc39f22d9.

Revert "virtio_pci: simplify MSI-X setup"

This reverts commit 52a61516125fa9a21b3bdf4f90928308e2e5573f.

Revert "virtio_pci: don't duplicate the msix_enable flag in struct pci_dev"

This reverts commit 53a020c661741f3b87ad3ac6fa545088aaebac9b.

Revert "virtio_pci: use shared interrupts for virtqueues"

This reverts commit 07ec51480b5eb1233f8c1b0f5d7a7c8d1247c507.
---
 block/Kconfig  |   5 -
 block/Makefile |   1 -
 block/blk-mq-virtio.c  |  54 --
 drivers/block/virtio_blk.c |  14 +-
 drivers/char/virtio_console.c  |  14 +-
 drivers/crypto/virtio/virtio_crypto_core.c |   2 +-
 drivers/gpu/drm/virtio/virtgpu_kms.c   |   2 +-
 drivers/misc/mic/vop/vop_main.c|   2 +-
 drivers/net/caif/caif_virtio.c |   3 +-
 drivers/net/virtio_net.c   |   2 +-
 drivers/remoteproc/remoteproc_virtio.c |   3 +-
 drivers/rpmsg/virtio_rpmsg_bus.c   |   2 +-
 drivers/s390/virtio/kvm_virtio.c   |   3 +-
 drivers/s390/virtio/virtio_ccw.c   |   3 +-
 drivers/scsi/virtio_scsi.c | 127 +++--
 drivers/vhost/vhost.c  | 136 +++---
 drivers/vhost/vhost.h  |   8 -
 drivers/virtio/virtio_balloon.c|   3 +-
 drivers/virtio/virtio_input.c  |   3 +-
 drivers/virtio/virtio_mmio.c   |   3 +-
 drivers/virtio/virtio_pci_common.c | 287 +++--
 drivers/virtio/virtio_pci_common.h |  25 ++-
 drivers/virtio/virtio_pci_legacy.c |   3 +-
 drivers/virtio/virtio_pci_modern.c |  11 +-
 include/linux/blk-mq-virtio.h  |  10 -

Re: [PATCH RESEND] hwrng: core - don't pass stack allocated buffer to rng->read()

2016-10-21 Thread Richard W.M. Jones

On Fri, Oct 21, 2016 at 02:04:27PM -0700, Andy Lutomirski wrote:
> https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=6d4952d9d9d4dc2bb9c0255d95a09405a1e958f7

I have tested this one, and it also fixes the bug I was seeing.

Thanks Laszlo as well for his fix, and sorry for not finding the
patch above first.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Re: Warning from free_init_pages with large initrd

2016-09-27 Thread Richard W.M. Jones

On Tue, Sep 27, 2016 at 11:20:06AM +0100, Sitsofe Wheeler wrote:
> (See http://www.gossamer-threads.com/lists/linux/kernel/2534175 for
> the history of this thread )
> 
> On 26 September 2016 at 20:00, Randy Dunlap  wrote:
> >
> > but the warning in free_init_pages() is about alignment, not size...
> > Maybe the concatenation is bad?
> 
> What would l have to pull apart to be able to tell?
> 
> Having said that I've just noticed that newer versions of the script
> concatenate an ISO to the initrd (rather than another cpio) -
> https://github.com/rhinstaller/livecd-tools/commit/8067be50907da9461e442c11a664c89e066ccac6#diff-88c69e43bb69726c532af3a136cc50e8
> . Unfortunately after rebuilding the initrd with the
> livecd-iso-to-pxeboot version from Fedora 24 the warning persists.
> CC'ing a few of the tool's authors on this email.

I would just comment that I wrote livecd-iso-to-pxeboot for
a specific purpose -- virt-p2v -- where the size of the ISO
rarely exceeded around 100-200 MB, which is large for an initramfs,
but about an order of magnitude smaller than what you are trying.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

[PATCH v3] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-05-03 Thread Richard W.M. Jones

Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

However if fips=1 as well, we ignore this flag as FIPS mode mandates
that the self-tests are run.

Signed-off-by: Richard W.M. Jones 
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c| 9 +
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
 
+   cryptomgr.notests
+[KNL] Disable crypto self-tests
+
cs89x0_dma= [HW,NET]
Format: 
 
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..fcd89fe 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 
type, u32 mask)
int j;
int rc;
 
+   if (!fips_enabled && notests) {
+   printk_once(KERN_INFO "alg: self-tests disabled\n");
+   return 0;
+   }
+
alg_test_descs_check_order();
 
if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4

[PATCH v3] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-05-03 Thread Richard W.M. Jones

v2 -> v3:

 - Ignore the flag if FIPS mode is enabled.

v1 -> v2:

 - Use printk_once.  Because the serial console is so slow, printing
   the message multiple times actually consumed about 6ms extra later
   on during the boot.

- - -

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possibe to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.

So this commit adds a runtime flag which callers can set to skip the
self-tests in the fast container/virtualization case.

Rich.

Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.

2016-04-29 Thread Richard W.M. Jones

On Fri, Apr 29, 2016 at 08:54:13AM -0700, Greg KH wrote:
> You are trying to take a generalized kernel and somehow "know" about the
> hardware ahead of time it is going to run on.  That seems like two
> conflicting requirements, don't you agree?

We would have the 8250 serial port in any kernel.  Even if Fedora
kernel maintainers allowed us to have specialized kernels for each
purpose, I would use the simple ISA serial port here because it allows
us to capture debug messages very early in the boot.  Alternatives
like virtio-console don't allow that.

The kernel does know what hardware it's running on - via the CPUID
hypervisor leaf.  It's also possible for us to tell the kernel about
the hardware using the command line, ACPI[*], DT, etc.  I'd really
like to tell the kernel this is a 16550A, not broken, you don't need
to spend time testing that.

There is prior art here: no_timer_check & lpj=..

Rich.

[*] Although ACPI is really slow, adding another 190ms, and for this
reason I have disabled it for now, but not investigated why it's so
slow.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.

2016-04-29 Thread Richard W.M. Jones

On Fri, Apr 29, 2016 at 08:16:35AM -0700, Greg KH wrote:
> On Fri, Apr 29, 2016 at 09:10:06AM +0100, Richard W.M. Jones wrote:
> > On Thu, Apr 28, 2016 at 03:56:33PM -0700, Greg KH wrote:
> > > On Thu, Apr 28, 2016 at 11:18:33PM +0100, Richard W.M. Jones wrote:
> > > > Currently autoconf spends 25ms (on my laptop) testing if the UART
> > > > exported to it by KVM is an 8250 without FIFO and/or with strange
> > > > quirks, which it obviously isn't.  Assume it is exported to us by a
> > > > hypervisor, it's a normal, working 16550A.
> > > > 
> > > > Signed-off-by: Richard W.M. Jones 
> > > > ---
> > > >  drivers/tty/serial/8250/8250_port.c | 7 +++
> > > >  1 file changed, 7 insertions(+)
> > > > 
> > > > diff --git a/drivers/tty/serial/8250/8250_port.c 
> > > > b/drivers/tty/serial/8250/8250_port.c
> > > > index 00ad2637..de19924 100644
> > > > --- a/drivers/tty/serial/8250/8250_port.c
> > > > +++ b/drivers/tty/serial/8250/8250_port.c
> > > > @@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
> > > > if (!port->iobase && !port->mapbase && !port->membase)
> > > > return;
> > > >  
> > > > +   /* Hypervisors always export working 16550A devices. */
> > > > +   if (cpu_has_hypervisor) {
> > > > +   up->port.type = PORT_16550A;
> > > > +   up->capabilities |= UART_CAP_FIFO;
> > > > +   return;
> > > > +   }
> > > 
> > > Have you audited vmware, virtualbox, and everyone else that provides a
> > > virtual uart device that it will work properly here?
> > > 
> > > qemu isn't all the world :)
> > 
> > Attached below is a slightly different approach.  If the user passes a
> > special flag on the kernel command line then we force 16550A and avoid
> > the 25ms delay.  Since the user chooses the flag, any concerns about
> > the behaviour of the hypervisor or use of VFIO should be moot.
> 
> No, no more module parameters, that's crazy, what happens when you have
> 64 serial ports in a system, which one is this option for?

In this (very special) case, the domain is running under qemu and
I know it only has one serial port that doesn't need probing.

What's the right way to avoid this 25ms delay?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: [PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-04-29 Thread Richard W.M. Jones

On Fri, Apr 29, 2016 at 12:59:57PM +0200, Stephan Mueller wrote:
> Am Freitag, 29. April 2016, 11:07:43 schrieb Richard W.M. Jones:
> 
> Hi Richard,
[...]
> > +   if (notests) {
> 
> What about if (!fips_enabled && notests) ?
>
> I am not sure whether the kernel should prevent mistakes in user space. A 
> mistake would be when setting fips=1 and notests=1 as the FIPS mode mandates 
> the self tests.

(Sorry, I just posted v2 before I saw this message.)  I saw the FIPS
stuff and thought about that.  Should we prevent mistakes like that?
I really don't know.

Rich.

> > +   pr_info("alg: self-tests disabled\n");
> > +   return 0;
> > +   }
> > +
> > alg_test_descs_check_order();
> > 
> > if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
> 
> 
> Ciao
> Stephan
> -- 
> | Nimm das Recht weg - |
> |  was ist dann der Staat noch anderes als eine große Räuberbande? |

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

[PATCH v2] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-04-29 Thread Richard W.M. Jones

Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

Signed-off-by: Richard W.M. Jones 
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c| 9 +
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
 
+   cryptomgr.notests
+[KNL] Disable crypto self-tests
+
cs89x0_dma= [HW,NET]
Format: 
 
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..5c0664d 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 
type, u32 mask)
int j;
int rc;
 
+   if (notests) {
+   printk_once(KERN_INFO "alg: self-tests disabled\n");
+   return 0;
+   }
+
alg_test_descs_check_order();
 
if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4

[PATCH v2] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-04-29 Thread Richard W.M. Jones

v1 -> v2:

 - Use printk_once.

Because the serial console is so slow, printing the message multiple
times consumed about 6ms extra later on in the boot.  Printing it only
once is both neater and avoids this extra overhead.

Rich.

- - -

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possibe to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.

So this commit adds a runtime flag which callers can set to skip the
self-tests in the fast container/virtualization case.

[PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-04-29 Thread Richard W.M. Jones

Running self-tests for a short-lived KVM VM takes 28ms on my laptop.
This commit adds a flag 'cryptomgr.notests' which allows them to be
disabled.

Signed-off-by: Richard W.M. Jones 
---
 Documentation/kernel-parameters.txt | 3 +++
 crypto/testmgr.c| 9 +
 2 files changed, 12 insertions(+)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 0b3de80..d4d5fb7 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -826,6 +826,9 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
 
+   cryptomgr.notests
+[KNL] Disable crypto self-tests
+
cs89x0_dma= [HW,NET]
Format: 
 
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index b86883a..dc613f2 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -35,6 +35,10 @@
 
 #include "internal.h"
 
+static bool notests;
+module_param(notests, bool, 0644);
+MODULE_PARM_DESC(notests, "disable crypto self-tests");
+
 #ifdef CONFIG_CRYPTO_MANAGER_DISABLE_TESTS
 
 /* a perfect nop */
@@ -3868,6 +3872,11 @@ int alg_test(const char *driver, const char *alg, u32 
type, u32 mask)
int j;
int rc;
 
+   if (notests) {
+   pr_info("alg: self-tests disabled\n");
+   return 0;
+   }
+
alg_test_descs_check_order();
 
if ((type & CRYPTO_ALG_TYPE_MASK) == CRYPTO_ALG_TYPE_CIPHER) {
-- 
2.7.4

[PATCH] crypto: Add a flag allowing the self-tests to be disabled at runtime.

2016-04-29 Thread Richard W.M. Jones

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 28ms (on my laptop) running crypto
algorithm self-tests.  Although it's possibe to disable these at
compile time, Fedora kernel maintainers want to maintain a single
kernel image for all uses.

So this commit adds a runtime flag which callers can set to skip the
self-tests in the fast container/virtualization case.

Rich.

Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.

2016-04-29 Thread Richard W.M. Jones

On Thu, Apr 28, 2016 at 03:56:33PM -0700, Greg KH wrote:
> On Thu, Apr 28, 2016 at 11:18:33PM +0100, Richard W.M. Jones wrote:
> > Currently autoconf spends 25ms (on my laptop) testing if the UART
> > exported to it by KVM is an 8250 without FIFO and/or with strange
> > quirks, which it obviously isn't.  Assume it is exported to us by a
> > hypervisor, it's a normal, working 16550A.
> > 
> > Signed-off-by: Richard W.M. Jones 
> > ---
> >  drivers/tty/serial/8250/8250_port.c | 7 +++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/drivers/tty/serial/8250/8250_port.c 
> > b/drivers/tty/serial/8250/8250_port.c
> > index 00ad2637..de19924 100644
> > --- a/drivers/tty/serial/8250/8250_port.c
> > +++ b/drivers/tty/serial/8250/8250_port.c
> > @@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
> > if (!port->iobase && !port->mapbase && !port->membase)
> > return;
> >  
> > +   /* Hypervisors always export working 16550A devices. */
> > +   if (cpu_has_hypervisor) {
> > +   up->port.type = PORT_16550A;
> > +   up->capabilities |= UART_CAP_FIFO;
> > +   return;
> > +   }
> 
> Have you audited vmware, virtualbox, and everyone else that provides a
> virtual uart device that it will work properly here?
> 
> qemu isn't all the world :)

Attached below is a slightly different approach.  If the user passes a
special flag on the kernel command line then we force 16550A and avoid
the 25ms delay.  Since the user chooses the flag, any concerns about
the behaviour of the hypervisor or use of VFIO should be moot.

Rich.

>From 5694addf0e05de9c842274be8392ebce5dc24280 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" 
Date: Thu, 28 Apr 2016 23:08:54 +0100
Subject: [PATCH] 8250: Allow user to force 16550A UART and avoid probing.

Currently autoconf spends 25ms (on my laptop) testing the UART.

Allow the user to avoid this delay if they know that all serial ports
(eg on a virtual machine) are ordinary 16550A.  The user does this by
passing '8250_base.really_16550a' on the command line.

Signed-off-by: Richard W.M. Jones 
---
 drivers/tty/serial/8250/8250_port.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c 
b/drivers/tty/serial/8250/8250_port.c
index 00ad2637..ac92f55 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -53,6 +53,10 @@
 #define DEBUG_AUTOCONF(fmt...) do { } while (0)
 #endif
 
+static bool really_16550a;
+module_param(really_16550a, bool, 0644);
+MODULE_PARM_DESC(really_16550a, "Don't probe, assume 16550A");
+
 #define BOTH_EMPTY (UART_LSR_TEMT | UART_LSR_THRE)
 
 /*
@@ -1171,6 +1175,12 @@ static void autoconfig(struct uart_8250_port *up)
if (!port->iobase && !port->mapbase && !port->membase)
return;
 
+   if (really_16550a) {
+   up->port.type = PORT_16550A;
+   up->capabilities |= UART_CAP_FIFO;
+   return;
+   }
+
DEBUG_AUTOCONF("ttyS%d: autoconf (0x%04lx, 0x%p): ",
   serial_index(port), port->iobase, port->membase);
 
-- 
2.7.4



-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Re: [PATCH] 8250: Hypervisors always export working 16550A UARTs.

2016-04-29 Thread Richard W.M. Jones

On Fri, Apr 29, 2016 at 10:01:08AM +0300, Matwey V. Kornilov wrote:
> 2016-04-29 1:18 GMT+03:00 Richard W.M. Jones :
> > [This is an opinionated patch, mainly for discussion.]
> >
> > I'm trying to reduce the time taken in the kernel in initcalls, with
> > my aim being to reduce the current ~700ms spent in initcalls before
> > userspace, down to something like 100ms.  All times on my Broadwell-U
> > laptop, under virtualization.  The purpose of this is to be able to
> > launch VMs around containers with minimal overhead, like Intel Clear
> > Containers, but using standard distro kernels and qemu.
> >
> > Currently the kernel spends 25ms inspecting the UART that we passed to
> > it from qemu to find out whether it's an 8250/16550/16550A perhaps
> > with a non-working FIFO or other quirks.  Well, it isn't -- it's a
> > working emulated 16550A, with a FIFO and no quirks, and if it isn't,
> > we should fix qemu.
> >
> > So the patch detects if we're running virtualized (perhaps it should
> > only check for qemu/KVM?) and if so, shortcuts the tests.
> 
> Does anybody know, whether it is possible to pass through real
> hardware serial port to a guest? It seems to be as simple as to pass
> through an interrupt and memory IO ports.

In theory it seems like something VFIO could do.  Passing something as
low performance as a serial port through would seem to make little
sense though.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

[PATCH] 8250: Hypervisors always export working 16550A UARTs.

2016-04-28 Thread Richard W.M. Jones

[This is an opinionated patch, mainly for discussion.]

I'm trying to reduce the time taken in the kernel in initcalls, with
my aim being to reduce the current ~700ms spent in initcalls before
userspace, down to something like 100ms.  All times on my Broadwell-U
laptop, under virtualization.  The purpose of this is to be able to
launch VMs around containers with minimal overhead, like Intel Clear
Containers, but using standard distro kernels and qemu.

Currently the kernel spends 25ms inspecting the UART that we passed to
it from qemu to find out whether it's an 8250/16550/16550A perhaps
with a non-working FIFO or other quirks.  Well, it isn't -- it's a
working emulated 16550A, with a FIFO and no quirks, and if it isn't,
we should fix qemu.

So the patch detects if we're running virtualized (perhaps it should
only check for qemu/KVM?) and if so, shortcuts the tests.

Rich.

[PATCH] 8250: Hypervisors always export working 16550A UARTs.

2016-04-28 Thread Richard W.M. Jones

Currently autoconf spends 25ms (on my laptop) testing if the UART
exported to it by KVM is an 8250 without FIFO and/or with strange
quirks, which it obviously isn't.  Assume it is exported to us by a
hypervisor, it's a normal, working 16550A.

Signed-off-by: Richard W.M. Jones 
---
 drivers/tty/serial/8250/8250_port.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/tty/serial/8250/8250_port.c 
b/drivers/tty/serial/8250/8250_port.c
index 00ad2637..de19924 100644
--- a/drivers/tty/serial/8250/8250_port.c
+++ b/drivers/tty/serial/8250/8250_port.c
@@ -1171,6 +1171,13 @@ static void autoconfig(struct uart_8250_port *up)
if (!port->iobase && !port->mapbase && !port->membase)
return;
 
+   /* Hypervisors always export working 16550A devices. */
+   if (cpu_has_hypervisor) {
+   up->port.type = PORT_16550A;
+   up->capabilities |= UART_CAP_FIFO;
+   return;
+   }
+
DEBUG_AUTOCONF("ttyS%d: autoconf (0x%04lx, 0x%p): ",
   serial_index(port), port->iobase, port->membase);
 
-- 
2.7.4

Re: [PATCH v2 0/2] vfs: Define new syscall getumask.

2016-04-18 Thread Richard W.M. Jones

On Sun, Apr 17, 2016 at 06:57:36PM -0700, Josh Triplett wrote:
> O_NOUMASK seems potentially useful to support implementation of umask
> entirely in userspace, which also addresses thread-safety.  A program
> could read its process umask out at startup, handle umask entirely in
> userspace (including for threads), and only interact with the system
> umask after fork and before exec.

I had a look at O_NOUMASK and there are a few problems:

It's relatively easy to implement for open(2).  A few filesystems
implement their own open so I had to go into those filesystems and
modify how they handle current_umask too.  And FUSE support is tricky
so I passed on that.

The real problem is that mkdir/mkdirat/mknod/mknodat are affected by
umask, but there is no convenient flags parameter to pass the
O_NOUMASK flag.  So I think the patch only half-solves the problem.

I have a patch which needs a bit more testing, which I can post if you
think that's helpful, but I don't think it would be acceptable in its
current state.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Re: [PATCH v2] procfs: expose umask in /proc//status

2016-04-15 Thread Richard W.M. Jones

On Fri, Apr 15, 2016 at 03:13:10PM +0200, Michal Hocko wrote:
> On Thu 14-04-16 12:08:15, Richard W.M. Jones wrote:
> > It's not possible to read the process umask without also modifying it,
> > which is what umask(2) does.  A library cannot read umask safely,
> > especially if the main program might be multithreaded.
> 
> It would be helpful to describe the usecase a bit better. Who needs to
> read the umask and what for (e.g. what if the umask changes right after
> it has been read by the 3rd party?).
> 
> I am not opposing the patch as is but this exports a new information to
> the userspace and we will have to maintain it for ever, so we should
> better describe the usecase.

The use case is that we have endless trouble with people setting weird
umask() values (usually on the grounds of "security"), and then
everything breaking.  I'm on the hook to fix these.  We'd like to add
debugging to our program so we can dump out the umask in debug
reports.

Previous versions of the patch used a syscall so you could only read
your own umask.  That's all I need.  However there was quite a lot of
push-back from those, so this new version exports it in /proc.

See:

https://lkml.org/lkml/2016/4/13/704 [umask2]
https://lkml.org/lkml/2016/4/13/487 [getumask]

Rich.

> > Add a new status line ("Umask") in /proc//status.  It contains
> > the file mode creation mask (umask) in octal.  It is only shown for
> > tasks which have task->fs.
> > 
> > This patch is adapted from one originally written by Pierre Carrier.
> > 
> > Signed-off-by: Richard W.M. Jones 
> > ---
> >  Documentation/filesystems/proc.txt |  1 +
> >  fs/proc/array.c| 20 +++-
> >  2 files changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/filesystems/proc.txt 
> > b/Documentation/filesystems/proc.txt
> > index 7f5607a..e8d0075 100644
> > --- a/Documentation/filesystems/proc.txt
> > +++ b/Documentation/filesystems/proc.txt
> > @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
> >   TracerPid   PID of process tracing this process (0 if not)
> >   Uid Real, effective, saved set, and  file system 
> > UIDs
> >   Gid Real, effective, saved set, and  file system 
> > GIDs
> > + Umask   file mode creation mask
> >   FDSize  number of file descriptor slots currently 
> > allocated
> >   Groups  supplementary group list
> >   NStgid  descendant namespace thread group ID hierarchy
> > diff --git a/fs/proc/array.c b/fs/proc/array.c
> > index b6c00ce..88c7de1 100644
> > --- a/fs/proc/array.c
> > +++ b/fs/proc/array.c
> > @@ -83,6 +83,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct 
> > task_struct *tsk)
> > return task_state_array[fls(state)];
> >  }
> >  
> > +static inline int get_task_umask(struct task_struct *tsk)
> > +{
> > +   struct fs_struct *fs;
> > +   int umask = -ENOENT;
> > +
> > +   task_lock(tsk);
> > +   fs = tsk->fs;
> > +   if (fs)
> > +   umask = fs->umask;
> > +   task_unlock(tsk);
> > +   return umask;
> > +}
> > +
> >  static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
> > struct pid *pid, struct task_struct *p)
> >  {
> > struct user_namespace *user_ns = seq_user_ns(m);
> > struct group_info *group_info;
> > -   int g;
> > +   int g, umask;
> > struct task_struct *tracer;
> > const struct cred *cred;
> > pid_t ppid, tpid = 0, tgid, ngid;
> > @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, 
> > struct pid_namespace *ns,
> > ngid = task_numa_group_id(p);
> > cred = get_task_cred(p);
> >  
> > +   umask = get_task_umask(p);
> > +   if (umask >= 0)
> > +   seq_printf(m, "Umask:\t%#04o\n", umask);
> > +
> > task_lock(p);
> > if (p->files)
> > max_fds = files_fdtable(p->files)->max_fds;
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

[PATCH v2] procfs: expose umask in /proc//status

2016-04-14 Thread Richard W.M. Jones

v1 -> v2:

 - Change printf format to %#04o.

 - Retest and update examples accordingly.

--

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc//status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

For the library this allows me to read the umask from
/proc/self/status.

This patch is adapted from one originally written by Pierre Carrier:
https://lkml.org/lkml/2012/5/4/451

Example usage:

$ grep Umask /proc/1/status 
Umask:  0022
$ grep Umask /proc/2/status 
Umask:  0022
$ grep Umask /proc/self/status 
Umask:  0022
$ umask 002
$ grep Umask /proc/self/status 
Umask:  0002

Rich.

[PATCH v2] procfs: expose umask in /proc//status

2016-04-14 Thread Richard W.M. Jones

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc//status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

Signed-off-by: Richard W.M. Jones 
---
 Documentation/filesystems/proc.txt |  1 +
 fs/proc/array.c| 20 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  TracerPid   PID of process tracing this process (0 if not)
  Uid Real, effective, saved set, and  file system UIDs
  Gid Real, effective, saved set, and  file system GIDs
+ Umask   file mode creation mask
  FDSize  number of file descriptor slots currently 
allocated
  Groups  supplementary group list
  NStgid  descendant namespace thread group ID hierarchy
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..88c7de1 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -139,12 +140,25 @@ static inline const char *get_task_state(struct 
task_struct *tsk)
return task_state_array[fls(state)];
 }
 
+static inline int get_task_umask(struct task_struct *tsk)
+{
+   struct fs_struct *fs;
+   int umask = -ENOENT;
+
+   task_lock(tsk);
+   fs = tsk->fs;
+   if (fs)
+   umask = fs->umask;
+   task_unlock(tsk);
+   return umask;
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *p)
 {
struct user_namespace *user_ns = seq_user_ns(m);
struct group_info *group_info;
-   int g;
+   int g, umask;
struct task_struct *tracer;
const struct cred *cred;
pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct 
pid_namespace *ns,
ngid = task_numa_group_id(p);
cred = get_task_cred(p);
 
+   umask = get_task_umask(p);
+   if (umask >= 0)
+   seq_printf(m, "Umask:\t%#04o\n", umask);
+
task_lock(p);
if (p->files)
max_fds = files_fdtable(p->files)->max_fds;
-- 
2.7.4

Re: [PATCH] procfs: expose umask in /proc//status (formerly umask2, formerly getumask)

2016-04-14 Thread Richard W.M. Jones

On Thu, Apr 14, 2016 at 10:34:48AM +0100, Richard W.M. Jones wrote:
> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.
> 
> Add a new status line ("Umask") in /proc//status.  It contains
> the file mode creation mask (umask) in octal.  It is only shown for
> tasks which have task->fs.
> 
> For the library this allows me to read the umask from
> /proc/self/status.
> 
> This patch is adapted from one originally written by Pierre Carrier:
> https://lkml.org/lkml/2012/5/4/451

Sorry, I meant to add an example of what this looks like:

$ grep Umask /proc/1/status 
Umask:022
$ grep Umask /proc/2/status 
Umask:022
$ grep Umask /proc/self/status 
Umask:022
$ umask 002
$ grep Umask /proc/self/status 
Umask:02

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

[PATCH] procfs: expose umask in /proc//status

2016-04-14 Thread Richard W.M. Jones

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc//status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

This patch is adapted from one originally written by Pierre Carrier.

Signed-off-by: Richard W.M. Jones 
---
 Documentation/filesystems/proc.txt |  1 +
 fs/proc/array.c| 20 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt 
b/Documentation/filesystems/proc.txt
index 7f5607a..e8d0075 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
  TracerPid   PID of process tracing this process (0 if not)
  Uid Real, effective, saved set, and  file system UIDs
  Gid Real, effective, saved set, and  file system GIDs
+ Umask   file mode creation mask
  FDSize  number of file descriptor slots currently 
allocated
  Groups  supplementary group list
  NStgid  descendant namespace thread group ID hierarchy
diff --git a/fs/proc/array.c b/fs/proc/array.c
index b6c00ce..03e8d3f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -83,6 +83,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -139,12 +140,25 @@ static inline const char *get_task_state(struct 
task_struct *tsk)
return task_state_array[fls(state)];
 }
 
+static inline int get_task_umask(struct task_struct *tsk)
+{
+   struct fs_struct *fs;
+   int umask = -ENOENT;
+
+   task_lock(tsk);
+   fs = tsk->fs;
+   if (fs)
+   umask = fs->umask;
+   task_unlock(tsk);
+   return umask;
+}
+
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *p)
 {
struct user_namespace *user_ns = seq_user_ns(m);
struct group_info *group_info;
-   int g;
+   int g, umask;
struct task_struct *tracer;
const struct cred *cred;
pid_t ppid, tpid = 0, tgid, ngid;
@@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct 
pid_namespace *ns,
ngid = task_numa_group_id(p);
cred = get_task_cred(p);
 
+   umask = get_task_umask(p);
+   if (umask >= 0)
+   seq_printf(m, "Umask:\t0%o\n", umask);
+
task_lock(p);
if (p->files)
max_fds = files_fdtable(p->files)->max_fds;
-- 
2.7.4

[PATCH] procfs: expose umask in /proc//status (formerly umask2, formerly getumask)

2016-04-14 Thread Richard W.M. Jones

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

Add a new status line ("Umask") in /proc//status.  It contains
the file mode creation mask (umask) in octal.  It is only shown for
tasks which have task->fs.

For the library this allows me to read the umask from
/proc/self/status.

This patch is adapted from one originally written by Pierre Carrier:
https://lkml.org/lkml/2012/5/4/451

Rich.

Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]

2016-04-14 Thread Richard W.M. Jones

On Thu, Apr 14, 2016 at 09:09:38AM +1000, Stephen Rothwell wrote:
> Hi Richard,
> 
> On Wed, 13 Apr 2016 20:05:33 +0100 "Richard W.M. Jones"  
> wrote:
> >
> > It's not possible to read the process umask without also modifying it,
> > which is what umask(2) does.  A library cannot read umask safely,
> > especially if the main program might be multithreaded.
> 
> I was wondering if you really need to read the umask, or would just a
> "ignore umask" flag to open{,at} do what you want?

This would be very useful, although I think being able to read umask
is also useful.

---

FWIW I am currently developing a patch to add umask to
/proc/PID/status.  Will post it shortly once I've tested it properly.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]

2016-04-13 Thread Richard W.M. Jones


On Wed, Apr 13, 2016 at 10:45:05PM +0200, Florian Weimer wrote:
> * H. Peter Anvin:
> 
> > I have to say I'm skeptic to the need for umask2() as opposed to
> > getumask().
> 
> I find the extension with a set-the-thread umask somewhat unlikely.
> How would a potential per-thread umask interact with CLONE_FS?
> Have a per-thread umask that, when active, overrides the global
> one, similar to what uselocale provides?  That seems rather messy,
> and I'm not aware of any precedent.

The flags parameter is for extensions we can't envisage now ...

... although since umask has been around since Unix V6 in 1978, we
might be waiting a long time.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

umask2 man page (was: Re: [PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask])

2016-04-13 Thread Richard W.M. Jones

UMASK(2)   Linux Programmer's Manual  UMASK(2)



NAME
   umask, umask2 - get and set file mode creation mask

SYNOPSIS
   #include 
   #include 

   mode_t umask(mode_t mask);

   #define _GNU_SOURCE
   #include 
   #include 
   #include 

   mode_t umask2(mode_t mask, int flags);

DESCRIPTION
   umask()  sets  the calling process's file mode creation mask (umask) to
   mask & 0777 (i.e., only the file permission bits of mask are used), and
   returns the previous value of the mask.

   If flags is 0, then umask2() is the same as umask().

   If flags is UMASK_GET_MASK then umask2() ignores the mask parameter and
   returns the process's current umask.  The process's current mask is not
   modified in this case.

   The  umask  is  used  by open(2), mkdir(2), and other system calls that
   create files to modify the permissions placed on newly created files or
   directories.   Specifically,  permissions  in  the umask are turned off
   from the mode argument to open(2) and mkdir(2).

   Alternatively, if the parent directory has a default ACL (see  acl(5)),
   the umask is ignored, the default ACL is inherited, the permission bits
   are set based on the inherited ACL, and permission bits absent  in  the
   mode  argument  are turned off.  For example, the following default ACL
   is equivalent to a umask of 022:

   u::rwx,g::r-x,o::r-x

   Combining the effect of this default ACL with a mode argument  of  0666
   (rw-rw-rw-), the resulting file permissions would be 0644 (rw-r--r--).

   The  constants  that should be used to specify mask are described under
   stat(2).

   The typical default value for the process  umask  is  S_IWGRP | S_IWOTH
   (octal  022).   In the usual case where the mode argument to open(2) is
   specified as:

   S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH

   (octal 0666) when creating a new file, the permissions on the resulting
   file will be:

   S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH

   (because 0666 & ~022 = 0644; i.e., rw-r--r--).

RETURN VALUE
   The  umask()  system call always succeeds and the previous value of the
   mask is returned.

   The umask2() system call returns the process's current  umask  on  suc‐
   cess.  On error it returns -1 and sets errno appropriately.

CONFORMING TO
   SVr4, 4.3BSD, POSIX.1-2001.

NOTES
   A  child  process created via fork(2) inherits its parent's umask.  The
   umask is left unchanged by execve(2).

   The umask setting also affects the permissions assigned  to  POSIX  IPC
   objects  (mq_open(3), sem_open(3), shm_open(3)), FIFOs (mkfifo(3)), and
   UNIX domain sockets (unix(7)) created by the process.  The  umask  does
   not  affect the permissions assigned to System V IPC objects created by
   the process (using msgget(2), semget(2), shmget(2)).

SEE ALSO
   chmod(2), mkdir(2), open(2), stat(2), acl(5)

COLOPHON
   This page is part of release 4.00 of the Linux  man-pages  project.   A
   description  of  the project, information about reporting bugs, and the
   latestversionofthispage,can be found at
   http://www.kernel.org/doc/man-pages/.



Linux 2016-04-13  UMASK(2)

Re: [PATCH v3 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

On Wed, Apr 13, 2016 at 11:52:53AM -0700, Davidlohr Bueso wrote:
> ENOMANPAGE

Where do man pages go?  In the man-pages project?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

[PATCH v4 3/3] vfs: selftests: Add test for umask2 system call.

2016-04-13 Thread Richard W.M. Jones

Signed-off-by: Richard W.M. Jones 
---
 tools/testing/selftests/Makefile  |  1 +
 tools/testing/selftests/fs/.gitignore |  1 +
 tools/testing/selftests/fs/Makefile   |  9 
 tools/testing/selftests/fs/umask2-tests.c | 77 +++
 4 files changed, 88 insertions(+)
 create mode 100644 tools/testing/selftests/fs/.gitignore
 create mode 100644 tools/testing/selftests/fs/Makefile
 create mode 100644 tools/testing/selftests/fs/umask2-tests.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b04afc3..9e2eb24 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += exec
 TARGETS += firmware
+TARGETS += fs
 TARGETS += ftrace
 TARGETS += futex
 TARGETS += ipc
diff --git a/tools/testing/selftests/fs/.gitignore 
b/tools/testing/selftests/fs/.gitignore
new file mode 100644
index 000..057dced
--- /dev/null
+++ b/tools/testing/selftests/fs/.gitignore
@@ -0,0 +1 @@
+umask2-tests
diff --git a/tools/testing/selftests/fs/Makefile 
b/tools/testing/selftests/fs/Makefile
new file mode 100644
index 000..6f231d7
--- /dev/null
+++ b/tools/testing/selftests/fs/Makefile
@@ -0,0 +1,9 @@
+BINARIES := umask2-tests
+
+all: $(BINARIES)
+
+clean:
+   $(RM) $(BINARIES)
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/fs/umask2-tests.c 
b/tools/testing/selftests/fs/umask2-tests.c
new file mode 100644
index 000..3e01575
--- /dev/null
+++ b/tools/testing/selftests/fs/umask2-tests.c
@@ -0,0 +1,77 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifndef UMASK_GET_MASK
+#define UMASK_GET_MASK 1
+#endif
+
+static int umask2_(int mask, int flags)
+{
+#ifdef __NR_umask2
+   return syscall(__NR_umask2, mask, flags);
+#else
+   errno = ENOSYS;
+   return -1;
+#endif
+}
+
+int main(int argc, char **argv)
+{
+   int r;
+
+   /* umask2 available in current kernel? */
+   r = umask2_(0, UMASK_GET_MASK);
+   if (r == -1 && errno == ENOSYS) {
+   fprintf(stderr,
+   "umask2 not available in current kernel or headers, 
skipping test\n");
+   exit(0);
+   }
+
+   /* Check that old umask still works. */
+   r = umask(022);
+   if (r == -1) {
+   perror("umask");
+   exit(1);
+   }
+
+   /* Call umask2 to emulate old umask. */
+   r = umask2_(023, 0);
+   if (r == -1) {
+   perror("umask2");
+   exit(1);
+   }
+   if (r != 022) {
+   fprintf(stderr, "umask2: expected %o, got %o\n", 022, r);
+   exit(1);
+   }
+
+   /* Call umask2 to read current umask without modifying it. */
+   r = umask2_(0777, UMASK_GET_MASK);
+   if (r == -1) {
+   perror("umask2");
+   exit(1);
+   }
+   if (r != 023) {
+   fprintf(stderr, "umask2: expected %o, got %o\n", 023, r);
+   exit(1);
+   }
+
+   /* Call it again to make sure we didn't modify umask. */
+   r = umask2_(0777, UMASK_GET_MASK);
+   if (r == -1) {
+   perror("umask2");
+   exit(1);
+   }
+   if (r != 023) {
+   fprintf(stderr, "umask2: expected %o, got %o\n", 023, r);
+   exit(1);
+   }
+
+   exit(0);
+}
-- 
2.7.4

[PATCH v4 0/3] vfs: Define new syscall umask2 [formerly getumask]

2016-04-13 Thread Richard W.M. Jones

v3 -> v4:

 - Rename the syscall: getumask becomes umask2.

 - Add flags parameter, with one flag (UMASK_GET_MASK).

 - Expand the rationale for this change in the first commit message.

 - Add a selftest.

 - Retest everything.



It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

This patch series adds a new system call "umask2".  This adds a flags
parameter.  Specifying flags=UMASK_GET_MASK allows the umask of the
current process to be read without modifying it.

This leaves open the possibility in future of adding a per-thread
umask, set or read with other flags.  This is not implemented.

Another approach to this has been attempted before, adding something
to /proc, although it didn't go anywhere.  See:

  http://comments.gmane.org/gmane.linux.kernel/1292109

Another way to solve this would be to add a thread-safe getumask to
glibc.  Since glibc could own the mutex, this would permit libraries
linked to this glibc to read umask safely.  I should also note that
man-pages documents getumask(3), but no version of glibc has ever
implemented it.

Rich.

[PATCH v4 1/3] vfs: Define new syscall umask2.

2016-04-13 Thread Richard W.M. Jones

Before this change it was not possible to read the umask of the
current process from a shared library in a way that is safe if the
process uses multiple threads.

Define a variation of the umask system call with a flags parameter.
If flags=0 it behaves like the old umask call.  If flags=UMASK_GET_MASK
it reads the umask of the current process without modifying it.

Signed-off-by: Richard W.M. Jones 
---
 include/linux/syscalls.h  |  1 +
 include/uapi/asm-generic/fcntl.h  |  2 ++
 include/uapi/asm-generic/unistd.h |  4 +++-
 kernel/sys.c  | 18 +-
 4 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d795472..939e969 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -659,6 +659,7 @@ asmlinkage long sys_prlimit64(pid_t pid, unsigned int 
resource,
struct rlimit64 __user *old_rlim);
 asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
+asmlinkage long sys_umask2(int mask, int flags);
 
 asmlinkage long sys_msgget(key_t key, int msgflg);
 asmlinkage long sys_msgsnd(int msqid, struct msgbuf __user *msgp,
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index e063eff..f6c5e5a 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -217,4 +217,6 @@ struct flock64 {
 };
 #endif
 
+#define UMASK_GET_MASK 1   /* umask2 should only return current umask */
+
 #endif /* _ASM_GENERIC_FCNTL_H */
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index 2622b33..8a61471 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -717,9 +717,11 @@ __SYSCALL(__NR_membarrier, sys_membarrier)
 __SYSCALL(__NR_mlock2, sys_mlock2)
 #define __NR_copy_file_range 285
 __SYSCALL(__NR_copy_file_range, sys_copy_file_range)
+#define __NR_umask2 286
+__SYSCALL(__NR_umask2, sys_umask2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 286
+#define __NR_syscalls 287
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys.c b/kernel/sys.c
index cf8ba54..cf60091 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1643,12 +1643,28 @@ COMPAT_SYSCALL_DEFINE2(getrusage, int, who, struct 
compat_rusage __user *, ru)
 }
 #endif
 
-SYSCALL_DEFINE1(umask, int, mask)
+static int do_umask(int mask, int flags)
 {
+   if (flags & ~UMASK_GET_MASK)
+   return -EINVAL;
+
+   if (flags & UMASK_GET_MASK)
+   return current_umask();
+
mask = xchg(¤t->fs->umask, mask & S_IRWXUGO);
return mask;
 }
 
+SYSCALL_DEFINE1(umask, int, mask)
+{
+   return do_umask(mask, 0);
+}
+
+SYSCALL_DEFINE2(umask2, int, mask, int, flags)
+{
+   return do_umask(mask, flags);
+}
+
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
struct fd exe;
-- 
2.7.4

[PATCH v4 2/3] x86: Wire up new umask2 system call on x86.

2016-04-13 Thread Richard W.M. Jones

Signed-off-by: Richard W.M. Jones 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index b30dd81..ccf75ba 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -386,3 +386,4 @@
 377i386copy_file_range sys_copy_file_range
 378i386preadv2 sys_preadv2
 379i386pwritev2sys_pwritev2
+380i386umask2  sys_umask2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index cac6d17..e566c64 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -335,6 +335,7 @@
 326common  copy_file_range sys_copy_file_range
 32764  preadv2 sys_preadv2
 32864  pwritev2sys_pwritev2
+329common  umask2  sys_umask2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.7.4

Re: [PATCH v2 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

On Wed, Apr 13, 2016 at 11:41:45AM -0400, Colin Walters wrote:
> On Wed, Apr 13, 2016, at 08:57 AM, Richard W.M. Jones wrote:
> 
> > It's not possible to read the process umask without also modifying it,
> > which is what umask(2) does.  A library cannot read umask safely,
> > especially if the main program might be multithreaded.
> 
> I assume you just want to do this from a shared library so you can
> determine whether or not you need to call fchown() after making files
> and the like?  If that's the case it'd be good to note it in the commit
> message.

Yes, the use case is something like that.  I write a shared library
(libguestfs) and we get bug reports that turn out to be caused by odd
umask settings.  Of course we fix these on a case-by-case basis, but
we also want to include the current umask in debug output so that we
can identify the problem quickly in future reports.

Actually I wrote a rather involved getumask substitute:

  https://github.com/libguestfs/libguestfs/blob/master/src/launch.c#L477

It works by creating a temporary directory, writing a file inside that
directory with mode 0777, then calling fstat(2) to work out what mode
the kernel gave it.

It turns out this code is not even correct.  It was pointed out to me
that there is a filesystem umask mount option (and fmask, dmask too)
which stops this from working properly.

So it's a lot of work to read umask safely inside a shared library.

I will update the commit comment with a brief summary of the above.

> BTW...it might be a good idea to add a flags argument:
> https://lwn.net/Articles/585415/
>
> Did you consider calling this `umask2`, having the initial version
> only support retrieving it via a UMASK_GET flag, and lay the
> groundwork to support setting a threadsafe umask with a
> UMASK_SET_THREAD flag?

Can certainly do it like this if that is preferable.

For my needs, getumask as implemented now is sufficient.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

[PATCH v3 1/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

Define a system call for reading the current umask value.

Signed-off-by: Richard W.M. Jones 
---
 include/linux/syscalls.h  | 1 +
 include/uapi/asm-generic/unistd.h | 4 +++-
 kernel/sys.c  | 5 +
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d795472..e96e88f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -659,6 +659,7 @@ asmlinkage long sys_prlimit64(pid_t pid, unsigned int 
resource,
struct rlimit64 __user *old_rlim);
 asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
+asmlinkage long sys_getumask(void);
 
 asmlinkage long sys_msgget(key_t key, int msgflg);
 asmlinkage long sys_msgsnd(int msqid, struct msgbuf __user *msgp,
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index 2622b33..e59e880 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -717,9 +717,11 @@ __SYSCALL(__NR_membarrier, sys_membarrier)
 __SYSCALL(__NR_mlock2, sys_mlock2)
 #define __NR_copy_file_range 285
 __SYSCALL(__NR_copy_file_range, sys_copy_file_range)
+#define __NR_getumask 286
+__SYSCALL(__NR_getumask, sys_getumask)
 
 #undef __NR_syscalls
-#define __NR_syscalls 286
+#define __NR_syscalls 287
 
 /*
  * All syscalls below here should go away really,
diff --git a/kernel/sys.c b/kernel/sys.c
index cf8ba54..9db526c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1649,6 +1649,11 @@ SYSCALL_DEFINE1(umask, int, mask)
return mask;
 }
 
+SYSCALL_DEFINE0(getumask)
+{
+   return current_umask();
+}
+
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
struct fd exe;
-- 
2.7.4

[PATCH v3 2/2] x86: Wire up new getumask system call on x86.

2016-04-13 Thread Richard W.M. Jones

Signed-off-by: Richard W.M. Jones 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index b30dd81..af0a032 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -386,3 +386,4 @@
 377i386copy_file_range sys_copy_file_range
 378i386preadv2 sys_preadv2
 379i386pwritev2sys_pwritev2
+380i386getumasksys_getumask
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index cac6d17..47c1579 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -335,6 +335,7 @@
 326common  copy_file_range sys_copy_file_range
 32764  preadv2 sys_preadv2
 32864  pwritev2sys_pwritev2
+329common  getumasksys_getumask
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.7.4

[PATCH v3 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

v2 -> v3:

 - Add the syscall to uapi/asm-generic/unistd.h.

 - Retest.

--

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

This patch series adds a trivial system call "getumask" which returns
the umask of the current process.

Another approach to this has been attempted before, adding something
to /proc, although it didn't go anywhere.  See:

  http://comments.gmane.org/gmane.linux.kernel/1292109

Another way to solve this would be to add a thread-safe getumask to
glibc.  Since glibc could own the mutex, this would permit libraries
linked to this glibc to read umask safely.

I should also note that man-pages documents getumask(3), but no
version of glibc has ever implemented it.

Typical test script:

#include 
#include 
#include 
#include 

int main(int argc, char *argv[])
{
  int r = syscall(329);
  if (r == -1) {
perror("getumask");
exit(1);
  }
  printf("umask = %o\n", r);
  exit(0);
}

$ ./getumask 
umask = 22

Rich.

Re: [PATCH v2 1/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

On Wed, Apr 13, 2016 at 04:20:32PM +0300, Cyrill Gorcunov wrote:
> On Wed, Apr 13, 2016 at 01:57:51PM +0100, Richard W.M. Jones wrote:
> > Define a system call for reading the current umask value.
> > 
> > Signed-off-by: Richard W.M. Jones 
> 
> Btw don't we have to declare it in include/uapi/asm-generic/unistd.h as well?

Yes, I think I do.  I was following pwritev2 which wasn't added
to this file, but other recent system calls (mlock2, copy_file_range)
were added.

TBH the documentation for this file is not very clear...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

[PATCH v2 2/2] x86: Wire up new getumask system call on x86.

2016-04-13 Thread Richard W.M. Jones

Signed-off-by: Richard W.M. Jones 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index b30dd81..af0a032 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -386,3 +386,4 @@
 377i386copy_file_range sys_copy_file_range
 378i386preadv2 sys_preadv2
 379i386pwritev2sys_pwritev2
+380i386getumasksys_getumask
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index cac6d17..47c1579 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -335,6 +335,7 @@
 326common  copy_file_range sys_copy_file_range
 32764  preadv2 sys_preadv2
 32864  pwritev2sys_pwritev2
+329common  getumasksys_getumask
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.7.4

[PATCH v2 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

v1 -> v2:

 - Use current_umask() instead of current->fs->umask.

 - Retested it.

--

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

This patch series adds a trivial system call "getumask" which returns
the umask of the current process.

Another approach to this has been attempted before, adding something
to /proc, although it didn't go anywhere.  See:

  http://comments.gmane.org/gmane.linux.kernel/1292109

Another way to solve this would be to add a thread-safe getumask to
glibc.  Since glibc could own the mutex, this would permit libraries
linked to this glibc to read umask safely.

I should also note that man-pages documents getumask(3), but no
version of glibc has ever implemented it.

Typical test script:

#include 
#include 
#include 
#include 

int main(int argc, char *argv[])
{
  int r = syscall(329);
  if (r == -1) {
perror("getumask");
exit(1);
  }
  printf("umask = %o\n", r);
  exit(0);
}

$ ./getumask 
umask = 22

Rich.

[PATCH v2 1/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

Define a system call for reading the current umask value.

Signed-off-by: Richard W.M. Jones 
---
 include/linux/syscalls.h | 1 +
 kernel/sys.c | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d795472..e96e88f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -659,6 +659,7 @@ asmlinkage long sys_prlimit64(pid_t pid, unsigned int 
resource,
struct rlimit64 __user *old_rlim);
 asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
+asmlinkage long sys_getumask(void);
 
 asmlinkage long sys_msgget(key_t key, int msgflg);
 asmlinkage long sys_msgsnd(int msqid, struct msgbuf __user *msgp,
diff --git a/kernel/sys.c b/kernel/sys.c
index cf8ba54..9db526c 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1649,6 +1649,11 @@ SYSCALL_DEFINE1(umask, int, mask)
return mask;
 }
 
+SYSCALL_DEFINE0(getumask)
+{
+   return current_umask();
+}
+
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
struct fd exe;
-- 
2.7.4

[PATCH 0/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

It's not possible to read the process umask without also modifying it,
which is what umask(2) does.  A library cannot read umask safely,
especially if the main program might be multithreaded.

This patch series adds a trivial system call "getumask" which returns
the umask of the current process.

Another approach to this has been attempted before, adding something
to /proc, although it didn't go anywhere.  See:

  http://comments.gmane.org/gmane.linux.kernel/1292109

Another way to solve this would be to add a thread-safe getumask to
glibc.  Since glibc could own the mutex, this would permit libraries
linked to this glibc to read umask safely.

I should also note that man-pages documents getumask(3), but no
version of glibc has ever implemented it.

Typical test script:

#include 
#include 
#include 
#include 

int main(int argc, char *argv[])
{
  int r = syscall(329);
  if (r == -1) {
perror("getumask");
exit(1);
  }
  printf("umask = %o\n", r);
  exit(0);
}

$ ./getumask 
umask = 22

Rich.

[PATCH 2/2] x86: Wire up new getumask system call on x86.

2016-04-13 Thread Richard W.M. Jones

Signed-off-by: Richard W.M. Jones 
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index b30dd81..af0a032 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -386,3 +386,4 @@
 377i386copy_file_range sys_copy_file_range
 378i386preadv2 sys_preadv2
 379i386pwritev2sys_pwritev2
+380i386getumasksys_getumask
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index cac6d17..47c1579 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -335,6 +335,7 @@
 326common  copy_file_range sys_copy_file_range
 32764  preadv2 sys_preadv2
 32864  pwritev2sys_pwritev2
+329common  getumasksys_getumask
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.7.4

[PATCH 1/2] vfs: Define new syscall getumask.

2016-04-13 Thread Richard W.M. Jones

Define a system call for reading the current umask value.

Signed-off-by: Richard W.M. Jones 
---
 include/linux/syscalls.h | 1 +
 kernel/sys.c | 5 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index d795472..e96e88f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -659,6 +659,7 @@ asmlinkage long sys_prlimit64(pid_t pid, unsigned int 
resource,
struct rlimit64 __user *old_rlim);
 asmlinkage long sys_getrusage(int who, struct rusage __user *ru);
 asmlinkage long sys_umask(int mask);
+asmlinkage long sys_getumask(void);
 
 asmlinkage long sys_msgget(key_t key, int msgflg);
 asmlinkage long sys_msgsnd(int msqid, struct msgbuf __user *msgp,
diff --git a/kernel/sys.c b/kernel/sys.c
index cf8ba54..026c146 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1649,6 +1649,11 @@ SYSCALL_DEFINE1(umask, int, mask)
return mask;
 }
 
+SYSCALL_DEFINE0(getumask)
+{
+   return current->fs->umask;
+}
+
 static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 {
struct fd exe;
-- 
2.7.4

Re: Oops from calibrate_delay_is_known on qemu machine with Linux v4.5-1523-g271ecc5253e2

2016-03-19 Thread Richard W.M. Jones

On Thu, Mar 17, 2016 at 01:54:36PM -0400, Josh Boyer wrote:
> Hi Thomas,
> 
> We've had a report [1] of the mainline kernel crashing on a single-cpu
> QEMU machine (not kvm) in Fedora.  It looks as if the emulated machine
> is failing to provide a TSC and the calibrate_delay_is_known function
> is passing NULL to cpumask_any_but for the mask parameter.  At least
> that's all I've been able to discern thus far.
> 
> I was wondering if you had any insight into this issue, given your
> recent commit to change calibrate_delay_is_known to use
> topology_core_cpumask.  The backtrace is below.

Thanks for reporting this upstream Josh.

I've added to the bug a simple test script that reproduces the issue,
not 100% reliably by any means, but some of the time.  See:

  https://bugzilla.redhat.com/show_bug.cgi?id=1318596#c6

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

Re: [PATCH] x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known()

2016-03-19 Thread Richard W.M. Jones

On Fri, Mar 18, 2016 at 08:48:06AM +0100, Thomas Gleixner wrote:
> Subject: x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known()
> From: Thomas Gleixner 
> Date: Fri, 18 Mar 2016 08:35:29 +0100
> 
> The topology_core_cpumask is used to find a neighbour cpu in
> calibrate_delay_is_known(). It might not be allocated at the first invocation
> of that function on the boot cpu, when CONFIG_CPUMASK_OFFSTACK is set.
> 
> The mask is allocated later in native_smp_prepare_cpus. As a consequence the
> underlying find_next_bit() call dereferences a NULL pointer.
> 
> Add a proper check to prevent this.
> 
> Reported-by: Richard W.M. Jones 
> Fixes: c25323c07345 "x86/tsc: Use topology functions"
> Signed-off-by: Thomas Gleixner 
> Cc: Josh Boyer 

I have tested the current upstream kernel (9dffdb38d) and was able to
reproduce the bug.  I then added this patch on top and it fixes the
problem for me.  Therefore:

Tested-by: Richard W.M. Jones 

Thanks, Rich.

> ---
>  arch/x86/kernel/tsc.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1306,11 +1306,15 @@ void __init tsc_init(void)
>  unsigned long calibrate_delay_is_known(void)
>  {
>   int sibling, cpu = smp_processor_id();
> + struct cpumask *mask = topology_core_cpumask(cpu);
>  
>   if (!tsc_disabled && !cpu_has(&cpu_data(cpu), X86_FEATURE_CONSTANT_TSC))
>   return 0;
>  
> - sibling = cpumask_any_but(topology_core_cpumask(cpu), cpu);
> + if (!mask)
> + return 0;
> +
> + sibling = cpumask_any_but(mask, cpu);
>   if (sibling < nr_cpu_ids)
>   return cpu_data(sibling).loops_per_jiffy;
>   return 0;

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Re: [PATCH v3 0/7] User namespace mount updates

2015-11-19 Thread Richard W.M. Jones

On Thu, Nov 19, 2015 at 03:49:00PM +0100, Richard Weinberger wrote:
> Am 19.11.2015 um 15:37 schrieb Colin Walters:
> > On Thu, Nov 19, 2015, at 02:53 AM, Richard Weinberger wrote:
> > 
> >> Erm, I don't want this in the kernel. That's why I've proposed the lklfuse 
> >> approach.
> > 
> > I already said this before but just to repeat, since I'm confused:
> > 
> > How would "lklfuse" be different from http://libguestfs.org/
> > which we at Red Hat (and a number of other organizations)
> > use quite widely now for build systems, debugging etc.
> 
> Currently libguestfs has a rather huge overhead because it
> boots a full virtual machine and hence a lot of communication
> is needed.
> With LKL you can use Linux as Library and link it to fuse.
> AFAIK Richard added already a LKL backend to libguestfs. :-)

Right.  For the longer story, see:

https://rwmj.wordpress.com/2015/11/07/linux-kernel-library-backend-for-libguestfs/#content

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 00/28] Linux Kernel Library

2015-11-07 Thread Richard W.M. Jones


I just pushed a (very early) WIP branch that contains changes to
libguestfs to add an LKL backend:

  https://github.com/rwmjones/libguestfs/tree/lkl

Read the README file in the libguestfs sources before starting,
followed by the instructions in the commit message:

  
https://github.com/rwmjones/libguestfs/commit/e38525f0b984d0a426f3348d95f2033673d4eaa4

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 00/28] Linux Kernel Library

2015-11-06 Thread Richard W.M. Jones

On Sat, Nov 07, 2015 at 01:35:36AM +0100, Richard Weinberger wrote:
> Am 04.11.2015 um 15:15 schrieb Octavian Purdila:
> > We could redefine the syscalls/libc symbols to call lkl_sys_ functions
> > in launch-lkl, e.g.:
> > 
> > int opendir(const char *path)
> > {
> >return lkl_opendir(new_path)
> > }
> 
> To get a better feeling how LKL behaves I've started with a tool
> to mount any Linux filesystem by FUSE.
> I.e. such that we can finally automount without root and bugs in filesystem
> code won't hurt that much.

guestmount already does this:

http://libguestfs.org/guestmount.1.html

By porting a small amount of code from the daemon/ directory, it could
do it using lkl too.  See:

http://www.gossamer-threads.com/lists/linux/kernel/2296116#2296116

Rich.

> lkl_sys_fstatat64() uses the type struct lkl_stat64. Where is it defined?
> git grep is unable to locate it.
> At least it seems to be incompatible with my local struct stat.
> 
> And why is there no lkl_sys_openat() syscall?
> 
> Thanks,
> //richard

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 00/28] Linux Kernel Library

2015-11-04 Thread Richard W.M. Jones

On Wed, Nov 04, 2015 at 01:24:03AM +0200, Octavian Purdila wrote:
> Thanks for the pointers Richard, I am going to take a look at it.

Now I've had a chance to look at some of the example LKL tools, here's
what this actually involves.  It's not actually a great deal of work,
it could probably be done in a day or two, but see my question about
`lkl_sys_*' below.

libguestfs (the library part) needs to talk over an RPC connection to
its daemon.  See diagram here:

http://libguestfs.org/guestfs-internals.1.html

The code in src/launch-{direct,libvirt,uml,...}.c sets up that
connection and runs the daemon -- normally inside a qemu wrapper, but
it could be inside UML.  For LKL I think it should just fork the
daemon directly.

The daemon would then be linked to LKL.

So really what's needed is a src/launch-lkl.c probably modelled after
one of these current backends:

https://github.com/libguestfs/libguestfs/blob/master/src/launch-uml.c
https://github.com/libguestfs/libguestfs/blob/master/src/launch-unix.c

and then recompile the daemon to link to LKL:

https://github.com/libguestfs/libguestfs/tree/master/daemon

and pass the list of disk images to the daemon, probably best to do
that on the guestfsd command line.

My only problem here: you can't just link to daemon to LKL, do you
have to change all of the system calls from `foo' to `lkl_sys_foo'?
That's an awful lot of #ifdefs ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH 00/28] Linux Kernel Library

2015-11-03 Thread Richard W.M. Jones

On Tue, Nov 03, 2015 at 10:40:29PM +0100, Richard Weinberger wrote:
> On Tue, Nov 3, 2015 at 9:20 PM, Octavian Purdila
>  wrote:
> > LKL (Linux Kernel Library) is aiming to allow reusing the Linux kernel code
> > as extensively as possible with minimal effort and reduced maintenance
> > overhead.
> >
> > Examples of how LKL can be used are: creating userspace applications
> > (running on Linux and other operating systems) that can read or write Linux
> > filesystems or can use the Linux networking stack, creating kernel drivers
> > for other operating systems that can read Linux filesystems, bootloaders
> > support for reading/writing Linux filesystems, etc.
> >
> > With LKL, the kernel code is compiled into an object file that can be
> > directly linked by applications. The API offered by LKL is based on the
> > Linux system call interface.
> >
> > LKL is implemented as an architecture port in arch/lkl. It relies on host
> > operations defined by the application or a host library (tools/lkl/lib).
> >
> > The latest LKL version can be found at g...@github.com:lkl/linux.git
> 
> Or more copy&paste friendly: https://github.com/lkl/linux.git
> 
> > FAQ
> > ===
> >
> > Q: How is LKL different from UML?
> > A: UML provides a full OS environment (e.g. user/kernel separation, user
> > processes) and also has requirements (a filesystem, processes, etc.) that
> > makes it hard to use it for standalone applications. UML also relies
> > heavily on Linux hosts. On the other hand LKL is designed to be linked
> > directly with the application and hence does not have user/kernel
> > separation which makes it easier to use it in standalone applications.
> 
> So, this is a "liblinux" where applications are directly linked
> against the kernel.
> IOW system calls are plain function calls into the kernel?
> 
> This eliminates UML's most problematic areas, system call handling via 
> ptrace()
> and virtual memory management via SIGSEGV. :-)
> 
> > Q: How is LKL different from LibOS?
> > A: LibOS re-implements high-level kernel APIs for timers, softirqs,
> > scheduling, sysctl, SLAB/SLUB, etc. LKL behaves like any arch port,
> > implementing the arch level operations requested by the Linux kernel. LKL
> > also offers a host interface so that support for multiple hosts can be
> > easily implemented.
> 
> Yeah, these re-implementations are what I find most worrisome about LibOS.
> 
> >
> > Building LKL the host library and LKL applications
> > ==
> >
> > % cd tools/lkl
> > % make
> >
> > will build LKL as a object file, it will install it in tools/lkl/lib 
> > together
> > with the headers files in tools/lkl/include then will build the host 
> > library,
> > tests and a few of application examples:
> >
> > * tests/boot - a simple applications that uses LKL and exercises the basic
> > LKL APIs
> >
> > * fs2tar - a tool that converts a filesystem image to a tar archive
> >
> > * cptofs/cpfromfs - a tool that copies files to/from a filesystem image
> 
> Seeing forward to have a libguestfs port. :-)

Thanks - I was keeping an eye on libos (and on the NetBSD rump kernel
stuff before), ready to integrate them into libguestfs as soon as they
offered filesystem access.

It's easy to write a libguestfs-compatible backend, which brings all
the virt-* tools from libguestfs to the new code.  The UML one looks
like this:

https://github.com/libguestfs/libguestfs/blob/master/src/launch-uml.c

I'm dubious that a lib-based approach could support LVM, partioning,
ntfs-3g, qcow2, vmdk and all the other libguestfs stuff that relies on
userspace tools + qemu as well as just the kernel drivers.
Nevertheless a fast subset of libguestfs supporting just kernel
filesystem drivers could be useful.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH block/for-linus] block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg

2015-09-06 Thread Richard W.M. Jones

On Sat, Sep 05, 2015 at 03:47:36PM -0400, Tejun Heo wrote:
> While making the root blkg unconditional, ec13b1d6f0a0 ("blkcg: always
> create the blkcg_gq for the root blkcg") removed the part which clears
> q->root_blkg and ->root_rl.blkg during q exit.  This leaves the two
> pointers dangling after blkg_destroy_all().  blk-throttle exit path
> performs blkg traversals and dereferences ->root_blkg and can lead to
> the following oops.
> 
>  BUG: unable to handle kernel NULL pointer dereference at 0558
>  IP: [] __blkg_lookup+0x26/0x70
>  ...
>  task: 88001b4e2580 ti: 88001ac0c000 task.ti: 88001ac0c000
>  RIP: 0010:[]  [] __blkg_lookup+0x26/0x70
>  ...
>  Call Trace:
>   [] blk_throtl_drain+0x5a/0x110
>   [] blkcg_drain_queue+0x18/0x20
>   [] __blk_drain_queue+0xc0/0x170
>   [] blk_queue_bypass_start+0x61/0x80
>   [] blkcg_deactivate_policy+0x39/0x100
>   [] blk_throtl_exit+0x38/0x50
>   [] blkcg_exit_queue+0x3e/0x50
>   [] blk_release_queue+0x1e/0xc0
>  ...
> 
> While the bug is a straigh-forward use-after-free bug, it is tricky to
> reproduce because blkg release is RCU protected and the rest of exit
> path usually finishes before RCU grace period.
> 
> This patch fixes the bug by updating blkg_destro_all() to clear
> q->root_blkg and ->root_rl.blkg.
> 
> Signed-off-by: Tejun Heo 
> Reported-by: "Richard W.M. Jones" 
> Reported-by: Josh Boyer 
> Link: 
> http://lkml.kernel.org/g/ca+5pva5rzq0s4723n5rhbcxqa9t0cw8bppbekr_9amrowt2...@mail.gmail.com
> Fixes: ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root blkcg")
> Cc: sta...@vger.kernel.org # v4.2+
> ---
> Hello,
> 
> Richard, can you please verify that this patch fixes the bug?

This patch managed 477 iterations before dying from an unrelated
reason in the test harness.  This is much better than before, so the
patch looks good to me.

Rich.

> Thanks a lot!
> 
>  block/blk-cgroup.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d6283b3..9cc48d1d 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -387,6 +387,9 @@ static void blkg_destroy_all(struct request_queue *q)
>   blkg_destroy(blkg);
>   spin_unlock(&blkcg->lock);
>   }
> +
> + q->root_blkg = NULL;
> + q->root_rl.blkg = NULL;
>  }
>  
>  /*

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: __blkg_lookup oops with 4.2-rcX

2015-09-05 Thread Richard W.M. Jones

On Sat, Sep 05, 2015 at 04:34:39PM +0100, Richard W.M. Jones wrote:
> [   52.259269] BUG: unable to handle kernel NULL pointer dereference at 
> 09c8
> [   52.259269] IP: [] __blkg_lookup+0x40/0xe0

And also:

$ addr2line -e 
/usr/lib/debug/lib/modules/4.3.0-0.rc0.git7.1.rwmj3.fc24.x86_64/vmlinux 
813f8b10
/usr/src/debug/kernel-4.2.fc24/linux-4.3.0-0.rc0.git7.1.rwmj3.fc24.x86_64/block/blk-cgroup.c:158

152 /*
153  * Hint didn't match.  Look up from the radix tree.  Note that 
the
154  * hint can only be updated under queue_lock as otherwise @blkg
155  * could have already been removed from blkg_tree.  The caller 
is
156  * responsible for grabbing queue_lock if @update_hint.
157  */
158 blkg = radix_tree_lookup(&blkcg->blkg_tree, q->id);
159 if (blkg && blkg->q == q) {
160 if (update_hint) {
161 lockdep_assert_held(q->queue_lock);
162 rcu_assign_pointer(blkcg->blkg_hint, blkg);
163 }
164 return blkg;
165 }

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: __blkg_lookup oops with 4.2-rcX

2015-09-05 Thread Richard W.M. Jones

On Fri, Sep 04, 2015 at 09:42:44PM +0100, Richard W.M. Jones wrote:
> On Fri, Sep 04, 2015 at 01:13:02PM -0400, Tejun Heo wrote:
> > The only struct which is large enough for 0xbb8 offset is
> > request_queue.  Hmm can you please try the brute force debug patch
> > below and report the kernel log after the crash?
> 
> So the good(?) news is this bug is not reproducible with the Fedora
> kernel 4.3.0-0.rc0.git7.1.fc24.x86_64 (tested both with and without
> your patch).
> 
> I'll keep it running overnight just in case ..

In fact the bug does happen with 4.3.0-0.rc0.git7.1.fc24.x86_64, but
it's a bit rarer than before.

I'm now using your patch, and the output from the patch is attached.
I hope that helps - let me know if there's anything else you'd want me
to try.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
libguestfs: trace: disk_create "test1.img" "raw" 524288000
libguestfs: trace: disk_create = 0
libguestfs: trace: add_drive "test1.img" "format:raw" "cachemode:unsafe"
libguestfs: trace: add_drive = 0
libguestfs: trace: launch
libguestfs: trace: get_tmpdir
libguestfs: trace: get_tmpdir = "/tmp"
libguestfs: trace: version
libguestfs: trace: version = 
libguestfs: trace: get_backend
libguestfs: trace: get_backend = "libvirt"
libguestfs: launch: program=guestfish
libguestfs: launch: version=1.31.3fedora=24,release=1.fc24,libvirt
libguestfs: launch: backend registered: unix
libguestfs: launch: backend registered: uml
libguestfs: launch: backend registered: libvirt
libguestfs: launch: backend registered: direct
libguestfs: launch: backend=libvirt
libguestfs: launch: tmpdir=/tmp/libguestfsXVD2Sq
libguestfs: launch: umask=0002
libguestfs: launch: euid=1000
libguestfs: libvirt version = 1002019 (1.2.19)
libguestfs: guest random name = guestfs-dnsycku3ayrd2wns
libguestfs: [0ms] connect to libvirt
libguestfs: opening libvirt handle: URI = qemu:///session, auth = 
default+wrapper, flags = 0
libguestfs: successfully opened libvirt handle: conn = 0x5637883f0fc0
libguestfs: qemu version (reported by libvirt) = 2004000 (2.4.0)
libguestfs: [3ms] get libvirt capabilities
libguestfs: [00851ms] parsing capabilities XML
libguestfs: trace: get_backend_setting "force_tcg"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_backend_setting "internal_libvirt_label"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_backend_setting "internal_libvirt_imagelabel"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_backend_setting "internal_libvirt_norelabel_disks"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: [00853ms] build appliance
libguestfs: trace: get_cachedir
libguestfs: trace: get_cachedir = "/var/tmp"
libguestfs: [00853ms] begin building supermin appliance
libguestfs: [00853ms] run supermin
libguestfs: command: run: /usr/bin/supermin
libguestfs: command: run: \ --build
libguestfs: command: run: \ --verbose
libguestfs: command: run: \ --if-newer
libguestfs: command: run: \ --lock /var/tmp/.guestfs-1000/lock
libguestfs: command: run: \ --copy-kernel
libguestfs: command: run: \ -f ext2
libguestfs: command: run: \ --host-cpu x86_64
libguestfs: command: run: \ /usr/lib64/guestfs/supermin.d
libguestfs: command: run: \ -o /var/tmp/.guestfs-1000/appliance.d
supermin: version: 5.1.13
supermin: rpm: detected RPM version 4.13
supermin: package handler: fedora/rpm
supermin: acquiring lock on /var/tmp/.guestfs-1000/lock
supermin: if-newer: output does not need rebuilding
libguestfs: [00886ms] finished building supermin appliance
libguestfs: trace: disk_create "/tmp/libguestfsXVD2Sq/overlay1" "qcow2" -1 
"backingfile:/var/tmp/.guestfs-1000/appliance.d/root" "backingformat:raw"
libguestfs: command: run: qemu-img
libguestfs: command: run: \ create
libguestfs: command: run: \ -f qcow2
libguestfs: command: run: \ -o 
backing_file=/var/tmp/.guestfs-1000/appliance.d/root,backing_fmt=raw
libguestfs: command: run: \ /tmp/libguestfsXVD2Sq/overlay1
Formatting '/tmp/libguestfsXVD2Sq/overlay1', fmt=qcow2 size=4294967296 
backing_file='/var/tmp/.guestfs-1000/appliance.d/root' backing_fmt='raw' 
encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
libguestfs: trace: disk_create = 0
libguestfs: [01025ms] create libvirt XML
libguestfs: command: run: dmesg | grep -Eoh 'lpj=[[:digit:]]+'
libguestfs: read_lpj_from_dmesg: calculated lpj=3515548
libguestfs: trace: get_cachedir
libguestfs: tra

Re: __blkg_lookup oops with 4.2-rcX

2015-09-04 Thread Richard W.M. Jones

On Fri, Sep 04, 2015 at 01:13:02PM -0400, Tejun Heo wrote:
> The only struct which is large enough for 0xbb8 offset is
> request_queue.  Hmm can you please try the brute force debug patch
> below and report the kernel log after the crash?

So the good(?) news is this bug is not reproducible with the Fedora
kernel 4.3.0-0.rc0.git7.1.fc24.x86_64 (tested both with and without
your patch).

I'll keep it running overnight just in case ..

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: __blkg_lookup oops with 4.2-rcX

2015-09-04 Thread Richard W.M. Jones

On Fri, Sep 04, 2015 at 01:13:02PM -0400, Tejun Heo wrote:
> > [6.784689] BUG: unable to handle kernel NULL pointer dereference at 
> > 0bb8
> > [6.787605] IP: [] blk_throtl_drain+0x80/0x220
> 
> The only struct which is large enough for 0xbb8 offset is
> request_queue.  Hmm can you please try the brute force debug patch
> below and report the kernel log after the crash?

I'll test your patch very soon, after I've recompiled the kernel with it.

I just wanted to say that I was working on a better reproducer using a
newer kernel, and now I have got one.  It is this:

  guestfish -v -x < 50%) is attached.

To explain what the command above does:

(1) It creates a 500 MB sparse file called 'test1.img'.

(2) It creates a small, short-lived VM, using qemu, adding 'test1.img'
to the qemu command line.  Setting 'cachemode:unsafe' seems to be
either important, or makes the bug reproduce much more often.  This
corresponds to the qemu option '-drive file=test1.img,cache=unsafe'
and should be below the level that the kernel sees, so should make no
difference.

(3) It boots the VM.

(4) It runs some parted and LVM commands -- see the attached log file
for precisely what commands are run.

(5) The final command (lvremove) fails, with the stacktrace seen in
the attachment.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
libguestfs: trace: disk_create "test1.img" "raw" 524288000
libguestfs: trace: disk_create = 0
libguestfs: trace: add_drive "test1.img" "format:raw" "cachemode:unsafe"
libguestfs: trace: add_drive = 0
libguestfs: trace: launch
libguestfs: trace: get_tmpdir
libguestfs: trace: get_tmpdir = "/tmp"
libguestfs: trace: version
libguestfs: trace: version = 
libguestfs: trace: get_backend
libguestfs: trace: get_backend = "libvirt"
libguestfs: launch: program=guestfish
libguestfs: launch: version=1.31.3fedora=24,release=1.fc24,libvirt
libguestfs: launch: backend registered: unix
libguestfs: launch: backend registered: uml
libguestfs: launch: backend registered: libvirt
libguestfs: launch: backend registered: direct
libguestfs: launch: backend=libvirt
libguestfs: launch: tmpdir=/tmp/libguestfsEfp5po
libguestfs: launch: umask=0002
libguestfs: launch: euid=1000
libguestfs: libvirt version = 1002019 (1.2.19)
libguestfs: guest random name = guestfs-zlr3zvogh5lilmw3
libguestfs: [0ms] connect to libvirt
libguestfs: opening libvirt handle: URI = qemu:///session, auth = 
default+wrapper, flags = 0
libguestfs: successfully opened libvirt handle: conn = 0x557415049ec0
libguestfs: qemu version (reported by libvirt) = 2004000 (2.4.0)
libguestfs: [3ms] get libvirt capabilities
libguestfs: [00670ms] parsing capabilities XML
libguestfs: trace: get_backend_setting "force_tcg"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_backend_setting "internal_libvirt_label"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_backend_setting "internal_libvirt_imagelabel"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: trace: get_backend_setting "internal_libvirt_norelabel_disks"
libguestfs: trace: get_backend_setting = NULL (error)
libguestfs: [00672ms] build appliance
libguestfs: trace: get_cachedir
libguestfs: trace: get_cachedir = "/var/tmp"
libguestfs: [00672ms] begin building supermin appliance
libguestfs: [00672ms] run supermin
libguestfs: command: run: /usr/bin/supermin
libguestfs: command: run: \ --build
libguestfs: command: run: \ --verbose
libguestfs: command: run: \ --if-newer
libguestfs: command: run: \ --lock /var/tmp/.guestfs-1000/lock
libguestfs: command: run: \ --copy-kernel
libguestfs: command: run: \ -f ext2
libguestfs: command: run: \ --host-cpu x86_64
libguestfs: command: run: \ /usr/lib64/guestfs/supermin.d
libguestfs: command: run: \ -o /var/tmp/.guestfs-1000/appliance.d
supermin: version: 5.1.13
supermin: rpm: detected RPM version 4.13
supermin: package handler: fedora/rpm
supermin: acquiring lock on /var/tmp/.guestfs-1000/lock
supermin: if-newer: output does not need rebuilding
libguestfs: [00692ms] finished building supermin appliance
libguestfs: trace: disk_create "/tmp/libguestfsEfp5po/overlay1" "qcow2" -1 
"backingfile:/var/tmp/.guestfs-1000/appliance.d/root" "backingformat:raw"
libguestfs: command: run: qemu-img
libguestfs: command: run: \ create
libguestfs: command: run: \ -f qcow2
libguestfs: command: run: \ -o 
backing_file=/var/tmp/.guestfs-1000/appliance.d/root,backing_fmt=raw
libguestfs: command: run: \ /tmp/libguestfsEfp5po/overlay1
Formatting '/tmp/libguestfsEfp5po/overlay1', fmt=qcow2 size=4294967296 
backing_file='/var/tmp/.guestfs-1000/appliance.d/root' backing_fmt='raw' 
encryption=off cluster_size=65536 lazy_refcounts

Re: __blkg_lookup oops with 4.2-rcX

2015-09-04 Thread Richard W.M. Jones


On Wed, Sep 02, 2015 at 11:32:55AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 02, 2015 at 10:53:07AM -0400, Tejun Heo wrote:
> > On Sun, Aug 30, 2015 at 08:30:41AM -0400, Josh Boyer wrote:
> > I think the offending commit is 776687bce42b ("block, blk-mq: draining
> > can't be skipped even if bypass_depth was non-zero").  It looks like
> > the patch makes shutdown path travel data structure which is already
> > destroyed.  Will post the fix soon.
> 
> Hmm... I can't reproduce it here or see how such oops would happen.
> 
> * Is the problem reproducible on v4.2?  If so, can you please describe
>   the steps to reproduce?  How is cgroup set up?

We have a test suite which does a lot of filesystem and device
operations, and this triggers it randomly (not reliably nor in the
same place every time, but still pretty frequently).

So .. I don't have steps that can reproduce it reliably unfortunately.

However I'm going to work on that now to see if I can create a
sequence of operations that triggers it some or all of the time.

> * Can you please run gdb or addr2line on it and report which line is
>   causing the oops?

Below is another stack trace that I just collected.  It came from a
test that does some hotplugging of a virtual machine.  The kernel this
time is 4.2.0-0.rc3.git4.1.fc24.x86_64 (which is a bit old - am also
going to upgrade to the newest kernel soon).

The addr2line output from this one is:

$ addr2line -e 
/usr/lib/debug/lib/modules/4.2.0-0.rc3.git4.1.fc24.x86_64/vmlinux 
814107a0
/usr/src/debug/kernel-4.1.fc24/linux-4.2.0-0.rc3.git4.1.fc24.x86_64/block/blk-throttle.c:1642

   1636 /*
   1637  * Drain each tg while doing post-order walk on the blkg tree, 
s   1637 o
   1638  * that all bios are propagated to td->service_queue.  It'd be
   1639  * better to walk service_queue tree directly but blkg walk is
   1640  * easier.
   1641  */
   1642 blkg_for_each_descendant_post(blkg, pos_css, 
td->queue->root_blkg)
   1643 tg_drain_bios(&blkg_to_tg(blkg)->service_queue);
   1644 

Rich.

[6.784689] BUG: unable to handle kernel NULL pointer dereference at 
0bb8
[6.787605] IP: [] blk_throtl_drain+0x80/0x220
[6.789797] PGD 0 
[6.790598] Oops:  [#1] SMP 
[6.791848] Modules linked in: kvm_intel kvm snd_pcsp snd_pcm snd_timer snd 
ghash_clmulni_intel soundcore joydev ata_generic serio_raw pata_acpi libcrc32c 
crc8 crc_itu_t crc_ccitt virtio_pci virtio_mmio virtio_input virtio_balloon 
virtio_scsi sym53c8xx scsi_transport_spi megaraid_sas megaraid_mbox megaraid_mm 
megaraid ideapad_laptop rfkill sparse_keymap video virtio_net virtio_gpu ttm 
drm_kms_helper drm virtio_console virtio_rng virtio_blk virtio_ring virtio 
crc32 crct10dif_pclmul crc32c_intel crc32_pclmul
[6.809710] CPU: 0 PID: 27 Comm: kworker/0:1 Not tainted 
4.2.0-0.rc3.git4.1.fc24.x86_64 #1
[6.812650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.8.2-20150714_191134- 04/01/2014
[6.816068] Workqueue: events_freezable virtscsi_handle_event [virtio_scsi]
[6.818588] task: 88001dfb3a00 ti: 88001d09 task.ti: 
88001d09
[6.821252] RIP: 0010:[]  [] 
blk_throtl_drain+0x80/0x220
[6.824302] RSP: :88001d0939d8  EFLAGS: 00010046
[6.826213] RAX:  RBX: 88001b8f6698 RCX: 00e0
[6.828743] RDX: 31e18f88fc458000 RSI:  RDI: 
[6.831292] RBP: 88001d093a08 R08:  R09: 
[6.833835] R10: 88001dfb3a00 R11: 81e58200 R12: 88001ba67200
[6.836380] R13: 88001b8f6698 R14: 88001b9ee1f0 R15: 88001b9ee0d0
[6.838920] FS:  () GS:88001ee0() 
knlGS:
[6.841781] CS:  0010 DS:  ES:  CR0: 8005003b
[6.843838] CR2: 0bb8 CR3: 180c4000 CR4: 06f0
[6.846383] Stack:
[6.847132]  81410756 88001b9ee1f0 88001d093a08 
88001b8f6698
[6.849950]  81ef5320  88001d093a28 
8140d5fd
[6.852746]  88001b8f6698 88001b8f6698 88001d093a58 
813e7839
[6.855562] Call Trace:
[6.856473]  [] ? blk_throtl_drain+0x36/0x220
[6.858581]  [] blkcg_drain_queue+0x2d/0x60
[6.860639]  [] __blk_drain_queue+0xc9/0x1a0
[6.862741]  [] ? blk_queue_bypass_start+0x68/0xb0
[6.865029]  [] blk_queue_bypass_start+0x72/0xb0
[6.867236]  [] blkcg_deactivate_policy+0x39/0x100
[6.869513]  [] cfq_exit_queue+0xd0/0xf0
[6.871481]  [] elevator_exit+0x31/0x50
[6.873423]  [] blk_release_queue+0x4e/0xc0
[6.875495]  [] kobject_release+0x7a/0x190
[6.877524]  [] kobject_put+0x2f/0x60
[6.879413]  [] blk_put_queue+0x15/0x20
[6.881351]  [] 
scsi_device_dev_release_usercontext+0xc4/0x120
[6.884010]  [] ? scsi_device_dev_release+0x20/0x20
[6.886297]  [] execute_in_proce

Re: [BUG 4.2-rc8] Interrupt occurs while apply_alternatives() is patching the handler

2015-09-04 Thread Richard W.M. Jones

On Thu, Sep 03, 2015 at 12:41:47PM +0200, Thomas Gleixner wrote:
> On Thu, 3 Sep 2015, Borislav Petkov wrote:
> > On Wed, Sep 02, 2015 at 11:11:55AM +0200, Thomas Gleixner wrote:
> > >  static void __init_or_module add_nops(void *insns, unsigned int len)
> > >  {
> > > + unsigned long flags;
> > > +
> > > + local_irq_save(flags);
> > >   while (len > 0) {
> > 
> > I guess you want to optimize the len==0 case to not disable interrupts
> > needlessly:
> > 
> > if (!len)
> > return;
> > 
> > local_irq_save(flags);
> > while (len > 0)
> > ...
> 
> Nah. I rather put the local_irq_save into optimize_nops(). All other
> callers of add_nops() are operating on a buffer and use text_poke
> after that. Aside of that optimize_nops() is missing a sync_core().
> 
> Updated patch below.

V2 of the patch managed ~ 5000 iterations overnight without hitting
the problem, so looks like it's a good fix.

Thanks, Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG 4.2-rc8] Interrupt occurs while apply_alternatives() is patching the handler

2015-09-03 Thread Richard W.M. Jones

On Thu, Sep 03, 2015 at 12:41:47PM +0200, Thomas Gleixner wrote:
> On Thu, 3 Sep 2015, Borislav Petkov wrote:
> > On Wed, Sep 02, 2015 at 11:11:55AM +0200, Thomas Gleixner wrote:
> > >  static void __init_or_module add_nops(void *insns, unsigned int len)
> > >  {
> > > + unsigned long flags;
> > > +
> > > + local_irq_save(flags);
> > >   while (len > 0) {
> > 
> > I guess you want to optimize the len==0 case to not disable interrupts
> > needlessly:
> > 
> > if (!len)
> > return;
> > 
> > local_irq_save(flags);
> > while (len > 0)
> > ...
> 
> Nah. I rather put the local_irq_save into optimize_nops(). All other
> callers of add_nops() are operating on a buffer and use text_poke
> after that. Aside of that optimize_nops() is missing a sync_core().
> 
> Updated patch below.

The V2 patch has got to 900 iterations without hitting the problem.
As that is a lot more than without the patch, you can add:

  Tested-by: Richard W.M. Jones 

I will leave it going overnight just in case.

Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG 4.2-rc8] Interrupt occurs while apply_alternatives() is patching the handler

2015-09-03 Thread Richard W.M. Jones

On Wed, Sep 02, 2015 at 08:05:12PM +0100, Richard W.M. Jones wrote:
> On Wed, Sep 02, 2015 at 11:11:55AM +0200, Thomas Gleixner wrote:
> > On Tue, 1 Sep 2015, Richard W.M. Jones wrote:
> > > On Sun, Aug 30, 2015 at 10:37:57PM -0400, Chuck Ebbert wrote:
> > > > This is from https://bugzilla.redhat.com/show_bug.cgi?id=1258223
> > > > 
> > > > [0.036000] BUG: unable to handle kernel paging request at 55501e06
> > > [...]
> > > > [0.036000]  [] ? add_nops+0x90/0xa0
> > > > [0.036000]  [] apply_alternatives+0x274/0x630
> > > > [0.036000]  [] ? wait_for_xmitr+0xa0/0xa0
> > > > [0.036000]  [] ? sprintf+0x1c/0x20
> > > > [0.036000]  [] ? irq_entries_start+0x698/0x698
> > > > [0.036000]  [] ? memcpy+0xb/0x30
> > > > [0.036000]  [] ? serial8250_set_termios+0x20/0x20
> > > [...]
> > > > Interrupt 0x30 occurred while the alternatives code was replacing the
> > > > initial 0x90,0x90,0x90 NOPs (from the ASM_CLAC macro) with the optimized
> > > > version, 0x8d,0x76,0x00. Only the first byte has been replaced so far,
> > > > and it makes a mess out of the insn decoding.
> > 
> > apply_alternatives() has two ways to modify the code:
> > 
> > 1) text_poke_early()
> > 
> > 2) optimize_nops()
> > 
> > The former disables interrupts, the latter not. The patch below should
> > fix the issue.
> 
> It has gone through about 1100 iterations so far without hitting the
> bug.  I'll leave it running overnight.

That ran ~ 4000 iterations overnight, so it seems to work.  You can
add:

  Tested-by: Richard W.M. Jones 

Thanks,

Rich.

> Rich.
> 
> > Thanks,
> > 
> > tglx
> > 
> > diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> > index c42827eb86cf..6a2f93e029f4 100644
> > --- a/arch/x86/kernel/alternative.c
> > +++ b/arch/x86/kernel/alternative.c
> > @@ -257,6 +257,9 @@ void __init arch_init_ideal_nops(void)
> >  /* Use this to add nops to a buffer, then text_poke the whole buffer. */
> >  static void __init_or_module add_nops(void *insns, unsigned int len)
> >  {
> > +   unsigned long flags;
> > +
> > +   local_irq_save(flags);
> > while (len > 0) {
> > unsigned int noplen = len;
> > if (noplen > ASM_NOP_MAX)
> > @@ -265,6 +268,7 @@ static void __init_or_module add_nops(void *insns, 
> > unsigned int len)
> > insns += noplen;
> > len -= noplen;
> > }
> > +   local_irq_restore(flags);
> >  }
> >  
> >  extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
> > 
> > 
> > 
> 
> -- 
> Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
> Read my programming and virtualization blog: http://rwmj.wordpress.com
> virt-builder quickly builds VMs from scratch
> http://libguestfs.org/virt-builder.1.html

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG 4.2-rc8] Interrupt occurs while apply_alternatives() is patching the handler

2015-09-02 Thread Richard W.M. Jones

On Wed, Sep 02, 2015 at 11:11:55AM +0200, Thomas Gleixner wrote:
> On Tue, 1 Sep 2015, Richard W.M. Jones wrote:
> > On Sun, Aug 30, 2015 at 10:37:57PM -0400, Chuck Ebbert wrote:
> > > This is from https://bugzilla.redhat.com/show_bug.cgi?id=1258223
> > > 
> > > [0.036000] BUG: unable to handle kernel paging request at 55501e06
> > [...]
> > > [0.036000]  [] ? add_nops+0x90/0xa0
> > > [0.036000]  [] apply_alternatives+0x274/0x630
> > > [0.036000]  [] ? wait_for_xmitr+0xa0/0xa0
> > > [0.036000]  [] ? sprintf+0x1c/0x20
> > > [0.036000]  [] ? irq_entries_start+0x698/0x698
> > > [0.036000]  [] ? memcpy+0xb/0x30
> > > [0.036000]  [] ? serial8250_set_termios+0x20/0x20
> > [...]
> > > Interrupt 0x30 occurred while the alternatives code was replacing the
> > > initial 0x90,0x90,0x90 NOPs (from the ASM_CLAC macro) with the optimized
> > > version, 0x8d,0x76,0x00. Only the first byte has been replaced so far,
> > > and it makes a mess out of the insn decoding.
> 
> apply_alternatives() has two ways to modify the code:
> 
> 1) text_poke_early()
> 
> 2) optimize_nops()
> 
> The former disables interrupts, the latter not. The patch below should
> fix the issue.

It has gone through about 1100 iterations so far without hitting the
bug.  I'll leave it running overnight.

Rich.

> Thanks,
> 
>   tglx
> 
> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> index c42827eb86cf..6a2f93e029f4 100644
> --- a/arch/x86/kernel/alternative.c
> +++ b/arch/x86/kernel/alternative.c
> @@ -257,6 +257,9 @@ void __init arch_init_ideal_nops(void)
>  /* Use this to add nops to a buffer, then text_poke the whole buffer. */
>  static void __init_or_module add_nops(void *insns, unsigned int len)
>  {
> + unsigned long flags;
> +
> + local_irq_save(flags);
>   while (len > 0) {
>   unsigned int noplen = len;
>   if (noplen > ASM_NOP_MAX)
> @@ -265,6 +268,7 @@ static void __init_or_module add_nops(void *insns, 
> unsigned int len)
>   insns += noplen;
>   len -= noplen;
>   }
> + local_irq_restore(flags);
>  }
>  
>  extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
> 
> 
> 

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [BUG 4.2-rc8] Interrupt occurs while apply_alternatives() is patching the handler

2015-08-31 Thread Richard W.M. Jones

On Sun, Aug 30, 2015 at 10:37:57PM -0400, Chuck Ebbert wrote:
> This is from https://bugzilla.redhat.com/show_bug.cgi?id=1258223
> 
> [0.036000] BUG: unable to handle kernel paging request at 55501e06
[...]
> [0.036000]  [] ? add_nops+0x90/0xa0
> [0.036000]  [] apply_alternatives+0x274/0x630
> [0.036000]  [] ? wait_for_xmitr+0xa0/0xa0
> [0.036000]  [] ? sprintf+0x1c/0x20
> [0.036000]  [] ? irq_entries_start+0x698/0x698
> [0.036000]  [] ? memcpy+0xb/0x30
> [0.036000]  [] ? serial8250_set_termios+0x20/0x20
[...]
> Interrupt 0x30 occurred while the alternatives code was replacing the
> initial 0x90,0x90,0x90 NOPs (from the ASM_CLAC macro) with the optimized
> version, 0x8d,0x76,0x00. Only the first byte has been replaced so far,
> and it makes a mess out of the insn decoding.

Chuck, thanks for reporting this.

I have only been able to reproduce this so far using qemu and TCG (not
KVM) which of course raises a range of questions: could it be a qemu
bug or a TCG bug?  Could it be that an atomic op is not correctly
implemented by qemu?  I will keep trying on KVM.  Because I don't have
a convenient server with 32 bit kernel and a serial port that I can
reboot thousands of times, I have not tried to reproduce on baremetal yet.

Here's how to reproduce it.  (The host can be x86-64)

(1) Grab the 32 bit Fedora kernel we are using from
https://kojipkgs.fedoraproject.org//packages/kernel/4.2.0/1.fc24/i686/kernel-core-4.2.0-1.fc24.i686.rpm
(from http://koji.fedoraproject.org/koji/buildinfo?buildID=681723)

(2) Unpack it to extract vmlinuz:

cd /tmp
rpm2cpio /mnt/scratch/kernel-core-4.2.0-1.fc24.i686.rpm | cpio -id
cp ./lib/modules/4.2.0-1.fc24.i686/vmlinuz .

(3) Boot the kernel under qemu/KVM.  The following single line command
repeatedly boots the kernel until the bug is hit:

while qemu-system-x86_64 -nographic -no-reboot -M accel=kvm:tcg -kernel vmlinuz 
-append 'console=ttyS0 panic=1' -serial stdio -monitor none >& log; ! grep 
add_nops log; do echo -n .; done

It takes many iterations (100s with TCG) to hit the bug.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: __blkg_lookup oops with 4.2-rcX

2015-08-30 Thread Richard W.M. Jones


On Sun, Aug 30, 2015 at 08:30:41AM -0400, Josh Boyer wrote:
> Hi Tejun,
> 
> Mike and Jeff suggested you be informed of the oops one of our
> community members is hitting in Fedora with 4.2-rcX.  I thought they
> had already sent this upstream to you, but apparently they didn't.
> 
> The latest oops is below.  That is with 4.2-rc8.  I believe the first
> report was against a merge window 4.2 kernel.  The full bug report is
> here: https://bugzilla.redhat.com/show_bug.cgi?id=1237136
> 
> I believe Mike and Jeff suspected the cgroup writeback patches.

Thanks Josh.

Also, I can test potential patches if you CC me on them.

Rich.

> josh
> 
> lvm vgchange -a n
>   /run/lvm/lvmetad.socket: connect failed: No such file or directory
>   WARNING: Failed to connect to lvmetad. Falling back to internal scanning.
> [   36.157672] BUG: unable to handle kernel NULL pointer dereference
> at 0558
> [   36.157672] IP: [] __blkg_lookup+0x26/0x70
> [   36.157672] PGD 0
> [   36.157672] Oops:  [#1] SMP
> [   36.157672] Modules linked in: kvm_amd kvm snd_pcsp snd_pcm
> snd_timer snd soundcore serio_raw ata_generic pata_acpi libcrc32c crc8
> crc_itu_t crc_ccitt virtio_pci virtio_mmio virtio_input virtio_balloon
> virtio_scsi sym53c8xx scsi_transport_spi megaraid_sas megaraid_mbox
> megaraid_mm megaraid ideapad_laptop rfkill sparse_keymap video
> virtio_net virtio_gpu ttm drm_kms_helper drm virtio_console virtio_rng
> virtio_blk virtio_ring virtio crc32
> [   36.157672] CPU: 0 PID: 248 Comm: lvm Not tainted
> 4.2.0-0.rc8.git0.1.fc23.x86_64 #1
> [   36.157672] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS 1.8.2-20150714_191134- 04/01/2014
> [   36.157672] task: 88001b4e2580 ti: 88001ac0c000 task.ti:
> 88001ac0c000
> [   36.157672] RIP: 0010:[]  []
> __blkg_lookup+0x26/0x70
> [   36.157672] RSP: 0018:88001ac0fa58  EFLAGS: 0046
> [   36.157672] RAX:  RBX:  RCX: 
> 
> [   36.157672] RDX:  RSI:  RDI: 
> 820176a0
> [   36.157672] RBP: 88001ac0fa78 R08: 88001ac0c000 R09: 
> 88001ce6e800
> [   36.157672] R10: 0002 R11: 0001eaa5 R12: 
> 88001cc57000
> [   36.157672] R13: 88001ccf99c8 R14: 88001ccf9f38 R15: 
> 88001ccba8d8
> [   36.157672] FS:  7f9a21711880() GS:88001f00()
> knlGS:
> [   36.157672] CS:  0010 DS:  ES:  CR0: 8005003b
> [   36.157672] CR2: 0558 CR3: 1b41d000 CR4: 
> 06f0
> [   36.157672] DR0:  DR1:  DR2: 
> 
> [   36.157672] DR3:  DR6:  DR7: 
> 
> [   36.157672] Stack:
> [   36.157672]  82017700 820176a0 88001cc57000
> 88001ccf99c8
> [   36.157672]  88001ac0faa8 8138d14a 88001cc570b8
> 88001ccf99c8
> [   36.157672]  81cb1c20  88001ac0fab8
> 8138a108
> [   36.157672] Call Trace:
> [   36.157672]  [] blk_throtl_drain+0x5a/0x110
> [   36.157672]  [] blkcg_drain_queue+0x18/0x20
> [   36.157672]  [] __blk_drain_queue+0xc0/0x170
> [   36.157672]  [] blk_queue_bypass_start+0x61/0x80
> [   36.157672]  [] blkcg_deactivate_policy+0x39/0x100
> [   36.157672]  [] blk_throtl_exit+0x38/0x50
> [   36.157672]  [] blkcg_exit_queue+0x3e/0x50
> [   36.157672]  [] blk_release_queue+0x1e/0xc0
> [   36.157672]  [] kobject_release+0x7a/0x190
> [   36.157672]  [] kobject_put+0x2f/0x60
> [   36.157672]  [] blk_cleanup_queue+0x111/0x140
> [   36.157672]  [] cleanup_mapped_device+0xdc/0x100
> [   36.157672]  [] __dm_destroy+0x161/0x260
> [   36.157672]  [] dm_destroy+0x13/0x20
> [   36.157672]  [] dev_remove+0x10d/0x170
> [   36.157672]  [] ? dev_suspend+0x280/0x280
> [   36.157672]  [] ctl_ioctl+0x232/0x4d0
> [   36.157672]  [] ? SYSC_semtimedop+0x2b0/0xeb0
> [   36.157672]  [] ? __switch_to+0x261/0x4b0
> [   36.157672]  [] dm_ctl_ioctl+0x13/0x20
> [   36.157672]  [] do_vfs_ioctl+0x295/0x470
> [   36.157672]  [] ? sem_security+0x9/0x10
> [   36.157672]  [] SyS_ioctl+0x79/0x90
> [   36.157672]  [] entry_SYSCALL_64_fastpath+0x12/0x71
> [   36.157672] Code: eb bf 0f 1f 00 66 66 66 66 90 55 48 89 e5 41 55
> 41 54 53 48 83 ec 08 48 8b 87 c8 00 00 00 48 85 c0 74 05 48 39 30 74
> 45 48 89 f3 <48> 63 b6 58 05 00 00 49 89 fd 48 8d bf b8 00 00 00 41 89
> d4 e8
> [   36.157672] RIP  [] __blkg_lookup+0x26/0x70
> [   36.157672]  RSP 
> [   36.157672] CR2: 0558
> [   36.157672] ---[ end trace a6310b2924d6c01e ]---

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.o

Re: [PATCH] arm64: annotate psci invoke functions as notrace

2015-02-24 Thread Richard W.M. Jones

On Wed, Feb 18, 2015 at 12:26:38PM -0500, Kyle McMartin wrote:
> Using GCC 5 to build the kernel with ftrace enabled, we encounter the
> following error as a result of the mcount prologue changing the expected
> register use of the function parameters,
> 
> /tmp/cc8Kpn7A.s: Assembler messages:
> /tmp/cc8Kpn7A.s:41: Error: .err encountered
> /tmp/cc8Kpn7A.s:42: Error: .err encountered
> /tmp/cc8Kpn7A.s:43: Error: .err encountered
> /tmp/cc8Kpn7A.s:101: Error: .err encountered
> /tmp/cc8Kpn7A.s:102: Error: .err encountered
> /tmp/cc8Kpn7A.s:103: Error: .err encountered
> scripts/Makefile.build:257: recipe for target 'arch/arm64/kernel/psci.o' 
> failed
> 
> Fix this by annotating the function as notrace, to suppress the
> generation of profiling prologues and epilogues on the function.
> 
> Signed-off-by: Kyle McMartin 
> 
> --- a/arch/arm64/kernel/psci.c
> +++ b/arch/arm64/kernel/psci.c
> @@ -113,7 +113,7 @@ static void psci_power_state_unpack(u32 power_state,
>   * The following two functions are invoked via the invoke_psci_fn pointer
>   * and will not be inlined, allowing us to piggyback on the AAPCS.
>   */
> -static noinline int __invoke_psci_fn_hvc(u64 function_id, u64 arg0, u64 arg1,
> +static noinline notrace int __invoke_psci_fn_hvc(u64 function_id, u64 arg0, 
> u64 arg1,
>u64 arg2)
>  {
>   asm volatile(
> @@ -128,7 +128,7 @@ static noinline int __invoke_psci_fn_hvc(u64 function_id, 
> u64 arg0, u64 arg1,
>   return function_id;
>  }
>  
> -static noinline int __invoke_psci_fn_smc(u64 function_id, u64 arg0, u64 arg1,
> +static noinline notrace int __invoke_psci_fn_smc(u64 function_id, u64 arg0, 
> u64 arg1,
>u64 arg2)
>  {
>   asm volatile(

I need this patch in order to compile the upstream kernel on aarch64
using gcc 5.  Can it not be added temporarily while the longer term
fix, whatever that is, is worked out?

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC 1/2] target: Add documentation on the target userspace pass-through driver

2014-08-31 Thread Richard W.M. Jones

On Sun, Aug 31, 2014 at 12:49:26PM -0700, Andy Grover wrote:
> Thanks for the feedback. I am undoubtedly too close to the details,
> because I thought I *was* explaining things :)

Yeah, sorry it came across as a bit harsh.

Benoit did explain it to me so I understood it in the end (I think!)

> This doc is for people like you -- tech-savvy but unfamiliar with
> this specific area. Would you be so kind as to point out exactly the
> terms this document should explain? Should it explain SCSI and SCSI
> commands? What a SCSI target is? Say "target implementations" rather
> than "target solutions"? Do I need some ASCII art?

So I can only speak for myself here, but I'm pretty familiar with
iSCSI, using it, and some of the internals -- in fact I'm using the
Linux kernel target daily.

> TCM Userspace Design
> In addition to modularizing the transport protocol used for carrying
> SCSI commands ("fabrics"), the Linux kernel target, LIO, also
> modularizes the actual data storage as well.  These are referred to
> as "backstores" or "storage engines".

Reading this several times, I now think I get what it's trying to say,
but I think it needs to introduces the terms (as the Economist style
does).  Something like this:

  "TCM is the new name for LIO, an in-kernel iSCSI target (server).
  Existing TCM targets run in the kernel.  TCMU (TCM in Userspace)
  allows userspace programs to be written which act as iSCSI targets.
  This document describes the design.

  The existing kernel provides modules for different SCSI transport
  protocols.  TCM also modularizes the data storage.  There are
  existing modules for file, block device, RAM or using another SCSI
  device as storage.  These are called "backstores" or "storage
  engines".  These built-in modules are implemented entirely as kernel
  code."

And hopefully having defined a bit of background, the rest of the
document just flows nicely:

> These backstores cover the most common use cases, but not all. One new
> use case that other non-kernel target solutions, such as tgt, are able
> to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
> target then serves as a translator, allowing initiators to store data
> in these non-traditional networked storage systems, while still only
> using standard protocols themselves.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC 1/2] target: Add documentation on the target userspace pass-through driver

2014-08-30 Thread Richard W.M. Jones

On Tue, Jul 01, 2014 at 12:11:14PM -0700, Andy Grover wrote:
> Describes the driver and its interface to make it possible for user
> programs to back a LIO-exported LUN.
> 
> Signed-off-by: Andy Grover 
> ---
>  Documentation/target/tcmu-design.txt | 210 
> +++
>  1 file changed, 210 insertions(+)
>  create mode 100644 Documentation/target/tcmu-design.txt
> 
> diff --git a/Documentation/target/tcmu-design.txt 
> b/Documentation/target/tcmu-design.txt
> new file mode 100644
> index 000..200ff3e
> --- /dev/null
> +++ b/Documentation/target/tcmu-design.txt
> @@ -0,0 +1,210 @@
> +TCM Userspace Design
> +
> +
> +
> +Background:
> +
> +In addition to modularizing the transport protocol used for carrying
> +SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
> +the actual data storage as well. These are referred to as "backstores"
> +or "storage engines". The target comes with backstores that allow a
> +file, a block device, RAM, or another SCSI device to be used for the
> +local storage needed for the exported SCSI LUN. Like the rest of LIO,
> +these are implemented entirely as kernel code.
> +
> +These backstores cover the most common use cases, but not all. One new
> +use case that other non-kernel target solutions, such as tgt, are able
> +to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
> +target then serves as a translator, allowing initiators to store data
> +in these non-traditional networked storage systems, while still only
> +using standard protocols themselves.
> +
> +If the target is a userspace process, supporting these is easy. tgt,
> +for example, needs only a small adapter module for each, because the
> +modules just use the available userspace libraries for RBD and GLFS.
> +
> +Adding support for these backstores in LIO is considerably more
> +difficult, because LIO is entirely kernel code. Instead of undertaking
> +the significant work to port the GLFS or RBD APIs and protocols to the
> +kernel, another approach is to create a userspace pass-through
> +backstore for LIO, "TCMU".

It has to be said that this documentation is terrible.

Jumping "in medias res"[1] is great for fiction, awful for technical
documentation.

I would recommend the Economist Style Guide[2].  They always say
"Barak Obama, President of the United States" the first time he is
mentioned in an article, even though almost everyone knows who Barak
Obama is.

In this case you're leaping into something .. fabrics, LIO,
backstores, target solutions, ... aargh.  Explain what you mean by
each term and how it all fits together.

Thanks,
Rich.

[1] https://en.wikipedia.org/wiki/In_medias_res

[2] http://www.economist.com/styleguide/introduction

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] block: Fix cloning of discard/write same bios

2014-02-11 Thread Richard W.M. Jones

On Mon, Feb 10, 2014 at 05:45:50PM -0800, Kent Overstreet wrote:
> Immutable biovecs changed the way bio segments are treated in such a way that
> bio_for_each_segment() cannot now do what we want for discard/write same bios,
> since bi_size means something completely different for them.
> 
> Fortunately discard and write same bios never have more than a single biovec, 
> so
> bio_for_each_segment() is unnecessary and not terribly meaningful for them, 
> but
> we still have to special case them in a few places.
> 
> Signed-off-by: Kent Overstreet 

Tested-by: Richard W.M. Jones 

I have confirmed this fixes the bug described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1061339

Rich.

> ---
>  fs/bio.c| 15 ++-
>  include/linux/bio.h | 11 +++
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/bio.c b/fs/bio.c
> index 75c49a3822..8754e7b6eb 100644
> --- a/fs/bio.c
> +++ b/fs/bio.c
> @@ -611,7 +611,6 @@ EXPORT_SYMBOL(bio_clone_fast);
>  struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
>struct bio_set *bs)
>  {
> - unsigned nr_iovecs = 0;
>   struct bvec_iter iter;
>   struct bio_vec bv;
>   struct bio *bio;
> @@ -638,10 +637,7 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
> gfp_mask,
>*__bio_clone_fast() anyways.
>*/
>  
> - bio_for_each_segment(bv, bio_src, iter)
> - nr_iovecs++;
> -
> - bio = bio_alloc_bioset(gfp_mask, nr_iovecs, bs);
> + bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
>   if (!bio)
>   return NULL;
>  
> @@ -650,9 +646,18 @@ struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t 
> gfp_mask,
>   bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
>   bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
>  
> + if (bio->bi_rw & REQ_DISCARD)
> + goto integrity_clone;
> +
> + if (bio->bi_rw & REQ_WRITE_SAME) {
> + bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
> + goto integrity_clone;
> + }
> +
>   bio_for_each_segment(bv, bio_src, iter)
>   bio->bi_io_vec[bio->bi_vcnt++] = bv;
>  
> +integrity_clone:
>   if (bio_integrity(bio_src)) {
>   int ret;
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 70654521da..05d2a0392f 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -250,6 +250,17 @@ static inline unsigned bio_segments(struct bio *bio)
>   struct bio_vec bv;
>   struct bvec_iter iter;
>  
> + /*
> +  * We special case discard/write same, because they interpret bi_size
> +  * differently:
> +  */
> +
> + if (bio->bi_rw & REQ_DISCARD)
> + return 1;
> +
> + if (bio->bi_rw & REQ_WRITE_SAME)
> + return 1;
> +
>   bio_for_each_segment(bv, bio, iter)
>   segs++;
>  
> -- 
> 1.8.5.3

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

2013-12-27 Thread Richard W.M. Jones

On Fri, Dec 27, 2013 at 07:01:48PM +0200, Gleb Natapov wrote:
> On Fri, Dec 27, 2013 at 06:52:48PM +0200, Gleb Natapov wrote:
> > On Fri, Dec 27, 2013 at 03:15:39PM +0100, Kashyap Chamarthy wrote:
> > > [. . .]
> > > 
> > > >> KVM does not emulate P-states at all. intel_pstate_init() calls
> > > >> intel_pstate_msrs_not_valid() before printing "Intel P-state driver
> > > >> initializing."  which suppose to fail since it checks that two reads of
> > > >> MSR_IA32_APERF return different values, but KVM does not emulate this 
> > > >> msr
> > > >> at all, so both calls should return zero (KVM suppose to inject #GP, 
> > > >> all rdmsrl
> > > >> are patched to be rdmsrl_safe in a guest).
> > > >>
> > > >> Anything interesting in host dmesg?
> > > 
> > > Heya Gleb,
> > > 
> > > Here's the relevant dmesg snippet (full dmesg, refer the attachment 
> > > below):
> > That's guest dmesg. What about host one? Can you ftrace the failure?
> > 
> Ugh, it looks like guest dmesg but there are KVM messages there too ("[
> 281.443662] kvm [2452]: vcpu0 unhandled rdmsr: 0xe8" is unhandled access
> to MSR_IA32_APERF I was talking about above), so I guess this is nested
> guest invocation?  Does it happen in non nested guest?

FWIW I could not reproduce the reported bug.  I am also using a guest
for testing, but *not* using nested KVM.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: intel_pstate divide error with v3.13-rc4-256-gb7000ad

2013-12-27 Thread Richard W.M. Jones

Probably the qemu command line is more interesting, which is in this
comment and reproduced below.

https://bugzilla.redhat.com/show_bug.cgi?id=1046317#c1

/usr/bin/qemu-kvm \
-global virtio-blk-pci.scsi=off \
-nodefconfig \
-enable-fips \
-nodefaults \
-display none \
-machine accel=kvm:tcg \
-cpu host,+kvmclock \
-m 500 \
-no-reboot \
-no-hpet \
-kernel /var/tmp/.guestfs-0/kernel.32370 \
-initrd /var/tmp/.guestfs-0/initrd.32370 \
-device virtio-scsi-pci,id=scsi \
-drive 
file=/tmp/libguestfswnVOx7/scratch.1,cache=unsafe,format=raw,id=hd0,if=none \
-device scsi-hd,drive=hd0 \
-drive 
file=/var/tmp/.guestfs-0/root.32370,snapshot=on,id=appliance,cache=unsafe,if=none
 \
-device scsi-hd,drive=appliance \
-device virtio-serial-pci \
-serial stdio \
-device sga \
-chardev socket,path=/tmp/libguestfswnVOx7/guestfsd.sock,id=channel0 \
-device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \
-append 'panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off 
printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 guestfs_verbose=1 
TERM=screen'

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [uml-devel] [PATCH] um: ubd: Add REQ_FLUSH suppport

2013-08-19 Thread Richard W.M. Jones

On Mon, Aug 19, 2013 at 12:13:14PM +0200, Geert Uytterhoeven wrote:
> Hi Richard,
> 
> On Mon, Aug 19, 2013 at 11:18 AM, Richard W.M. Jones  
> wrote:
> > On Sun, Aug 18, 2013 at 12:09:34AM +0200, Richard Weinberger wrote:
> >> UML's block device driver does not support write barriers,
> >> to support this this patch adds REQ_FLUSH suppport.
> >> Every time the block layer sends a REQ_FLUSH we fsync() now
> >> our backing file to guarantee data consistency.
> >
> > This fixes the sync problem I saw before.  So:
> 
> What's the impact on your performance figures?

It just moves the sync into the main process, thus making the
"write-no-upload" test time *increase* to what we expected (the same
as KVM).

This doesn't address the main performance issue, which appears to lie
in the serial port emulation.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] um: ubd: Add REQ_FLUSH suppport

2013-08-19 Thread Richard W.M. Jones

On Sun, Aug 18, 2013 at 12:09:34AM +0200, Richard Weinberger wrote:
> UML's block device driver does not support write barriers,
> to support this this patch adds REQ_FLUSH suppport.
> Every time the block layer sends a REQ_FLUSH we fsync() now
> our backing file to guarantee data consistency.

This fixes the sync problem I saw before.  So:

Tested-by: Richard W.M. Jones 

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [QUERY] lguest64

2013-08-08 Thread Richard W.M. Jones

On Wed, Jul 31, 2013 at 12:39:23PM +0300, Mike Rapoport wrote:
> On Tue, Jul 23, 2013 at 4:28 AM, Rusty Russell  wrote:
> > Yes, the subset of x86-64 machines for which there isn't hardware
> > virtualization support is pretty uninteresting.
> 
> There are plenty virtual machines in EC2, Rackspace, HP and other
> clouds that do not have hardware virtualization. I believe that
> running a hypervisor on them may be pretty interesting.

[Jumping in rather late]

The problem with basing this on lguest is that you would need to
implement a whole lot of stuff from qemu to make lguest really useful
as a modern hypervisor.  eg. qcow2 and a variety of other block
devices, kvmclock, virtio{-scsi,-net}.  Probably more, but just
implementing those will keep you going for a while.  It might also be
feasible to add lguest support to qemu.

However I think it's best to do nothing and use TCG mode in qemu.  TCG
is a bit slower than lguest or UML, but definitely not unusable.  It's
a drop-in replacement for qemu/KVM with all the same features, and it
works today.

We use and support TCG to make libguestfs work on EC2.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] md: Partially revert 2f6db2a7, which broke raid5

2013-05-30 Thread Richard W.M. Jones

On Thu, May 30, 2013 at 09:22:43AM +1000, NeilBrown wrote:
> On Wed, 29 May 2013 15:03:40 +0200 Jens Axboe  wrote:
> 
> > On Wed, May 29 2013, Richard W.M. Jones wrote:
> > > 
> > > On Sun, May 19, 2013 at 10:51:45AM -0700, Kent Overstreet wrote:
> > > > Sorry for the delay - been vacationing. Reproduced the original bug,
> > > > here's a patch that fixes it:
> > > > 
> > > > 
> > > > commit 402f5db3708b2062795a384a3d8397cf702e27bc
> > > > Author: Kent Overstreet 
> > > > Date:   Sun May 19 10:27:07 2013 -0700
> > > > 
> > > > raid5: Initialize bi_vcnt
> > > > 
> > > > The patch that converted raid5 to use bio_reset() forgot to 
> > > > initialize
> > > > bi_vcnt.
> > > > 
> > > > Signed-off-by: Kent Overstreet 
> > > > Cc: NeilBrown 
> > > > Cc: Jens Axboe 
> > > > Cc: linux-r...@vger.kernel.org
> > > > 
> > > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > > index 9359828..753f318 100644
> > > > --- a/drivers/md/raid5.c
> > > > +++ b/drivers/md/raid5.c
> > > > @@ -664,6 +664,7 @@ static void ops_run_io(struct stripe_head *sh, 
> > > > struct stripe_head_state *s)
> > > > if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
> > > > bi->bi_rw |= REQ_FLUSH;
> > > >  
> > > > +   bi->bi_vcnt = 1;
> > > > bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
> > > > bi->bi_io_vec[0].bv_offset = 0;
> > > > bi->bi_size = STRIPE_SIZE;
> > > > @@ -701,6 +702,7 @@ static void ops_run_io(struct stripe_head *sh, 
> > > > struct stripe_head_state *s)
> > > > else
> > > > rbi->bi_sector = (sh->sector
> > > >   + rrdev->data_offset);
> > > > +   rbi->bi_vcnt = 1;
> > > > rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
> > > > rbi->bi_io_vec[0].bv_offset = 0;
> > > > rbi->bi_size = STRIPE_SIZE;
> > > 
> > > Ditto to the previous follow up.  We've been tracking this
> > > bug for nearly a month:
> > > 
> > > https://bugzilla.redhat.com/show_bug.cgi?id=962079
> > > 
> > > Please include this (or the other) patch to fix it.
> > 
> > I'm assuming both Kent and I are waiting for Neil to pick it up. Neil, I
> > can include this in my next round going upstream, just let me know. It
> > should have been sent upstream a while back, sorry guys.
> > 
> 
> Seems you were waiting for me, and I was waiting for you :-)
> 
> Yes: please include it with your next round.  Thanks!

BTW I tested this patch and it works, so:

Tested-by: Richard W.M. Jones 

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] md: Partially revert 2f6db2a7, which broke raid5

2013-05-29 Thread Richard W.M. Jones


On Sun, May 19, 2013 at 10:51:45AM -0700, Kent Overstreet wrote:
> Sorry for the delay - been vacationing. Reproduced the original bug,
> here's a patch that fixes it:
> 
> 
> commit 402f5db3708b2062795a384a3d8397cf702e27bc
> Author: Kent Overstreet 
> Date:   Sun May 19 10:27:07 2013 -0700
> 
> raid5: Initialize bi_vcnt
> 
> The patch that converted raid5 to use bio_reset() forgot to initialize
> bi_vcnt.
> 
> Signed-off-by: Kent Overstreet 
> Cc: NeilBrown 
> Cc: Jens Axboe 
> Cc: linux-r...@vger.kernel.org
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 9359828..753f318 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -664,6 +664,7 @@ static void ops_run_io(struct stripe_head *sh, struct 
> stripe_head_state *s)
>   if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
>   bi->bi_rw |= REQ_FLUSH;
>  
> + bi->bi_vcnt = 1;
>   bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
>   bi->bi_io_vec[0].bv_offset = 0;
>   bi->bi_size = STRIPE_SIZE;
> @@ -701,6 +702,7 @@ static void ops_run_io(struct stripe_head *sh, struct 
> stripe_head_state *s)
>   else
>   rbi->bi_sector = (sh->sector
> + rrdev->data_offset);
> + rbi->bi_vcnt = 1;
>   rbi->bi_io_vec[0].bv_len = STRIPE_SIZE;
>   rbi->bi_io_vec[0].bv_offset = 0;
>   rbi->bi_size = STRIPE_SIZE;

Ditto to the previous follow up.  We've been tracking this
bug for nearly a month:

https://bugzilla.redhat.com/show_bug.cgi?id=962079

Please include this (or the other) patch to fix it.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: Expose loops_per_jiffy through /proc/cpuinfo.

2013-03-01 Thread Richard W.M. Jones

On Fri, Mar 01, 2013 at 04:46:43PM +, Richard W.M. Jones wrote:
> Ignore this patch, it's obviously wrong.  Too late in the afternoon ...

Actually, NOT wrong.  You can't get HZ from userspace, so
exporting loops_per_jiffy like this is necessary.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: Expose loops_per_jiffy through /proc/cpuinfo.

2013-03-01 Thread Richard W.M. Jones


Ignore this patch, it's obviously wrong.  Too late in the afternoon ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] x86: Expose loops_per_jiffy through /proc/cpuinfo.

2013-03-01 Thread Richard W.M. Jones

From: "Richard W.M. Jones" 

When we run the current kernel using qemu with TCG (software emulation
of x86), adding the lpj= option to the guest kernel helps greatly with
clock stability especially when the host is heavily loaded.

Currently the calculated 'lpj=...' argument is printed by the kernel
during boot, but isn't available after boot (eg. if boot messages have
scrolled off the kernel message ring).  It is also not possible to
calculate lpj from available information, especially as non-root.

This adds lpj to /proc/cpuinfo information so it is always available.

Signed-off-by: Richard W.M. Jones 
---
 arch/x86/kernel/cpu/proc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index e280253..bf9c2e8 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -103,6 +103,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
seq_printf(m, "\nbogomips\t: %lu.%02lu\n",
   c->loops_per_jiffy/(50/HZ),
   (c->loops_per_jiffy/(5000/HZ)) % 100);
+   seq_printf(m, "lpj\t\t: %lu\n", c->loops_per_jiffy);
 
 #ifdef CONFIG_X86_64
if (c->x86_tlbsize > 0)
-- 
1.8.1.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 >

1 - 100 of 109 matches

Mail list logo