Re: vmx0: watchdog timeout on queue 2, no interrupts on BSP

2019-07-21 Thread Patrick Kelsey


> On Jul 21, 2019, at 4:17 PM, Andriy Gapon  wrote:
> 
>> On 20/07/2019 20:08, Patrick Kelsey wrote:
>> 
>> 
>> On Fri, Jul 19, 2019 at 10:07 AM Andriy Gapon <a...@freebsd.org> wrote:
>> 
>> 
>>Recently we experienced a strange problem.
>>We noticed a lot of these messages in the logs:
>>vmx0: watchdog timeout on queue 2
>>(always queue 2)
>>Also, we noticed that connections to some end points did not work at all
>>while others worked without problems.  I assume that that was because
>>specific flows got assigned to that queue 2.
>> 
>>Further investigation has shown that none of interrupts assigned to the
>>BSP has ever fired (since boot, of course).  That included vmx0:rx2 and
>>vmx0:tx2.  But also interrupts for other drivers as well.
>> 
>>Trying to get more information I rebooted the system and the problem
>>disappeared.
>> 
>>Has anyone seen anything like that?
>>Any thoughts on possible causes?
>>Any suggestions what to check if/when the problem reoccurs?
>> 
>>Thanks!
>> 
>> 
>> If you are running head at or after r347221 or stable/12 at or after
>> r349112, then this could be due to
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118 (see Comment 4
>> - short story is that an iflib change has broken the vmx driver).
> 
> I am not sure if that bug could lead to all interrupts on the core
> getting disabled (for all drivers), and right at the boot time.

I am not sure either, but it’s the kind of bug that breaks the design of the 
vmx driver in such a way that its state can get corrupted to the point where 
the kernel can panic.  I haven’t fully analyzed the potential scope of memory 
corruption / hardware state corruption that can occur (because the fix for the 
issue is already apparent), so I am freely considering it to include elements 
beyond the device and driver itself.

If you are saying that zero vmx queue interrupts have occurred anywhere in the
system, then I would rule out any connection to this, as a prerequisite for the
corruption to occur is having at least one such interrupt.

-Patrick
___
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


Re: vmx0: watchdog timeout on queue 2, no interrupts on BSP

2019-07-20 Thread Patrick Kelsey
On Fri, Jul 19, 2019 at 10:07 AM Andriy Gapon  wrote:

>
> Recently we experienced a strange problem.
> We noticed a lot of these messages in the logs:
> vmx0: watchdog timeout on queue 2
> (always queue 2)
> Also, we noticed that connections to some end points did not work at all
> while others worked without problems.  I assume that that was because
> specific flows got assigned to that queue 2.
>
> Further investigation has shown that none of interrupts assigned to the
> BSP has ever fired (since boot, of course).  That included vmx0:rx2 and
> vmx0:tx2.  But also interrupts for other drivers as well.
>
> Trying to get more information I rebooted the system and the problem
> disappeared.
>
> Has anyone seen anything like that?
> Any thoughts on possible causes?
> Any suggestions what to check if/when the problem reoccurs?
>
> Thanks!
>
>
If you are running head at or after r347221 or stable/12 at or after r349112,
then this could be due to
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239118 (see Comment 4 -
short story is that an iflib change has broken the vmx driver).

-Patrick


Re: posix_fallocate on ZFS

2018-02-13 Thread Patrick Kelsey
On Mon, Feb 12, 2018 at 12:04 PM, John Baldwin  wrote:

> On Saturday, February 10, 2018 01:46:33 PM Garrett Wollman wrote:
> > In article
> > ,
> > asom...@freebsd.org writes:
> >
> > >On Sat, Feb 10, 2018 at 10:28 AM, Willem Jan Withagen 
> > >wrote:
> >
> > >> Is there any expectation that this is going to be fixed in any near future?
> >
> > >No.  It's fundamentally impossible to support posix_fallocate on a COW
> > >filesystem like ZFS.  Ceph should be taught to ignore an EINVAL result,
> > >since the system call is merely advisory.
> >
> > I don't think it's true that this is _fundamentally_ impossible.  What
> > the standard requires would in essence be a per-object refreservation.
> > ZFS supports refreservation, obviously, but not on a per-object basis.
> > Furthermore, there are mechanisms to preallocate blocks for things
> > like dumps.  So it *could* be done (as in, the concept is there), but
> > it may not be practical.  (And ultimately, there are ways in which the
> > administrator might manage the system that would defeat the desired
> > effect, but that's out of the standard's scope.)  Given the semantic
> > mismatch, though, I suspect it's unreasonable to expect anyone to
> > prioritize implementation of such a feature.
>
> I don't think posix_fallocate() can be compatible with COW.  Suppose you
> do reserve a fixed set of blocks.  That ensures the first write has a
> place to write, but not if you overwrite one of those blocks.  You'd have
> to reserve another block to maintain the reservation each time you wrote
> to a block, or you'd have to have a way to mark a file as not COW.  The
> first case isn't really any better than not using posix_fallocate() in the
> first place as you are still requiring writes to allocate blocks, and the
> second seems a bit fraught with peril as well if the application is
> expecting the non-COW'd file to be in sync with other files in the system
> since presumably non-COW'd files couldn't be snapshotted, etc.
>
>
I think Garrett's assessment that it is not fundamentally impossible, but
may not be felt to be worth implementing in any given file system for
practical reasons, is correct.  I say this having designed/implemented a
COW file system that was driven by customer pressure to do things that at
first pass one might declare represented an architectural contradiction,
but upon further reflection were entirely possible to do given sufficient
willingness to invest the effort and accept the accompanying trade-offs,
additional knobs to turn, etc.

In this case (posix_fallocate() + COW + snapshots), it could be implemented
with a per-object allocator that normally keeps at least one extra block
beyond the reservation requirement on hand, plus a snapshot operation that
in order to succeed has to be able to provision the local allocators of all
fallocated objects with enough additional blocks to maintain the no-fail
write guarantee post-snapshot.
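
As a rough illustration of the accounting described above, here is a toy
model in C.  It is only a sketch under stated assumptions: the struct, the
function names, the fixed MAX_BLOCKS cap and the shared "pool" counter are all
hypothetical and are not taken from ZFS or from any real file system.  It
models an object that keeps one spare block on hand, and a snapshot that
succeeds only if it can top up the object's private reserve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 64   /* arbitrary cap for this toy model */

/* Hypothetical per-object allocator state (not from any real file system). */
struct cow_object {
    size_t nblocks;             /* logical blocks covered by posix_fallocate() */
    size_t spare;               /* free blocks held privately by this object */
    bool pinned[MAX_BLOCKS];    /* current copy of block i is shared with a snapshot */
};

static void
cow_object_init(struct cow_object *o, size_t nblocks)
{
    assert(nblocks <= MAX_BLOCKS);
    o->nblocks = nblocks;
    o->spare = 1;               /* the one extra block kept on hand */
    for (size_t i = 0; i < MAX_BLOCKS; i++)
        o->pinned[i] = false;
}

/* COW overwrite of logical block i; 0 on success, -1 if no block is available. */
static int
cow_overwrite(struct cow_object *o, size_t i)
{
    if (i >= o->nblocks || o->spare == 0)
        return (-1);
    o->spare--;                 /* the new copy comes from the private reserve */
    if (o->pinned[i])
        o->pinned[i] = false;   /* the old copy stays with the snapshot */
    else
        o->spare++;             /* the old copy is freed back to the reserve */
    return (0);
}

/*
 * Snapshot: every current block becomes pinned, so each one needs an extra
 * private block for its first post-snapshot overwrite, plus the usual spare.
 * Returns -1 and changes nothing if the shared pool cannot cover that.
 */
static int
cow_snapshot(struct cow_object *o, size_t *pool_free)
{
    size_t want = o->nblocks + 1;
    size_t need = want > o->spare ? want - o->spare : 0;

    if (need > *pool_free)
        return (-1);
    *pool_free -= need;
    o->spare += need;
    for (size_t i = 0; i < o->nblocks; i++)
        o->pinned[i] = true;
    return (0);
}
```

In this model a 4-block object with one spare needs 4 more blocks from the
pool before a snapshot can be taken; after that, every block can be
overwritten once without the reserve ever dropping below one block.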

-Patrick


Re: CURRENT: FreeBSD not reporting AES-NI on Intel(R) Xeon(R) CPU E5-1650 v3

2017-03-17 Thread Patrick Kelsey
On Fri, Mar 17, 2017 at 1:31 PM, O. Hartmann  wrote:

> Am Fri, 17 Mar 2017 20:07:35 +0300
> Slawa Olhovchenkov  schrieb:
>
> > On Fri, Mar 17, 2017 at 05:53:24PM +0100, O. Hartmann wrote:
> >
> > > Am Fri, 17 Mar 2017 15:04:29 +0300
> > > Slawa Olhovchenkov  schrieb:
> > >
> > > > On Fri, Mar 17, 2017 at 12:36:25PM +0100, O. Hartmann wrote:
> > > >
> > > > > Running recent CURRENT on a Fujitsu Celsius M740 equipped with an Intel(R)
> > > > > Xeon(R) CPU E5-1650 v3 @ 3.50GHz CPU makes me some trouble.
> > > > >
> > > > > FreeBSD does not report the existence or availability of AES-NI feature, which
> > > > > is supposed to be a feature of this type of CPU:
> > > >
> > > > What reasons to detect AES-NI by FreeBSD?
> > >
> > > What do you mean? I do not understand! FreeBSD is supposed to read the CPUID and
> > > therefore the capabilities as every other OS, too. But there may be some circumstances
> > > why FBSD won't. I do not know, that is the reason why I'm asking here.
> >
> > This sample can have AES-NI disabled by the vendor, in the BIOS, for example.
> > As I showed by links, this is possible.
> >
> > CPUID in your example doesn't show AES-NI capabilities; for example, a
> > 1650v4 w/ AES-NI:
> >
> > CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz (3600.07-MHz K8-class CPU)
> >   Origin="GenuineIntel"  Id=0x406f1  Family=0x6  Model=0x4f  Stepping=1
> >   Features=0xbfebfbff APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,
> MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> >   Features2=0x7ffefbff VMX,SMX,EST,TM2,SSSE3,SDBG,FMA,CX16,xTPR,PDCM,PCID,DCA,
> SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,
> OSXSAVE,AVX,F16C,RDRAND>
> >
>
>   ^^
> >   AMD Features=0x2c100800
> >   AMD Features2=0x121
> >   Structured Extended
> > Features=0x21cbfbb BMI2,ERMS,INVPCID,RTM,PQM,NFPUSG,PQE,RDSEED,ADX,SMAP,PROCTRACE>
> > XSAVE Features=0x1 VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,
> VID,PostIntr
> >   TSC: P-state invariant, performance statistics
> >
> > In your sample: "TSCDLT,XSAVE"
> >
> > Maybe AES-NI is disabled by the vendor and FreeBSD correctly shows this. Or there is
> > some bug in FreeBSD, AES-NI works, and other OSes show the AES-NI capability.
> >
>
> We have some LGA1151 XEON based 19 inch rack server, also equipped with Haswell
> E3-12XX-v3 XEONs and FreeBSD, also CURRENT, does show AES-NI.
>
> You're right, the vendor could have disabled AES-NI by intention - but they offered this
> box especially with AES-NI capabilities.
>
> See here:
>
>   http://freebsd.1045724.x6.nabble.com/r285947-broken-AESNI-support-No-aesni0-on-Intel-XEON-E5-1650-v3-on-Fujitsu-Celsius-M740-td6028895.html
>
> I feel a bit pissed off right now due to Fujitsu, because we started testing some
> encrypting features and I'd like to use AES-NI and I run into this issue again.
>
> I need to know that FreeBSD is not the issue with this specific CPU type. I'm still
> frustrated by that stupid comment "UNIX is not supported" I got back when I
> reported the issue to Fujitsu in 2015.
>
>
>
It's pretty straightforward to gain confidence that FreeBSD is not the
issue here.  The 'Features2=' line is printed by printcpuinfo() in
sys/x86/x86/identcpu.c based on the bits set in a variable called
cpu_feature2 (the printf is currently at line 802).  The value of
cpu_feature2 is set in identify_cpu() in identcpu.c (for amd64, currently at
line 1401) based on the result of the cpuid instruction that is executed by
a call to do_cpuid(), which itself resides in sys/amd64/include/cpufunc.h.
In other words, a single asm instruction is executed and the set bits from
the result are printed.
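
To make the "set bits from the result" step concrete, here is a minimal
sketch in C of the check for the AESNI flag.  It does not execute cpuid
itself; it only decodes a Features2 value (CPUID leaf 1, ECX) such as the one
printed at boot.  The function name is hypothetical; the bit position (25) is
the documented AES-NI feature bit, which FreeBSD names CPUID2_AESNI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX bit 25 advertises AES-NI. */
#define AESNI_BIT   (UINT32_C(1) << 25)

static_assert(AESNI_BIT == 0x02000000u, "AES-NI is bit 25 of Features2");

/* Hypothetical helper: true if a Features2 (ECX) value has the AES-NI bit set. */
static bool
features2_has_aesni(uint32_t features2)
{
    return ((features2 & AESNI_BIT) != 0);
}
```

Applied to the Features2=0x7ffefbff value quoted earlier in the thread for the
E5-1650 v4, the helper reports the bit as set, which matches the AESNI entry
in that line.  A machine whose firmware has hidden the feature would simply
return a value with that bit clear.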

Based on some poking around in open source bits (tianocore, coreboot), it
appears that AES-NI is something the BIOS can irreversibly
disable-until-next-reset by twiddling bits in the appropriate MSR
register.  There is no code that does this in FreeBSD on purpose, so there
would have to be a bug introduced in -CURRENT that somehow clobbers those
MSR bits early on - a bug that was also not merged to 11-STABLE (since
Slawa shows AESNI enabled on the same processor under 11-STABLE).

I will also say that I have dealt with a manufacturer of Xeon hardware in
Europe who will not provide a stock BIOS that allows you to enable AES-NI,
out of concerns over violating export/import rules governing encryption
technology.  With that vendor, you have to pass an end-user verification
and then they will make you a custom BIOS that gives you the option to
enable AES-NI.  It took quite some time working through the outer layers of
their 

Re: sysctl -a panic on VIMAGE kernels

2015-08-09 Thread Patrick Kelsey
On Sun, Aug 9, 2015 at 6:36 AM, Gleb Smirnoff <gleb...@freebsd.org> wrote:

> On Sun, Aug 09, 2015 at 12:28:22PM +0200, Kristof Provost wrote:
> K> Hi,
> K>
> K> I’ve run into a reproducible panic on a VIMAGE kernel with ‘sysctl -a’.
> K>
> K> Relevant backtrace bits:
> K> #8  0x80e7dd28 in trap (frame=0xfe01f16b26a0)
> K>     at /usr/src/sys/amd64/amd64/trap.c:426
> K> #9  0x80e5e6a2 in calltrap ()
> K>     at /usr/src/sys/amd64/amd64/exception.S:235
> K> #10 0x80cea67d in uma_zone_get_cur (zone=0x0)
> K>     at /usr/src/sys/vm/uma_core.c:3006
> K> #11 0x80cec029 in sysctl_handle_uma_zone_cur (
> K>     oidp=0x818a7c90, arg1=0xfe00010c0438, arg2=0,
> K>     req=0xfe01f16b2868) at /usr/src/sys/vm/uma_core.c:3580
> K> #12 0x80a28614 in sysctl_root_handler_locked (oid=0x818a7c90,
> K>     arg1=0xfe00010c0438, arg2=0, req=0xfe01f16b2868)
> K>     at /usr/src/sys/kern/kern_sysctl.c:183
> K> #13 0x80a27d70 in sysctl_root (arg1=<value optimized out>,
> K>     arg2=<value optimized out>) at /usr/src/sys/kern/kern_sysctl.c:1694
> K> #14 0x80a28372 in userland_sysctl (td=0x0, name=0xfe01f16b2930,
> K>     namelen=<value optimized out>, old=<value optimized out>,
> K>     oldlenp=<value optimized out>, inkernel=<value optimized out>,
> K>     new=<value optimized out>, newlen=<value optimized out>,
> K>     retval=<value optimized out>, flags=0)
> K>     at /usr/src/sys/kern/kern_sysctl.c:1798
> K> #15 0x80a28144 in sys___sysctl (td=0xf8000b1e49a0,
> K>     uap=0xfe01f16b2a40) at /usr/src/sys/kern/kern_sysctl.c:1724
> K>
> K> In essence, what happens is that we end up in sysctl_handle_uma_zone_cur()
> K> and arg1 is a pointer to NULL, so we call uma_zone_get_cur(zone); with
> K> zone == NULL.
> K>
> K> There’s been a bit of churn around tcp_reass_zone, and I think the latest version is wrong.
> K> It marks the sysctl as CTLFLAG_VNET, but the exposed variable is not VNET_DEFINE().
> K>
> K> The following fixes it for me:
> K>
> K> diff --git a/sys/netinet/tcp_reass.c b/sys/netinet/tcp_reass.c
> K> index 77d8940..3913ef3 100644
> K> --- a/sys/netinet/tcp_reass.c
> K> +++ b/sys/netinet/tcp_reass.c
> K> @@ -84,7 +84,7 @@ SYSCTL_INT(_net_inet_tcp_reass, OID_AUTO, maxsegments, CTLFLAG_RDTUN,
> K>      "Global maximum number of TCP Segments in Reassembly Queue");
> K>
> K>  static uma_zone_t tcp_reass_zone;
> K> -SYSCTL_UMA_CUR(_net_inet_tcp_reass, OID_AUTO, cursegments, CTLFLAG_VNET,
> K> +SYSCTL_UMA_CUR(_net_inet_tcp_reass, OID_AUTO, cursegments, 0,
> K>      &tcp_reass_zone,
> K>      "Global number of TCP Segments currently in Reassembly Queue");
>
> Right, if a variable isn't virtualized, the CTLFLAG_VNET must be removed.
>
> Patrick, how is your progress with improved reassembly?


Kristof, thanks for committing this patch.

Gleb, I expect to have a tcp reassembly patch up for review at some point
this week.

-Patrick

Re: panic: UMA: Increase vm.boot_pages on Dell R920 r279210

2015-05-17 Thread Patrick Kelsey
On Sat, May 2, 2015 at 10:25 PM, Adrian Chadd adr...@freebsd.org wrote:

 hi,

 Hm, should we be upping this limit automatically? Can we get cpu
 counts or memory amount early enough in boot to have a hope of
 auto-tuning?

 64 seems low, 1024 seems high as a default. :)



What is it that's exhausting the boot_pages?  I'm semi-guessing it's the
number of vm radix tree nodes needed for the TiB of memory.  The only thing
I'm aware of (allow for ignorance here) that consumes boot_pages and scales
with the cpu count is the uma zone used for uma cache objects, but on amd64
this zone only needs 640 + cpus * 128 bytes, or about 4 pages for 120
cpus.  vm radix nodes are 144 bytes each on amd64, and by my
back-of-the-envelope calculations (using traces of non-vm-radix boot_page
use from another amd64 system), 64 boot_pages would be exhausted after
about 1000 vm radix nodes were allocated.  It would be interesting to know
how many boot_pages were actually required for this particular system.
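
The arithmetic above can be checked with a few lines of C.  This is only a
back-of-the-envelope sketch: the helper names are hypothetical, the page size
is assumed to be the 4096-byte amd64 base page, and slab header overhead is
ignored, so the radix-node figure is a lower bound rather than what UMA would
really consume.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES          4096u   /* assumed amd64 base page size */
#define VM_RADIX_NODE_BYTES 144u    /* per-node size quoted in the text */

/* Bytes needed by the uma cache zone: 640 + cpus * 128, per the text. */
static size_t
uma_cache_zone_bytes(unsigned cpus)
{
    assert(cpus > 0);
    return (640u + (size_t)cpus * 128u);
}

/* Round a byte count up to whole pages. */
static size_t
bytes_to_pages(size_t bytes)
{
    return ((bytes + PAGE_BYTES - 1) / PAGE_BYTES);
}
```

For 120 cpus, uma_cache_zone_bytes() gives 16000 bytes, which rounds up to 4
pages.  1000 vm radix nodes at 144 bytes each come to 144000 bytes, or 36
pages when packed with no overhead, which is a little over half of the
default 64 boot_pages.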

In any event, since startup_alloc() is designed to exhaust all the
boot_pages before switching to the normal allocators, it doesn't seem
necessarily harmful to err on the high side either in bumping up the static
default or introducing an auto-tuned value (provided the excess is not so
perversely large that startup_alloc() isn't able to make use of an
embarrassment of pages due to zone creation timing and usage patterns).  We
know the number of cpus at the time boot_pages is put to use, but I don't
think we know how much memory there is (and I am even less sure that, even if
we did, we'd really want to try to estimate things like the vm radix tree size
in a generic way).

Something like a default of boot_pages = max(64, 32 + k * cpus) might be
sufficient for k = 4 or 8 (gathering some data points would give a clue
here), and palatable since it is at a minimum the current value that's been
in use, and at the other end approaches a modest commitment of 16 or 32 KiB
per cpu in the worst case (unused and unreclaimed boot_pages with high cpu
count).
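
Written out in C, the proposed default looks like the sketch below.  The
function name is hypothetical and nothing like it exists in the tree; it just
evaluates max(64, 32 + k * cpus) so the suggested values of k can be compared.

```c
#include <assert.h>

/* Hypothetical auto-tuned default: max(64, 32 + k * cpus), in pages. */
static unsigned
boot_pages_default(unsigned cpus, unsigned k)
{
    unsigned scaled;

    assert(cpus > 0);
    scaled = 32u + k * cpus;
    return (scaled > 64u ? scaled : 64u);
}
```

With k = 4 the result stays at the current 64 pages for up to 8 cpus and
grows to 512 pages for 120 cpus; with k = 8 the 120-cpu case gives 992 pages,
close to the 1024 that was set by hand earlier in the thread.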

-Patrick





 On 24 March 2015 at 13:00, Keith White kwh...@site.uottawa.ca wrote:
  On Tue, 24 Mar 2015, Rui Paulo wrote:
 
  On Mar 24, 2015, at 04:19, kwh...@site.uottawa.ca wrote:
 
 
  I'm using /boot/loader.conf. Is there another place I should be doing
  this?
 
 
  No, that's correct, but apparently there's a problem: the RDTUN sysctl
 is
  not picked up early enough.  Can you try this patch?  I haven't really
  tested it. :-)
 
  diff --git a/sys/vm/vm_page.c b/sys/vm/vm_page.c
  index 79665ba..a764788 100644
  --- a/sys/vm/vm_page.c
  +++ b/sys/vm/vm_page.c
  @@ -134,8 +134,9 @@ long first_page;
  int vm_page_zero_count;
 
  static int boot_pages = UMA_BOOT_PAGES;
  -SYSCTL_INT(_vm, OID_AUTO, boot_pages, CTLFLAG_RDTUN, &boot_pages, 0,
  -   "number of pages allocated for bootstrapping the VM system");
  +SYSCTL_INT(_vm, OID_AUTO, boot_pages, CTLFLAG_RDTUN | CTLFLAG_NOFETCH,
  +    &boot_pages, 0,
  +    "number of pages allocated for bootstrapping the VM system");
 
  static int pa_tryrelock_restart;
  SYSCTL_INT(_vm, OID_AUTO, tryrelock_restart, CTLFLAG_RD,
  @@ -349,6 +350,7 @@ vm_page_startup(vm_offset_t vaddr)
  * Allocate memory for use when boot strapping the kernel memory
  * allocator.
  */
  +   TUNABLE_INT_FETCH("vm.boot_pages", &boot_pages);
 new_end = end - (boot_pages * UMA_SLAB_SIZE);
 new_end = trunc_page(new_end);
 mapped = pmap_map(vaddr, new_end, end,
  @@ -443,7 +445,7 @@ vm_page_startup(vm_offset_t vaddr)
 
 
  --
  Rui Paulo
 
 
  Patch tried.  Success!
 
  I now get this after setting vm.boot_pages=1024 in /boot/loader.conf:
 
  Booting...
  GDB: no debug ports present
  KDB: debugger backends: ddb
  KDB: current backend: ddb
  Copyright (c) 1992-2015 The FreeBSD Project.
  Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
  The Regents of the University of California. All rights reserved.
  FreeBSD is a registered trademark of The FreeBSD Foundation.
  FreeBSD 11.0-CURRENT #1: Tue Mar 24 13:44:48 UTC 2015
  root@:/usr/obj/usr/src/sys/GENERIC amd64
  FreeBSD clang version 3.5.1 (tags/RELEASE_351/final 225668) 20150115
  WARNING: WITNESS option enabled, expect reduced performance.
  UMA startup boot_pages: 1024
  ...
 
  And can start all 120 processors.
 
  Thanks!
 
  ...keith
  --
  Keith White, genie.uottawa.ca engineering.uottawa.ca
  kwh...@uottawa.ca [+1 613 562 5800 x6681]

Re: _ftello() modification requires additional capsicum rights, breaking tcpdump and dhclient

2014-09-11 Thread Patrick Kelsey
On Wed, Sep 10, 2014 at 3:00 AM, Andrey Chernov a...@freebsd.org wrote:

 On 09.09.2014 21:53, Patrick Kelsey wrote:
  I don't think it is worth the trouble, as given the larger pattern of
  libc routines requiring multiple capsicum rights, it seems one will in
  general have to have libc implementation knowledge when using it in
  concert with capsicum.  For example, consider the limitfd() routine in
  kdump.c, which provides rights for the TIOCGETA ioctl to be used on
  stdout so the eventual call to isatty() via printf() will work as
  intended.
 
  I think the above kdump example is a good one for the subtle issues that
  can arise when using capsicum with libc.  That call to isatty() is via a
  widely-used internal libc routine __smakebuf().  __smakebuf() also calls
  __swhatbuf(), which in turn calls _fstat(), all to make sure that output
  to a tty is line buffered by default.  It would appear that programs
  that restrict rights on stdout without allowing CAP_IOCTL and CAP_FSTAT
  could be disabling the normally default line buffering when stdout is a
  tty.  kdump goes the distance, but dhclient does not (restricting stdout
  to CAP_WRITE only).
 
  In any event, the patch attached to my first message is seeming like the
  way to go.

 Well, then commit it (if capsicum team agrees).



Will do - thanks for the feedback.

-Patrick


Re: _ftello() modification requires additional capsicum rights, breaking tcpdump and dhclient

2014-09-09 Thread Patrick Kelsey
On Mon, Sep 8, 2014 at 6:00 PM, Andrey Chernov a...@freebsd.org wrote:

 On 09.09.2014 1:13, Patrick Kelsey wrote:
  You make a good point about the wider use of fcntl() in libc - aside
  from the rpc code, by my count there are 14 other entry points in libc
  that use fcntl in their implementation.  To experience breakage,
  programs that use those entry points would also have to be supplying
  them fds with restricted rights that do not include CAP_FCNTL.  By my
  count, there are currently only 12 programs in -current that call
  cap_rights_limit().  I don't think these counts inform us very well as
  to the presence and extent of any capsicum+libc issues similar to the
  one that I've raised.  Those 12 programs mentioned above would have to
  be audited to determine if any of the 15 libc entry points (including
  fcntl) that use fcntl are being used on those restricted fds without
  being granted CAP_FCNTL rights, and whether there are overt or potential
  failures occurring as a result.  Consider that the failure mode in
  tcpdump that I found requires that you be using multiple capture files
  with size-based rotation, otherwise all works fine.  Also consider that
  the failure mode in dhclient only occurs when a rewritten client lease
  file is smaller than its predecessor.

 Just to note by quick glance:
 tcpdump use fdopen(), so in some cases probably already broken without
 F_GETFL rights.
 openssh use fdopen(), so suspicious about F_GETFL too, but I don't
 traverse the order in which fdopen() and cap_rights_* there are applied.


I have now looked at all of the programs in -current that call
cap_rights_limit() (dhclient, hastd, ping, tcpdump, rwhod, ctld, iscsid,
kdump, rwho, units, uniq, and sshd) and examined them to see which file
descriptors cap_rights_limit() is invoked on, with what rights, and whether
libc functions that require fcntl rights (fcntl, fdopendir, fdopen,
freopen, fseek, ftell, popen, lockf, etc) are subsequently used on those
descriptors.  In most cases, the programs are simple and/or the application
of cap_rights_limit() is otherwise limited in scope, and it is easy to see
that they have sufficient rights on the restricted fds for the operations
performed on those fds.  This was a mostly manual inspection, and of course
I may have missed something, but I did not find any further issues related
to insufficient capsicum rights when using libc.

In the case of tcpdump, fdopen() is not used on a file descriptor whose
rights have been restricted via cap_rights_limit().

In the case of openssh, cap_rights_limit() is used by sshd to sandbox the
unprivileged child process when using privilege separation by restricting
the child's stdin, stdout, and stderr, the child's end of the socketpair
used to communicate with the privileged parent and the child's end of the
pipe used to log to the privileged parent.  fdopen() is not used on any of
those descriptors.


  I don't think that this read-only fcntl(F_GETFL) which does not modify
  anything deserves any special rights at all (i.e. can be just enabled by
  default in contrast to F_SETFL), but I am not capsicum expert.
 
  I don't think I am in a position to comment on the implications of
  permanent F_GETFL rights either.  I do think that the point about wider
  use of fcntl(F_GETFL) in libc does argue against making a CAP_FSEEK
  right in sys/capability.h, as it would appear users of capsicum and libc
  are more in need of a map of capsicum rights required by libc entry
  points than they are of convenience #defines.

 Theoretically it will be possible to get rid of fcntl(F_GETFL) in
 fseek(), but O_APPEND flag need to be stored somewhere in that case, and
 stdio _flags already have all bit occupied for 16bit short. So the price
 will be changing size of the main stdio structure __sFILE to add new
 space for flags, which is undesirable I think.


I don't think it is worth the trouble, as given the larger pattern of libc
routines requiring multiple capsicum rights, it seems one will in general
have to have libc implementation knowledge when using it in concert with
capsicum.  For example, consider the limitfd() routine in kdump.c, which
provides rights for the TIOCGETA ioctl to be used on stdout so the eventual
call to isatty() via printf() will work as intended.

I think the above kdump example is a good one for the subtle issues that
can arise when using capsicum with libc.  That call to isatty() is via a
widely-used internal libc routine __smakebuf().  __smakebuf() also calls
__swhatbuf(), which in turn calls _fstat(), all to make sure that output to
a tty is line buffered by default.  It would appear that programs that
restrict rights on stdout without allowing CAP_IOCTL and CAP_FSTAT could be
disabling the normally default line buffering when stdout is a tty.  kdump
goes the distance, but dhclient does not (restricting stdout to CAP_WRITE
only).
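
The buffering decision described above can be approximated with public APIs.
The sketch below is a simplified stand-in for what __smakebuf() does
internally, not the libc code itself, and the function name is hypothetical.
The point it illustrates is that isatty() returns 0 both for a non-terminal
and when the underlying ioctl is refused, so a descriptor without the needed
rights quietly ends up fully buffered.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper mirroring the default stdio policy: line-buffer
 * when the descriptor is a terminal, fully buffer otherwise.  A failed
 * isatty() is indistinguishable here from "not a terminal".
 */
static int
choose_buffer_mode(int fd)
{
    assert(fd >= 0);
    return (isatty(fd) ? _IOLBF : _IOFBF);
}
```

A caller could pass the result to setvbuf(3) as the mode argument.  On a
pipe or regular file the helper always selects _IOFBF; on a terminal whose
descriptor lacks the ioctl right it would presumably do the same, which is
the behaviour change the paragraph above warns about.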

In any event, the patch attached to my first message is seeming like the
way to go.

_ftello() modification requires additional capsicum rights, breaking tcpdump and dhclient

2014-09-08 Thread Patrick Kelsey
In r268997, _ftello() was modified to use _fcntl(F_GETFL) in the
non-append, write-only path.  Consequently, programs that use _ftello()
(via ftell, fgetpos, fsetpos, fseek, rewind...) on non-append, write-only
files and that use capsicum to restrict capabilities on the associated fds
to [CAP_SEEK, CAP_WRITE] broke as all ftell() (and friends) calls on those
files fail with ENOTCAPABLE due to lack of CAP_FCNTL rights.  There appear
to be only two affected programs in the tree - tcpdump and dhclient.  This
affects both CURRENT and 10-STABLE (including 10.1-PRERELEASE).

tcpdump, when configured to write to capture files rotated by size, fails
to rotate and captures indefinitely to the first file in the series.  This
can be reproduced by a command such as:

    tcpdump -i ifname -C 1 -W 2 -w packets -v

By inspection, dhclient will fail to trim old data from its client leases
file when rewriting that file with a lesser amount of data than it
currently contains.  See the ftruncate() call in
dhclient.c:rewrite_client_leases().

The attached patch adds CAP_FCNTL to the limited rights established for
non-append, write-only files used by tcpdump and dhclient.  It also
restricts the fcntl rights to CAP_FCNTL_GETFL.
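
The attached patch is not reproduced in this archive, so the following is
only a sketch of the kind of rights setup described above, not the patch
itself.  The function name is hypothetical.  cap_rights_init(),
cap_rights_limit() and cap_fcntls_limit() are the FreeBSD Capsicum calls; the
header is assumed to be sys/capsicum.h as on newer trees (the thread refers
to sys/capability.h), and treating ENOSYS as success for kernels built
without Capsicum is an assumption about the desired policy.  On other
systems the body compiles to a no-op.

```c
#include <assert.h>
#include <errno.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>   /* assumed header; sys/capability.h on older trees */
#endif

/*
 * Hypothetical helper: limit a write-only, non-append file descriptor to
 * seek, write and fcntl rights, then narrow fcntl to F_GETFL only, so that
 * ftell() and friends keep working.  Returns 0 on success, -1 on failure.
 */
static int
limit_savefile_rights(int fd)
{
    assert(fd >= 0);
#ifdef __FreeBSD__
    cap_rights_t rights;

    cap_rights_init(&rights, CAP_SEEK, CAP_WRITE, CAP_FCNTL);
    if (cap_rights_limit(fd, &rights) < 0 && errno != ENOSYS)
        return (-1);
    if (cap_fcntls_limit(fd, CAP_FCNTL_GETFL) < 0 && errno != ENOSYS)
        return (-1);
#else
    (void)fd;               /* no Capsicum here; nothing to restrict */
#endif
    return (0);
}
```

The second call matters: CAP_FCNTL alone would also permit F_SETFL, whereas
the stdio code path discussed here only needs to read the flags.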

The current need to have CAP_FCNTL rights in order to get or set the file
position on non-append, write-only files is subtle.  Perhaps part of the
answer is to define a CAP_FSEEK right in sys/capability.h that resolves to
CAP_SEEK|CAP_FCNTL, or to modify the CAP_SEEK description in rights(4) to
note the need for CAP_FCNTL when using ftell() and friends.

-Patrick


ftell_cap_rights.patch
Description: Binary data

Re: _ftello() modification requires additional capsicum rights, breaking tcpdump and dhclient

2014-09-08 Thread Patrick Kelsey
On Mon, Sep 8, 2014 at 4:42 PM, Andrey Chernov a...@freebsd.org wrote:

 On 09.09.2014 0:28, Patrick Kelsey wrote:
  In r268997, _ftello() was modified to use _fcntl(F_GETFL) in the
  non-append, write-only path.  Consequently, programs that use _ftello()
  (via ftell, fgetpos, fsetpos, fseek, rewind...) on non-append,
  write-only files and that use capsicum to restrict capabilities on the
  associated fds to [CAP_SEEK, CAP_WRITE] broke as all ftell() (and
  friends) calls on those files fail with ENOTCAPABLE due to lack of
  CAP_FCNTL rights.  There appear to be only two affected programs in the
  tree - tcpdump and dhclient.  This affects both CURRENT and 10-STABLE
  (including 10.1-PRERELEASE)
 
  tcpdump, when configured to write to capture files rotated by size,
  fails to rotate and captures indefinitely to the first file in the
  series.  This can be reproduced by a command such as: tcpdump -i
  ifname -C 1 -W 2 -w packets -v
 
  By inspection, dhclient will fail to trim old data from its client
  leases file when rewriting that file with a lesser amount of data than
  it currently contains.  See the ftruncate() call in
  dhclient.c:rewrite_client_leases().
 
  The attached patch adds CAP_FCNTL to the limited rights established for
  non-append, write-only files used by tcpdump and dhclient.  It also
  restricts the fcntl rights to CAP_FCNTL_GETFL.
 
  The current need to have CAP_FCNTL rights in order to get or set the
  file position on non-append, write-only files is subtle.  Perhaps part
  of the answer is to define a CAP_FSEEK right in sys/capability.h that
  resolves to CAP_SEEK|CAP_FCNTL, or to modify the CAP_SEEK description in
  rights(4) to note the need for CAP_FCNTL when using ftell() and friends.
 
  -Patrick

 Stdio code use fcntl(F_GETFL) already in many places, f.e. fdopen(),
 freopen(). libc code in general use it in rpc code. According to your
 note, all that places are currently broken in anyway.


You make a good point about the wider use of fcntl() in libc - aside from
the rpc code, by my count there are 14 other entry points in libc that use
fcntl in their implementation.  To experience breakage, programs that use
those entry points would also have to be supplying them fds with restricted
rights that do not include CAP_FCNTL.  By my count, there are currently
only 12 programs in -current that call cap_rights_limit().  I don't think
these counts inform us very well as to the presence and extent of any
capsicum+libc issues similar to the one that I've raised.  Those 12
programs mentioned above would have to be audited to determine if any of
the 15 libc entry points (including fcntl) that use fcntl are being used on
those restricted fds without being granted CAP_FCNTL rights, and whether
there are overt or potential failures occurring as a result.  Consider that
the failure mode in tcpdump that I found requires that you be using
multiple capture files with size-based rotation, otherwise all works fine.
Also consider that the failure mode in dhclient only occurs when a
rewritten client lease file is smaller than its predecessor.



 I don't think that this read-only fcntl(F_GETFL) which does not modify
 anything deserves any special rights at all (i.e. can be just enabled by
 default in contrast to F_SETFL), but I am not capsicum expert.


I don't think I am in a position to comment on the implications of
permanent F_GETFL rights either.  I do think that the point about wider use
of fcntl(F_GETFL) in libc does argue against making a CAP_FSEEK right in
sys/capability.h, as it would appear users of capsicum and libc are more in
need of a map of capsicum rights required by libc entry points than they
are of convenience #defines.

-Patrick