Re: 1M routes or 1M arp entries

2017-08-14 Thread Simon Mages
Hi,

you may want to take a look into /etc/login.conf;
see login.conf(5) and cap_mkdb(1).

In this file you can fiddle with your limit maxima
for login classes.
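For example, a minimal sketch (the class name and numbers are only
illustrative, not taken from this thread) that raises the data size
limits for the staff class in /etc/login.conf:

staff:\
	:datasize-cur=4096M:\
	:datasize-max=infinity:\
	:tc=default:

After editing, run cap_mkdb /etc/login.conf if an /etc/login.conf.db
exists, then log in again so the new limits apply; ulimit -d can then be
raised up to the new datasize-max.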

BR
Simon


2017-08-14 16:28 GMT+02:00, Hrvoje Popovski :
> On 14.8.2017. 16:03, Alexander Bluhm wrote:
>> On Mon, Aug 14, 2017 at 03:52:56PM +0200, Hrvoje Popovski wrote:
>>> # netstat -rnf inet
>>> netstat: Cannot allocate memory
>>
>> Have you tried to increase ulimit -d ?
>
> it seems that i can decrease it but not increase it, or i don't know how
> to do it properly :)
>
> # ulimit -d
> 33554432
>
> # ulimit -d 33554433
>
> # ulimit -d
> 33554432
>
>



Re: [PATCH] pcidump - Enhanced Capabilities

2017-05-19 Thread Simon Mages
Yes, this is correct. I missed those two somehow ...


2017-05-17 8:11 GMT+02:00, Jonathan Gray <j...@jsg.id.au>:
> On Thu, Mar 16, 2017 at 03:19:23PM +0100, Simon Mages wrote:
>> Hi,
>>
>> right now i got the chance to play a little bit with PCIe. I read some
>> parts of the spec and was interested in what my PCIe devices can do.
>> I also found out that pcidump cannot display the Enhanced Capabilities.
>>
>> This patch enables pcidump to display them.
>>
>> I did not find a good list of descriptions for the different
>> capabilities; the one i'm using in this patch was taken from the Linux
>> kernel. Is it possible to get a complete list somewhere?
>
> The patch as committed skipped entries in the array so pcidump currently
> prints the wrong strings in some cases.
>
> const char *pci_enhanced_capnames[] = {
>   "Unknown",
>   "Advanced Error Reporting", /* 0x01 */
>   "Virtual Channel Capability",   /* 0x02 */
>   "Device Serial Number", /* 0x03 */
>   "Power Budgeting",  /* 0x04 */
>   "Root Complex Link Declaration",/* 0x05 */
>   "Root Complex Internal Link Control",   /* 0x06 */
>   "Root Complex Event Collector", /* 0x07 */
>   "Multi-Function VC Capability", /* 0x08 */
>   "Virtual Channel Capability",   /* 0x09 */
>   "Root Complex/Root Bridge", /* 0x0a */
>   "Vendor-Specific",  /* 0x0b */
>   "Config Access",/* 0x0c */
>   "Access Control Services",  /* 0x0d */
>   "Alternate Routing ID", /* 0x0e */
>   "Address Translation Services", /* 0x0f */
>   "Single Root I/O Virtualization",   /* 0x10 */
>   "Multi Root I/O Virtualization",/* 0x11 */
>   "Multicast",/* 0x12 */
>   "Page Request Interface",   /* 0x13 */
>   "Reserved for AMD", /* 0x14 */
>   "Resizable BAR",/* 0x15 */
>   "Dynamic Power Allocation", /* 0x16 */
>   "TPH Requester",/* 0x17 */
>   "Latency Tolerance Reporting",  /* 0x18 */
>   "Secondary PCIe Capability",/* 0x19 */
>   "Protocol Multiplexing",/* 0x1a */
>   "Process Address Space ID", /* 0x1b */
>   "Unknown",  /* 0x1c */
>   "Downstream Port Containment",  /* 0x1d */
>   "L1 PM",/* 0x1e */
>   "Precision Time Measurement",   /* 0x1f */
> };
>
> Index: pcidump.c
> ===
> RCS file: /cvs/src/usr.sbin/pcidump/pcidump.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 pcidump.c
> --- pcidump.c 25 Mar 2017 07:33:46 -  1.43
> +++ pcidump.c 17 May 2017 06:06:56 -
> @@ -131,7 +131,9 @@ const char *pci_enhanced_capnames[] = {
>   "Secondary PCIe Capability",
>   "Protocol Multiplexing",
>   "Process Address Space ID",
> + "Unknown",
>   "Downstream Port Containment",
> + "L1 PM",
>   "Precision Time Measurement",
>  };
>
>



Re: [PATCH] pcidump - read and write registers

2017-03-27 Thread Simon Mages
Ah ok, good to know.

Thanks anyway.


2017-03-27 14:57 GMT+02:00, Mark Kettenis <mark.kette...@xs4all.nl>:
>> From: Simon Mages <mages.si...@googlemail.com>
>> Date: Mon, 27 Mar 2017 13:57:54 +0200
>>
>> Hi,
>>
>> right now i use the following diff to poke around in the PCIe config
>> space. This diff enables
>> pcidump to read and write to a register. So far i used this mainly to
>> play with the Advanced
>> Error Reporting Capability some devices have.
>>
>> $ pcidump 4:0:0:104
>>  4:0:0: Broadcom BCM5754
>> 0x0104: 0x0010
>> This bit indicates an "Unsupported Request Error", the register
>> queried here is the
>> "Uncorrectable Error Status Register".
>>
>> # pcidump 4:0:0:104:0x0010
>>  4:0:0: Broadcom BCM5754
>> 0x0104: 0x
>> pcidump shows the new value of the register after writing. By writing
>> a 1 to a status bit it
>> gets reset.
>>
>> I implemented a check for the current securelevel because writing to
>> /dev/pci is only possible
>> for a securelevel smaller than 1.
>>
>> I think this functionality can come in handy for people
>> writing/modifying device drivers.
>
> Sorry, but no.  This is not going to happen.  The kernel interface to
> write to pci config space is only there to support X.  And even that
> support is likely to disappear at some point.
>
>> Index: pcidump.8
>> ===
>> --- pcidump.816 Jul 2013 11:13:34 -  1.12
>> +++ pcidump.827 Mar 2017 11:27:35 -
>> @@ -26,7 +26,7 @@
>>  .Op Fl x | xx | xxx
>>  .Op Fl d Ar pcidev
>>  .Sm off
>> -.Op Ar bus : dev : func
>> +.Op Ar bus : dev : func [ : reg [ : val ] ]
>>  .Sm on
>>  .Nm pcidump
>>  .Fl r Ar file
>> @@ -69,16 +69,29 @@ Shows a hexadecimal dump of the full PCI
>>  Shows a hexadecimal dump of the PCIe extended config space.
>>  .It Xo
>>  .Sm off
>> -.Ar bus : dev : func
>> +.Ar bus : dev : func [ : reg [ : val ] ]
>>  .Sm on
>>  .Xc
>>  Show information about the PCI device specified by the tuple given on
>> -the command line.
>> +the command line. If
>> +.Pa reg
>> +is used, the value of this register in the configuration space of
>> +.Pa func
>> +gets printed. If
>> +.Pa val
>> +is used, the register specified by
>> +.Pa reg
>> +will be loaded with the value specified by
>> +.Pa val .
>>  If the
>>  .Fl d
>>  option is not given,
>>  .Pa /dev/pci
>>  is used.
>> +.It Xo
>> +.Xc
>> +The configuration space can only be written in a securelevel(7) lower
>> +than 1.
>>  .El
>>  .Sh FILES
>>  .Bl -tag -width /dev/pci* -compact
>> @@ -86,7 +99,8 @@ is used.
>>  Device files for accessing PCI domains.
>>  .El
>>  .Sh SEE ALSO
>> -.Xr pci 4
>> +.Xr pci 4 ,
>> +.Xr securelevel 7
>>  .Sh HISTORY
>>  The
>>  .Nm
>> Index: pcidump.c
>> ===
>> --- pcidump.c25 Mar 2017 07:33:46 -  1.43
>> +++ pcidump.c27 Mar 2017 11:24:10 -
>> @@ -19,6 +19,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>
>>  #include   /* need NULL for  */
>>
>> @@ -37,19 +39,27 @@
>>
>>  #define PCIDEV  "/dev/pci"
>>
>> +#define PCI_CONFIG_SPACE_BEGIN  0x0
>> +#define PCIE_CONFIG_SPACE_END   (PCIE_CONFIG_SPACE_SIZE - 1)
>> +#define PCI_CONFIG_ALIGNMENT0x4
>> +#define REG_ALIGNMENT_OK(x) ((x) % PCI_CONFIG_ALIGNMENT ? 0 : 1)
>> +
>>  #ifndef nitems
>>  #define nitems(_a)  (sizeof((_a)) / sizeof((_a)[0]))
>>  #endif
>>
>>  __dead void usage(void);
>> +int get_securelevel(void);
>>  void scanpcidomain(void);
>> -int probe(int, int, int);
>> +int probe(int, int, int, int, int);
>> +void chreg(int, int, int, int, int);
>>  void dump(int, int, int);
>>  void hexdump(int, int, int, int);
>> -const char *str2busdevfunc(const char *, int *, int *, int *);
>> +const char *str2busdevfunc(const char *, int *, int *, int *, int *, int
>> *);
>>  int pci_nfuncs(int, int);
>>  int pci_read(int, int, int, u_int32_t, u_int32_t *);
>>  int pci_readmask(int, int, int, u_int32_t, u_int32_t *);
>> +int pci_write(int, int, int, u_int32_t, u_int32_t);
>>  void dump_caplist(int, int, int, u_int8_t);
>> 

[PATCH] pcidump - read and write registers

2017-03-27 Thread Simon Mages
Hi,

right now i use the following diff to poke around in the PCIe config
space. This diff enables pcidump to read and write to a register. So far
i used this mainly to play with the Advanced Error Reporting Capability
some devices have.

$ pcidump 4:0:0:104
 4:0:0: Broadcom BCM5754
0x0104: 0x0010
This bit indicates an "Unsupported Request Error"; the register queried
here is the "Uncorrectable Error Status Register".

# pcidump 4:0:0:104:0x0010
 4:0:0: Broadcom BCM5754
0x0104: 0x
pcidump shows the new value of the register after writing. By writing
a 1 to a status bit it gets reset.

I implemented a check for the current securelevel because writing to
/dev/pci is only possible
for a securelevel smaller than 1.

I think this functionality can come in handy for people
writing/modifying device drivers.
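The diff below only declares get_securelevel(); its body is not part of
the quoted excerpt. As a rough sketch of what such a helper could look
like (an assumption, not necessarily what the patch actually does),
reading kern.securelevel via sysctl(3) would be enough:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stddef.h>

int
get_securelevel(void)
{
	int mib[2] = { CTL_KERN, KERN_SECURELVL };
	int level;
	size_t len = sizeof(level);

	/* report -1 on error; how the caller treats that is up to the patch */
	if (sysctl(mib, 2, &level, &len, NULL, 0) == -1)
		return -1;
	return level;
}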

Index: pcidump.8
===
--- pcidump.8   16 Jul 2013 11:13:34 -  1.12
+++ pcidump.8   27 Mar 2017 11:27:35 -
@@ -26,7 +26,7 @@
 .Op Fl x | xx | xxx
 .Op Fl d Ar pcidev
 .Sm off
-.Op Ar bus : dev : func
+.Op Ar bus : dev : func [ : reg [ : val ] ]
 .Sm on
 .Nm pcidump
 .Fl r Ar file
@@ -69,16 +69,29 @@ Shows a hexadecimal dump of the full PCI
 Shows a hexadecimal dump of the PCIe extended config space.
 .It Xo
 .Sm off
-.Ar bus : dev : func
+.Ar bus : dev : func [ : reg [ : val ] ]
 .Sm on
 .Xc
 Show information about the PCI device specified by the tuple given on
-the command line.
+the command line. If
+.Pa reg
+is used, the value of this register in the configuration space of
+.Pa func
+gets printed. If
+.Pa val
+is used, the register specified by
+.Pa reg
+will be loaded with the value specified by
+.Pa val .
 If the
 .Fl d
 option is not given,
 .Pa /dev/pci
 is used.
+.It Xo
+.Xc
+The configuration space can only be written in a securelevel(7) lower
+than 1.
 .El
 .Sh FILES
 .Bl -tag -width /dev/pci* -compact
@@ -86,7 +99,8 @@ is used.
 Device files for accessing PCI domains.
 .El
 .Sh SEE ALSO
-.Xr pci 4
+.Xr pci 4 ,
+.Xr securelevel 7
 .Sh HISTORY
 The
 .Nm
Index: pcidump.c
===
--- pcidump.c   25 Mar 2017 07:33:46 -  1.43
+++ pcidump.c   27 Mar 2017 11:24:10 -
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 #include  /* need NULL for  */

@@ -37,19 +39,27 @@

 #define PCIDEV "/dev/pci"

+#define PCI_CONFIG_SPACE_BEGIN 0x0
+#define PCIE_CONFIG_SPACE_END  (PCIE_CONFIG_SPACE_SIZE - 1)
+#define PCI_CONFIG_ALIGNMENT   0x4
+#define REG_ALIGNMENT_OK(x)((x) % PCI_CONFIG_ALIGNMENT ? 0 : 1)
+
 #ifndef nitems
 #define nitems(_a) (sizeof((_a)) / sizeof((_a)[0]))
 #endif

 __dead void usage(void);
+int get_securelevel(void);
 void scanpcidomain(void);
-int probe(int, int, int);
+int probe(int, int, int, int, int);
+void chreg(int, int, int, int, int);
 void dump(int, int, int);
 void hexdump(int, int, int, int);
-const char *str2busdevfunc(const char *, int *, int *, int *);
+const char *str2busdevfunc(const char *, int *, int *, int *, int *, int *);
 int pci_nfuncs(int, int);
 int pci_read(int, int, int, u_int32_t, u_int32_t *);
 int pci_readmask(int, int, int, u_int32_t, u_int32_t *);
+int pci_write(int, int, int, u_int32_t, u_int32_t);
 void dump_caplist(int, int, int, u_int8_t);
 void dump_pci_powerstate(int, int, int, uint8_t);
 void dump_pcie_linkspeed(int, int, int, uint8_t);
@@ -67,7 +77,8 @@ usage(void)
extern char *__progname;

fprintf(stderr,
-   "usage: %s [-v] [-x | -xx | -xxx] [-d pcidev] [bus:dev:func]\n"
+   "usage: %s [-v] [-x | -xx | -xxx] [-d pcidev]"
+   " [bus:dev:func[:reg[:val]]]\n"
"   %s -r file [-d pcidev] bus:dev:func\n",
__progname, __progname);
exit(1);
@@ -139,7 +150,7 @@ int
 main(int argc, char *argv[])
 {
int nfuncs;
-   int bus, dev, func;
+   int bus, dev, func, reg = -1, val = -1;
char pcidev[PATH_MAX] = PCIDEV;
char *romfile = NULL;
const char *errstr;
@@ -186,7 +197,10 @@ main(int argc, char *argv[])
dumpall = 0;

if (dumpall == 0) {
-   pcifd = open(pcidev, O_RDONLY, 0777);
+   if (get_securelevel() < 1)
+   pcifd = open(pcidev, O_RDWR, 0777);
+   else
+   pcifd = open(pcidev, O_RDONLY, 0777);
if (pcifd == -1)
err(1, "%s", pcidev);
} else {
@@ -207,7 +221,7 @@ main(int argc, char *argv[])
}

if (argc == 1) {
-   errstr = str2busdevfunc(argv[0], , , );
+   errstr = str2busdevfunc(argv[0], , , , , );
if (errstr != NULL)
errx(1, "\"%s\": %s", argv[0], errstr);

@@ -217,7 +231,7 @@ main(int argc, char *argv[])
else if (romfile)
error = dump_rom(bus, dev, func);
else
- 

[PATCH] pcidump - Enhanced Capabilities

2017-03-23 Thread Simon Mages
Hi,

on some machines i saw some unknown enhanced capabilities. After
looking into it i saw that on some Intel chipsets there actually is a
capability with id 0x0. This capability contains some registers of the
Advanced Error Reporting Capability, but not all of them. I guess Intel
chose 0x0 instead of 0x1 because their implementation does not contain
all of the minimal Advanced Error Reporting registers.

Anyway, i think it makes sense to print the enhanced capability id,
even if it is not in the list. This way one does not have to look at
the hexdump of pcidump -xxx to figure out which capability id the
unknown capability has.

Index: usr.sbin/pcidump/pcidump.c
===
--- pcidump.c   16 Mar 2017 22:05:46 -  1.42
+++ pcidump.c   23 Mar 2017 15:12:07 -
@@ -392,6 +392,7 @@ void
 dump_pcie_enhanced_caplist(int bus, int dev, int func)
 {
u_int32_t reg;
+   u_int32_t capidx;
u_int16_t ptr;
u_int16_t ecap;

@@ -407,10 +408,12 @@ dump_pcie_enhanced_caplist(int bus, int

ecap = PCI_PCIE_ECAP_ID(reg);
if (ecap >= nitems(pci_enhanced_capnames))
-   ecap = 0;
+   capidx = 0;
+   else
+   capidx = ecap;

printf("\t0x%04x: Enhanced Capability 0x%02x: ", ptr, ecap);
-   printf("%s\n", pci_enhanced_capnames[ecap]);
+   printf("%s\n", pci_enhanced_capnames[capidx]);

ptr = PCI_PCIE_ECAP_NEXT(reg);


According to Rev. 3.0 of the PCIe spec, the bottom two bits of the next
capability offset are reserved for future use. I do not have access to
any spec revision newer than 3.0.

Index: dev/pci/pcireg.h
===
--- dev/pci/pcireg.h22 Mar 2017 07:21:39 -  1.52
+++ dev/pci/pcireg.h23 Mar 2017 13:36:09 -
@@ -606,7 +606,7 @@ typedef u_int8_t pci_revision_t;
 #define PCI_PCIE_ECAP  0x100
 #definePCI_PCIE_ECAP_ID(x) (((x) & 0x))
 #define PCI_PCIE_ECAP_VER(x)   (((x) >> 16) & 0x0f)
-#definePCI_PCIE_ECAP_NEXT(x)   ((x) >> 20)
+#definePCI_PCIE_ECAP_NEXT(x)   (((x) >> 20) & 0xffc)
 #define PCI_PCIE_ECAP_LAST 0x0

 /*
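To make the effect of the new mask concrete, here is a small standalone
C sketch (only an illustration; the macros are copied from the pcireg.h
diff above and the header value is made up):

#include <stdio.h>
#include <stdint.h>

#define PCI_PCIE_ECAP_ID(x)	((x) & 0xffff)
#define PCI_PCIE_ECAP_VER(x)	(((x) >> 16) & 0x0f)
#define PCI_PCIE_ECAP_NEXT(x)	(((x) >> 20) & 0xffc)	/* mask the two reserved bits */

int
main(void)
{
	/* hypothetical header: next offset 0x141 (reserved bits set), version 1, ID 0x01 (AER) */
	uint32_t hdr = (0x141u << 20) | (0x1u << 16) | 0x0001u;

	/* prints "id 0x0001 ver 1 next 0x140": the reserved bits are dropped,
	 * so the capability walk stays dword aligned instead of wandering off */
	printf("id 0x%04x ver %u next 0x%03x\n",
	    (unsigned)PCI_PCIE_ECAP_ID(hdr),
	    (unsigned)PCI_PCIE_ECAP_VER(hdr),
	    (unsigned)PCI_PCIE_ECAP_NEXT(hdr));
	return 0;
}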



[PATCH] pcidump - Enhanced Capabilities

2017-03-16 Thread Simon Mages
Hi,

right now i got the chance to play a little bit with PCIe. I read some
parts of the spec and was interested in what my PCIe devices can do.
I also found out that pcidump cannot display the Enhanced Capabilities.

This patch enables pcidump to display them.

I did not find a good list of descriptions for the different
capabilities; the one i'm using in this patch was taken from the Linux
kernel. Is it possible to get a complete list somewhere?

Here is an example output:
# pcidump -v
Domain /dev/pci0:
 0:0:0: Intel 82Q33 Host
0x: Vendor ID: 8086 Product ID: 29d0
0x0004: Command: 0006 Status: 2090
0x0008: Class: 06 Subclass: 00 Interface: 00 Revision: 02
0x000c: BIST: 00 Header Type: 00 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty ()
0x0014: BAR empty ()
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR empty ()
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1734 Product ID: 10fc
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 00 Line: 00 Min Gnt: 00 Max Lat: 00
0x00e0: Capability 0x09: Vendor Specific
 0:1:0: Intel 82Q33 PCIE
0x: Vendor ID: 8086 Product ID: 29d1
0x0004: Command: 0104 Status: 0010
0x0008: Class: 06 Subclass: 04 Interface: 00 Revision: 02
0x000c: BIST: 00 Header Type: 01 Latency Timer: 00 Cache Line Size: 08
0x0010: 
0x0014: 
0x0018: Primary Bus: 0 Secondary Bus: 1 Subordinate Bus: 1
Secondary Latency Timer: 00
0x001c: I/O Base: f0 I/O Limit: 00 Secondary Status: 
0x0020: Memory Base: fff0 Memory Limit: 
0x0024: Prefetch Memory Base: fff1 Prefetch Memory Limit: 0001
0x0028: Prefetch Memory Base Upper 32 Bits: 
0x002c: Prefetch Memory Limit Upper 32 Bits: 
0x0030: I/O Base Upper 16 Bits:  I/O Limit Upper 16 Bits: 
0x0038: Expansion ROM Base Address: 
0x003c: Interrupt Pin: 01 Line: 0b Bridge Control: 
0x0088: Capability 0x0d: PCI-PCI
0x0080: Capability 0x01: Power Management
State: D0
0x0090: Capability 0x05: Message Signalled Interrupts (MSI)
0x00a0: Capability 0x10: PCI Express
Link Speed: 2.5 / 2.5 GT/s Link Width: x0 / x16
0x0100: Enhanced Capability 0x02: Virtual Channel Capability
0x0140: Enhanced Capability 0x05: Root Complex Link Declaration
 0:2:0: Intel 82Q33 Video
0x: Vendor ID: 8086 Product ID: 29d2
0x0004: Command: 0007 Status: 0090
0x0008: Class: 03 Subclass: 00 Interface: 00 Revision: 02
0x000c: BIST: 00 Header Type: 00 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR mem 32bit addr: 0xd010/0x0008
0x0014: BAR io addr: 0x1c40/0x0008
0x0018: BAR mem prefetchable 32bit addr: 0xe000/0x1000
0x001c: BAR mem 32bit addr: 0xd000/0x0010
0x0020: BAR empty ()
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1734 Product ID: 10fc
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 01 Line: 0b Min Gnt: 00 Max Lat: 00
0x0090: Capability 0x05: Message Signalled Interrupts (MSI)
0x00d0: Capability 0x01: Power Management
State: D0
 0:26:0: Intel 82801I USB
0x: Vendor ID: 8086 Product ID: 2937
0x0004: Command: 0005 Status: 0290
0x0008: Class: 0c Subclass: 03 Interface: 00 Revision: 02
0x000c: BIST: 00 Header Type: 80 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty ()
0x0014: BAR empty ()
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR io addr: 0x1820/0x0020
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1734 Product ID: 10fd
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 01 Line: 0a Min Gnt: 00 Max Lat: 00
0x0050: Capability 0x13: PCI Advanced Features
 0:26:1: Intel 82801I USB
0x: Vendor ID: 8086 Product ID: 2938
0x0004: Command: 0005 Status: 0290
0x0008: Class: 0c Subclass: 03 Interface: 00 Revision: 02
0x000c: BIST: 00 Header Type: 00 Latency Timer: 00 Cache Line Size: 00
0x0010: BAR empty ()
0x0014: BAR empty ()
0x0018: BAR empty ()
0x001c: BAR empty ()
0x0020: BAR io addr: 0x1840/0x0020
0x0024: BAR empty ()
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1734 Product ID: 10fd

Re: add support for multiple transmit queues on interfaces

2017-01-27 Thread Simon Mages
I did some tests.

The performance did not change.
I think this is the expected behaviour.

BR
Simon

2017-01-23 7:35 GMT+01:00, David Gwynne :
> hrvoje popovski hit a problem where the kernel would panic under load.
>
> i mistakenly called an interfaces qstart routine directly from
> if_enqueue rather than via the ifq serializer. this meant that txeof
> routines on network cards calling ifq_restart would cause the start
> routine to run concurrently, therefore causing corruption of the
> ring state.
>
> this diff fixes that.
>
> On Mon, Jan 23, 2017 at 01:09:57PM +1000, David Gwynne wrote:
>> the short explanation is that this lets interfaces allocate multiple
>> ifq structures that can be mapped to their transmit rings. the
>> mechanism for this is a driver calling if_attach_queues() after
>> theyve called if_attach().
>>
>> the long version is that this has if_enqueue access an array of
>> ifqueues on the interface instead of if_snd directly. the ifq is
>> picked by asking the queue discipline (priq or hfsc) to map an mbuf
>> to a slot in the if_ifqs array.
>>
>> to notify the driver that a particular queue needs to start ive
>> added a new function pointer to ifnet called if_qstart. if_qstart
>> takes an ifqueue * as an argument instead of an ifnet *, thereby
>> getting past the implicit behaviour that interfaces only have a
>> single ring.
>>
>> our drivers all have if_start routines that take ifnet pointers
>> though, so there's compatability for those where a default if_qstart
>> implementation calls if_start for those drivers. in the future
>> if_start will be replaced with if_qstart and we can rename it back
>> to if_start. until then, there's compat.
>>
>> drivers that provide their own if_qstart instead of an if_start
>> function notify the stack by setting IFXF_MPSAFE. a chunk of this
>> diff is changing the IFXF_MPSAFE drivers to set if_qstart instead
>> of if_start. note that this is a mechanical change, it does not add
>> multiple tx queues to these drivers.
>>
>> most of this is straightforward except for the hfsc handling. hfsc
>> needs to track all flows going over an interface, which means all
>> flows have to be serialised through hfsc. the mechanism in use
>> before this change was to swap the priq backend on if_snd with the
>> hfsc backend. the trick with this diff is that we still do that,
>> ie, we only change the first ifqueue on an interface over to hfsc.
>> this works because we use the ifqops on the first ifq to map packets
>> to any of them. because the hfsc map function unconditionally maps
>> packets to the first ifq, all packets end up going through the one
>> hfsc structure we set up. the rest of the ifqs remain set up as
>> priq, but dont get used for sending packets after hfsc has been
>> enabled. if we ever add another ifqops backend, this will have to
>> be rethought. until then this is an elegant hack.
>>
>> a consequence of this change is that we the ifnet if_start function
>> should not be called anymore. this isnt true at the moment because
>> of things like net80211 and ppp. they both queue management packets
>> onto a separate queue, but those separate queues are dequeued and
>> processed in the interfaces start routine. if we want to mark wifi
>> and ppp drivers as mpsafe (or get rid of separate if_start and
>> if_qstart routines) this will have to change.
>>
>> the guts of this change are in if_enqueue and if_attach_queues.
>>
>> ok?
>>
>
> Index: arch/octeon/dev/if_cnmac.c
> ===
> RCS file: /cvs/src/sys/arch/octeon/dev/if_cnmac.c,v
> retrieving revision 1.61
> diff -u -p -r1.61 if_cnmac.c
> --- arch/octeon/dev/if_cnmac.c5 Nov 2016 05:14:18 -   1.61
> +++ arch/octeon/dev/if_cnmac.c23 Jan 2017 06:32:59 -
> @@ -138,7 +138,7 @@ int   octeon_eth_ioctl(struct ifnet *, u_l
>  void octeon_eth_watchdog(struct ifnet *);
>  int  octeon_eth_init(struct ifnet *);
>  int  octeon_eth_stop(struct ifnet *, int);
> -void octeon_eth_start(struct ifnet *);
> +void octeon_eth_start(struct ifqueue *);
>
>  int  octeon_eth_send_cmd(struct octeon_eth_softc *, uint64_t, uint64_t);
>  uint64_t octeon_eth_send_makecmd_w1(int, paddr_t);
> @@ -303,7 +303,7 @@ octeon_eth_attach(struct device *parent,
>   ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
>   ifp->if_xflags = IFXF_MPSAFE;
>   ifp->if_ioctl = octeon_eth_ioctl;
> - ifp->if_start = octeon_eth_start;
> + ifp->if_qstart = octeon_eth_start;
>   ifp->if_watchdog = octeon_eth_watchdog;
>   ifp->if_hardmtu = OCTEON_ETH_MAX_MTU;
>   IFQ_SET_MAXLEN(>if_snd, max(GATHER_QUEUE_SIZE, IFQ_MAXLEN));
> @@ -704,8 +704,6 @@ octeon_eth_ioctl(struct ifnet *ifp, u_lo
>   error = 0;
>   }
>
> - if_start(ifp);
> -
>   splx(s);
>   return (error);
>  }
> @@ -923,13 +921,14 @@ done:
>  }
>
>  void
> -octeon_eth_start(struct ifnet *ifp)
> +octeon_eth_start(struct 

Re: Scheduler ping-pong with preempt()

2017-01-27 Thread Simon Mages
Hi,

i did my usual tests.

current:
req/s: 3898.20
variance: 0.84

current+diff:
req/s: 3928.80
variance: 0.45

With this diff the measurements have been much more stable. The
variance of the req/s measurements is now a lot smaller. Also the
performance has increased.

For the bandwidth/s case this diff did not change much. The variance
for those measurements was slightly decreased though.

Overall, nice work, this diff works for me :)

2017-01-24 4:35 GMT+01:00, Martin Pieuchot :
> Userland threads are preempt()'d when hogging a CPU or when processing
> an AST.  Currently when such a thread is preempted the scheduler looks
> for an idle CPU and puts it on its run queue.  That means the number of
> involuntary context switch often result in a migration.
>
> This is not a problem per se and one could argue that if another CPU
> is idle it makes sense to move.  However with the KERNEL_LOCK() moving
> to another CPU won't necessarily allows the preempt()'d thread to run.
> It's even worse, it increases contention.
>
> If you add to this behavior the fact that sched_choosecpu() prefers idle
> CPUs in a linear order, meaning CPU0 > CPU1 > .. > CPUN, you'll
> understand that the set of idle CPUs will change every time preempt() is
> called.
>
> I believe this behavior affects kernel threads by side effect, since
> the set of idle CPU changes every time a thread is preempted.  With this
> diff the 'softnet' thread didn't move on a 2 CPUs machine during simple
> benchmarks.  Without, it plays ping-pong between CPU.
>
> The goal of this diff is to reduce the number of migrations.  You
> can compare the value of 'sched_nomigrations' and 'sched_nmigrations'
> with and without it.
>
> As usual, I'd like to know what's the impact of this diff on your
> favorite benchmark.  Please test and report back.
>
> Index: kern/kern_sched.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> retrieving revision 1.44
> diff -u -p -r1.44 kern_sched.c
> --- kern/kern_sched.c 21 Jan 2017 05:42:03 -  1.44
> +++ kern/kern_sched.c 24 Jan 2017 03:08:23 -
> @@ -51,6 +51,8 @@ uint64_t sched_noidle;  /* Times we didn
>  uint64_t sched_stolen;   /* Times we stole proc from other cpus 
> */
>  uint64_t sched_choose;   /* Times we chose a cpu */
>  uint64_t sched_wasidle;  /* Times we came out of idle */
> +uint64_t sched_nvcsw;/* voluntary context switches */
> +uint64_t sched_nivcsw;   /* involuntary context switches */
>
>  #ifdef MULTIPROCESSOR
>  struct taskq *sbartq;
> Index: kern/kern_synch.c
> ===
> RCS file: /cvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.136
> diff -u -p -r1.136 kern_synch.c
> --- kern/kern_synch.c 21 Jan 2017 05:42:03 -  1.136
> +++ kern/kern_synch.c 24 Jan 2017 03:08:23 -
> @@ -296,6 +296,7 @@ sleep_finish(struct sleep_state *sls, in
>   if (sls->sls_do_sleep && do_sleep) {
>   p->p_stat = SSLEEP;
>   p->p_ru.ru_nvcsw++;
> + sched_nvcsw++;
>   SCHED_ASSERT_LOCKED();
>   mi_switch();
>   } else if (!do_sleep) {
> @@ -481,6 +482,7 @@ sys_sched_yield(struct proc *p, void *v,
>   p->p_stat = SRUN;
>   setrunqueue(p);
>   p->p_ru.ru_nvcsw++;
> + sched_nvcsw++;
>   mi_switch();
>   SCHED_UNLOCK(s);
>
> Index: kern/sched_bsd.c
> ===
> RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 sched_bsd.c
> --- kern/sched_bsd.c  9 Mar 2016 13:38:50 -   1.43
> +++ kern/sched_bsd.c  24 Jan 2017 03:18:24 -
> @@ -302,6 +302,7 @@ yield(void)
>   p->p_stat = SRUN;
>   setrunqueue(p);
>   p->p_ru.ru_nvcsw++;
> + sched_nvcsw++;
>   mi_switch();
>   SCHED_UNLOCK(s);
>  }
> @@ -327,9 +328,12 @@ preempt(struct proc *newp)
>   SCHED_LOCK(s);
>   p->p_priority = p->p_usrpri;
>   p->p_stat = SRUN;
> +#if 0
>   p->p_cpu = sched_choosecpu(p);
> +#endif
>   setrunqueue(p);
>   p->p_ru.ru_nivcsw++;
> + sched_nivcsw++;
>   mi_switch();
>   SCHED_UNLOCK(s);
>  }
> Index: sys/sched.h
> ===
> RCS file: /cvs/src/sys/sys/sched.h,v
> retrieving revision 1.41
> diff -u -p -r1.41 sched.h
> --- sys/sched.h   17 Mar 2016 13:18:47 -  1.41
> +++ sys/sched.h   24 Jan 2017 02:10:41 -
> @@ -134,6 +134,9 @@ struct schedstate_percpu {
>  extern int schedhz;  /* ideally: 16 */
>  extern int rrticks_init; /* ticks per roundrobin() */
>
> +extern uint64_t sched_nvcsw; /* voluntary context switches */
> +extern uint64_t sched_nivcsw;/* involuntary context switches 
> */
> +
>  struct proc;
>  void schedclock(struct proc 

Re: global mbuf memory limit

2016-11-28 Thread Simon Mages
It seems netstat -m is not printing the correct results with this
diff. The max values are wrong.

# sysctl kern.maxclusters
kern.maxclusters=262144

# netstat -m
9543 mbufs in use:
8044 mbufs allocated to data
1491 mbufs allocated to packet headers
8 mbufs allocated to socket names and addresses
0/72/64 mbuf 2048 byte clusters in use (current/peak/max)
8006/33735/120 mbuf 2112 byte clusters in use (current/peak/max)
0/48/64 mbuf 4096 byte clusters in use (current/peak/max)
0/64/64 mbuf 8192 byte clusters in use (current/peak/max)
0/56/112 mbuf 9216 byte clusters in use (current/peak/max)
0/60/80 mbuf 12288 byte clusters in use (current/peak/max)
0/128/64 mbuf 16384 byte clusters in use (current/peak/max)
0/72/64 mbuf 65536 byte clusters in use (current/peak/max)
93680 Kbytes allocated to network (20% in use)
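(A back-of-the-envelope check, added for illustration and not part of the
original report: with kern.maxclusters=262144, the global cap described in
dlg's mail below should be 2 kbyte * 262144 = 512 Mbyte of cluster memory,
so per-pool max columns like 64 or 120 clusters are clearly not what
netstat should be reporting.)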

On current without the diff:

# netstat -m
42 mbufs in use:
35 mbufs allocated to data
2 mbufs allocated to packet headers
5 mbufs allocated to socket names and addresses
0/8/262144 mbuf 2048 byte clusters in use (current/peak/max)
33/45/261900 mbuf 2112 byte clusters in use (current/peak/max)
0/8/131072 mbuf 4096 byte clusters in use (current/peak/max)
0/8/65536 mbuf 8192 byte clusters in use (current/peak/max)
0/14/58254 mbuf 9216 byte clusters in use (current/peak/max)
0/10/43690 mbuf 12288 byte clusters in use (current/peak/max)
0/8/32768 mbuf 16384 byte clusters in use (current/peak/max)
0/8/8192 mbuf 65536 byte clusters in use (current/peak/max)
1120 Kbytes allocated to network (7% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

2016-11-22 3:42 GMT+01:00, David Gwynne :
> right now pools that make up mbufs are each limited individually.
>
> the following diff instead has the mbuf layer have a global limit
> on the amount of memory that can be allocated to the pools. this
> is enforced by wrapping the multi page pool allocator with something
> that checks the mbuf memory limit first.
>
> this means all mbufs will use a max of 2k * nmbclust bytes instead
> of each pool being able to use that amount each.
>
> ok?
>
> Index: sys/pool.h
> ===
> RCS file: /cvs/src/sys/sys/pool.h,v
> retrieving revision 1.68
> diff -u -p -r1.68 pool.h
> --- sys/pool.h21 Nov 2016 01:44:06 -  1.68
> +++ sys/pool.h22 Nov 2016 02:31:47 -
> @@ -205,6 +205,7 @@ struct pool {
>  #ifdef _KERNEL
>
>  extern struct pool_allocator pool_allocator_single;
> +extern struct pool_allocator pool_allocator_multi;
>
>  struct pool_request {
>   TAILQ_ENTRY(pool_request) pr_entry;
> Index: sys/mbuf.h
> ===
> RCS file: /cvs/src/sys/sys/mbuf.h,v
> retrieving revision 1.222
> diff -u -p -r1.222 mbuf.h
> --- sys/mbuf.h24 Oct 2016 04:38:44 -  1.222
> +++ sys/mbuf.h22 Nov 2016 02:31:47 -
> @@ -416,6 +416,7 @@ struct mbuf_queue {
>  };
>
>  #ifdef   _KERNEL
> +struct pool;
>
>  extern   int nmbclust;   /* limit on the # of clusters */
>  extern   int mblowat;/* mbuf low water mark */
> @@ -444,6 +445,7 @@ int   m_leadingspace(struct mbuf *);
>  int  m_trailingspace(struct mbuf *);
>  struct mbuf *m_clget(struct mbuf *, int, u_int);
>  void m_extref(struct mbuf *, struct mbuf *);
> +void m_pool_init(struct pool *, u_int, u_int, const char *);
>  void m_extfree_pool(caddr_t, u_int, void *);
>  void m_adj(struct mbuf *, int);
>  int  m_copyback(struct mbuf *, int, int, const void *, int);
> Index: kern/uipc_mbuf.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
> retrieving revision 1.238
> diff -u -p -r1.238 uipc_mbuf.c
> --- kern/uipc_mbuf.c  9 Nov 2016 08:55:11 -   1.238
> +++ kern/uipc_mbuf.c  22 Nov 2016 02:31:47 -
> @@ -133,6 +133,19 @@ void m_extfree(struct mbuf *);
>  void nmbclust_update(void);
>  void m_zero(struct mbuf *);
>
> +struct mutex m_pool_mtx = MUTEX_INITIALIZER(IPL_NET);
> +unsigned int mbuf_mem_limit; /* how much memory can we allocated */
> +unsigned int mbuf_mem_alloc; /* how much memory has been allocated */
> +
> +void *m_pool_alloc(struct pool *, int, int *);
> +void m_pool_free(struct pool *, void *);
> +
> +struct pool_allocator m_pool_allocator = {
> + m_pool_alloc,
> + m_pool_free,
> + 0 /* will be copied from pool_allocator_multi */
> +};
> +
>  static void (*mextfree_fns[4])(caddr_t, u_int, void *);
>  static u_int num_extfree_fns;
>
> @@ -148,6 +161,11 @@ mbinit(void)
>   int i;
>   unsigned int lowbits;
>
> + m_pool_allocator.pa_pagesz = pool_allocator_multi.pa_pagesz;
> +
> + nmbclust_update();
> + mbuf_mem_alloc = 0;
> +
>  #if DIAGNOSTIC
>   if (mclsizes[0] != MCLBYTES)
>   panic("mbinit: the smallest cluster size != 

Re: SOCKET_LOCK looking for testers

2016-11-04 Thread Simon Mages
Hi,

i did some performance measurements with and without your diff on
OpenBSD-current.

There is no performance difference. I think this is the expected outcome.

BR
Simon

2016-11-04 10:12 GMT+01:00, Martin Pieuchot :
> On 03/11/16(Thu) 11:21, Martin Pieuchot wrote:
>> Here's the next iteration of my diff introducing a rwlock to serialize
>> the network input path with socket paths.  Changes are:
>>
>>   - more timeout_set_proc() that should fix problems reported by
>> Chris Jackman.
>>
>>   - I introduced a set of macro to make it easier to audit existing
>> splsoftnet().
>>
>>   - It makes use of splassert_fail() if the lock is not held.
>>
>>
>> My plan is to commit it, assuming it is stable enough, then fix the
>> remaining issues in tree.  This includes:
>>
>>   - Analyze and if needed fix the two code paths were we do an
>> unlock/lock
>> dance
>>
>>   - Remove unneeded/recursive splsoftnet() dances.
>>
>> Once that's done we should be able to remove the KERNEL_LOCK() from the
>> input path.
>>
>> So please test and report back.
>
> Updated version that prevents a recursion in doaccept(), reported by Nils
> Frohberg.
>
> diff --git sys/kern/sys_socket.c sys/kern/sys_socket.c
> index 7a90f78..a7be8a1 100644
> --- sys/kern/sys_socket.c
> +++ sys/kern/sys_socket.c
> @@ -133,7 +133,7 @@ soo_poll(struct file *fp, int events, struct proc *p)
>   int revents = 0;
>   int s;
>
> - s = splsoftnet();
> + SOCKET_LOCK(s);
>   if (events & (POLLIN | POLLRDNORM)) {
>   if (soreadable(so))
>   revents |= events & (POLLIN | POLLRDNORM);
> @@ -159,7 +159,7 @@ soo_poll(struct file *fp, int events, struct proc *p)
>   so->so_snd.sb_flagsintr |= SB_SEL;
>   }
>   }
> - splx(s);
> + SOCKET_UNLOCK(s);
>   return (revents);
>  }
>
> diff --git sys/kern/uipc_socket.c sys/kern/uipc_socket.c
> index 9e8d05f..dd067b3 100644
> --- sys/kern/uipc_socket.c
> +++ sys/kern/uipc_socket.c
> @@ -89,6 +89,11 @@ struct pool sosplice_pool;
>  struct taskq *sosplice_taskq;
>  #endif
>
> +/*
> + * Serialize socket operations.
> + */
> +struct rwlock socketlock = RWLOCK_INITIALIZER("socketlock");
> +
>  void
>  soinit(void)
>  {
> @@ -123,7 +128,7 @@ socreate(int dom, struct socket **aso, int type, int
> proto)
>   return (EPROTONOSUPPORT);
>   if (prp->pr_type != type)
>   return (EPROTOTYPE);
> - s = splsoftnet();
> + SOCKET_LOCK(s);
>   so = pool_get(_pool, PR_WAITOK | PR_ZERO);
>   TAILQ_INIT(>so_q0);
>   TAILQ_INIT(>so_q);
> @@ -141,10 +146,10 @@ socreate(int dom, struct socket **aso, int type, int
> proto)
>   if (error) {
>   so->so_state |= SS_NOFDREF;
>   sofree(so);
> - splx(s);
> + SOCKET_UNLOCK(s);
>   return (error);
>   }
> - splx(s);
> + SOCKET_UNLOCK(s);
>   *aso = so;
>   return (0);
>  }
> @@ -154,9 +159,9 @@ sobind(struct socket *so, struct mbuf *nam, struct proc
> *p)
>  {
>   int s, error;
>
> - s = splsoftnet();
> + SOCKET_LOCK(s);
>   error = (*so->so_proto->pr_usrreq)(so, PRU_BIND, NULL, nam, NULL, p);
> - splx(s);
> + SOCKET_UNLOCK(s);
>   return (error);
>  }
>
> @@ -171,11 +176,11 @@ solisten(struct socket *so, int backlog)
>   if (isspliced(so) || issplicedback(so))
>   return (EOPNOTSUPP);
>  #endif /* SOCKET_SPLICE */
> - s = splsoftnet();
> + SOCKET_LOCK(s);
>   error = (*so->so_proto->pr_usrreq)(so, PRU_LISTEN, NULL, NULL, NULL,
>   curproc);
>   if (error) {
> - splx(s);
> + SOCKET_UNLOCK(s);
>   return (error);
>   }
>   if (TAILQ_FIRST(>so_q) == NULL)
> @@ -185,14 +190,14 @@ solisten(struct socket *so, int backlog)
>   if (backlog < sominconn)
>   backlog = sominconn;
>   so->so_qlimit = backlog;
> - splx(s);
> + SOCKET_UNLOCK(s);
>   return (0);
>  }
>
>  void
>  sofree(struct socket *so)
>  {
> - splsoftassert(IPL_SOFTNET);
> + SOCKET_ASSERT_LOCKED();
>
>   if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0)
>   return;
> @@ -232,7 +237,7 @@ soclose(struct socket *so)
>   struct socket *so2;
>   int s, error = 0;
>
> - s = splsoftnet();
> + SOCKET_LOCK(s);
>   if (so->so_options & SO_ACCEPTCONN) {
>   while ((so2 = TAILQ_FIRST(>so_q0)) != NULL) {
>   (void) soqremque(so2, 0);
> @@ -256,7 +261,7 @@ soclose(struct socket *so)
>   (so->so_state & SS_NBIO))
>   goto drop;
>   while (so->so_state & SS_ISCONNECTED) {
> - error = tsleep(>so_timeo,
> + error = rwsleep(>so_timeo, ,
>   PSOCK | PCATCH, "netcls",
>   so->so_linger * hz);
>  

Re: per-cpu caches for pools

2016-10-31 Thread Simon Mages
Hi,

today i did some performance measurements on OpenBSD-current, with and
without your diff.

I use a custom HTTP proxy. Performance in this case is requests/s and
bandwidth. To measure requests/s i use requests with a very small
response. To measure bandwidth i use big responses. This custom proxy
uses socket splicing to improve the performance for bigger chunks of
data. The following results are an average over multiple test runs.

without your diff:
requests/s = 3929
bandwidth in Mbit/s = 1196

with your diff:
requests/s = 4093
bandwidth in Mbit/s = 1428

Just ask if you want more details on the testsetup.

BR

Simon

2016-10-27 5:54 GMT+02:00, David Gwynne :
> On Tue, Oct 25, 2016 at 10:35:45AM +1000, David Gwynne wrote:
>> On Mon, Oct 24, 2016 at 04:24:13PM +1000, David Gwynne wrote:
>> > ive posted this before as part of a much bigger diff, but smaller
>> > is better.
>> >
>> > it basically lets things ask for per-cpu item caches to be enabled
>> > on pools. the most obvious use case for this is the mbuf pools.
>> >
>> > the caches are modelled on whats described in the "Magazines and
>> > Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary
>> > Resources" paper by Jeff Bonwick and Jonathan Adams. pools are
>> > modelled on slabs, which bonwick described in a previous paper.
>> >
>> > the main inspiration the paper provided was around how many objects
>> > to cache on each cpu, and how often to move sets of objects between
>> > the cpu caches and a global list of objects. unlike the paper we
>> > do not care about maintaining constructed objects on the free lists,
>> > so we reuse the objects themselves to build the free list.
>> >
>> > id like to get this in so we can tinker with it in tree. the things
>> > i think we need to tinker with are what poisioning we can get away
>> > with on the per cpu caches, and what limits can we enforce at the
>> > pool level.
>> >
>> > i think poisioning will be relatively simple to add. the limits one
>> > is more challenging because we dont want the pools to have to
>> > coordinate between cpus for every get or put operation. my thought
>> > there was to limit the number of pages that a pool can allocate
>> > from its backend rather than limit the items the pool can provide.
>> > limiting the pages could also be done at a lower level. eg, the
>> > mbuf clusters could share a common backend that limits the pages
>> > the pools can get, rather than have the cluster pools account for
>> > pages separately.
>> >
>> > anyway, either way i would like to get this in so we can work on
>> > this stuff.
>> >
>> > ok?
>>
>> this adds per-cpu caches to the mbuf pools so people can actually
>> try and see if the code works or not.
>
> this fixes a crash hrvoje and i found independently. avoid holding
> a mutex when calling yield().
>
> also some whitespace fixes.
>
> Index: kern/subr_pool.c
> ===
> RCS file: /cvs/src/sys/kern/subr_pool.c,v
> retrieving revision 1.198
> diff -u -p -r1.198 subr_pool.c
> --- kern/subr_pool.c  15 Sep 2016 02:00:16 -  1.198
> +++ kern/subr_pool.c  27 Oct 2016 03:51:10 -
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>
> @@ -96,6 +97,33 @@ struct pool_item {
>  };
>  #define POOL_IMAGIC(ph, pi) ((u_long)(pi) ^ (ph)->ph_magic)
>
> +#ifdef MULTIPROCESSOR
> +struct pool_list {
> + struct pool_list*pl_next;   /* next in list */
> + unsigned longpl_cookie;
> + struct pool_list*pl_nextl;  /* next list */
> + unsigned longpl_nitems; /* items in list */
> +};
> +
> +struct pool_cache {
> + struct pool_list*pc_actv;
> + unsigned longpc_nactv;  /* cache pc_actv nitems */
> + struct pool_list*pc_prev;
> +
> + uint64_t pc_gen;/* generation number */
> + uint64_t pc_gets;
> + uint64_t pc_puts;
> + uint64_t pc_fails;
> +
> + int  pc_nout;
> +};
> +
> +void *pool_cache_get(struct pool *);
> +void  pool_cache_put(struct pool *, void *);
> +void  pool_cache_destroy(struct pool *);
> +#endif
> +void  pool_cache_info(struct pool *, struct kinfo_pool *);
> +
>  #ifdef POOL_DEBUG
>  int  pool_debug = 1;
>  #else
> @@ -355,6 +383,11 @@ pool_destroy(struct pool *pp)
>   struct pool_item_header *ph;
>   struct pool *prev, *iter;
>
> +#ifdef MULTIPROCESSOR
> + if (pp->pr_cache != NULL)
> + pool_cache_destroy(pp);
> +#endif
> +
>  #ifdef DIAGNOSTIC
>   if (pp->pr_nout != 0)
>   panic("%s: pool busy: still out: %u", __func__, pp->pr_nout);
> @@ -421,6 +454,14 @@ pool_get(struct pool *pp, int flags)
>   void *v = NULL;
>   int slowdown = 0;
>
> +#ifdef MULTIPROCESSOR
> + if (pp->pr_cache != NULL) {
> + v = pool_cache_get(pp);
> + if (v != 

Re: Help me testing the netlock

2016-10-14 Thread Simon Mages
Hi,

i did some performance tests in current with and without your diff.

There is no difference in performance.

I will try to do performance tests with current on a regular basis now.

2016-10-05 10:15 GMT+02:00, Martin Pieuchot :
> On 10/04/16 16:44, Martin Pieuchot wrote:
>> On 10/03/16 16:43, Martin Pieuchot wrote:
>>> Diff below introduces a single write lock that will be used to serialize
>>> access to ip_output().
>>>
>>> This lock will be then split in multiple readers and writers to allow
>>> multiple forwarding paths to run in parallel of each others but still
>>> serialized with the socket layer.
>>>
>>> I'm currently looking for people wanting to run this diff and try to
>>> break it.  In other words, your machine might panic with it and if it
>>> does report the panic to me so the diff can be improved.
>>>
>>> I tested NFS v2 and v3 so I'm quite confident, but I might have missed
>>> some obvious stuff.
>>
>> Updated diff attaced including a fix for syn_cache_timer(), problem
>> reported by Chris Jackman.
>
> Thanks to all testers!
>
> Here's a newer version that includes a fix for rt_timer_timer() also
> found by Chris Jackman.
>
>



[PATCH] fix mbuf leak in uipc_usrreq.c

2016-08-17 Thread Simon Mages
Hi,

while i was debugging dlg@'s diff regarding the bigger mbuf clusters i stumbled
across a bug in the PRU_SEND case in uipc_usrreq.c.

There is a call to sbappendcontrol which can fail, but there is no
error handling done. If sbappendcontrol fails, m will be set to NULL,
which just leaks this mbuf because it was never put into the sb.

I think the following diff fixes this problem by handling the error correctly.

Index: kern/uipc_usrreq.c
===
RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
retrieving revision 1.100
diff -u -p -u -p -r1.100 uipc_usrreq.c
--- kern/uipc_usrreq.c  19 Jul 2016 05:30:48 -  1.100
+++ kern/uipc_usrreq.c  16 Aug 2016 15:58:32 -
@@ -254,6 +254,10 @@ uipc_usrreq(struct socket *so, int req,
if (control) {
if (sbappendcontrol(rcv, m, control))
control = NULL;
+   else {
+   error = ENOBUFS;
+   break;
+   }
} else if (so->so_type == SOCK_SEQPACKET)
sbappendrecord(rcv, m);
else



Re: bigger mbuf clusters for sosend()

2016-08-17 Thread Simon Mages
Hi,

this diff works for me.

I tested TCP and Unix Domain Sockets. I did no performance tests though.

I like this version better than the one i was working with; it really
is easier to read.

For completeness, here is the diff i was using:

Index: kern/uipc_socket.c
===
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.152
diff -u -p -u -p -r1.152 uipc_socket.c
--- kern/uipc_socket.c  13 Jun 2016 21:24:43 -  1.152
+++ kern/uipc_socket.c  16 Aug 2016 14:01:39 -
@@ -496,15 +496,18 @@ restart:
mlen = MLEN;
}
if (resid >= MINCLSIZE && space >= MCLBYTES) {
-   MCLGET(m, M_NOWAIT);
+   MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+   lmin(space, MAXMCLBYTES)));
if ((m->m_flags & M_EXT) == 0)
goto nopages;
if (atomic && top == 0) {
-   len = ulmin(MCLBYTES - max_hdr,
-   resid);
+   len = lmin(lmin(resid, space),
+   m->m_ext.ext_size -
+   max_hdr);
m->m_data += max_hdr;
} else
-   len = ulmin(MCLBYTES, resid);
+   len = lmin(lmin(resid, space),
+   m->m_ext.ext_size);
space -= len;
} else {
 nopages:


2016-08-13 10:59 GMT+02:00, Claudio Jeker :
> This diff refactors the uio to mbuf code to make use of bigger buffers (up
> to 64k) and also switches the MCLGET to use M_WAIT like the MGET calls in
> the same function. I see no point in not waiting for a cluster and instead
> chain lots of mbufs together as a consequence.
>
> This makes in my opinion the code easier to read and allows for further
> optimizations (like using non-DMA reachable mbufs for AF_UNIX sockets).
>
> This increased the preformance of loopback connections significantly when
> I tested this at n2k16.
> --
> :wq Claudio
>
>
> Index: kern//uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.152
> diff -u -p -r1.152 uipc_socket.c
> --- kern//uipc_socket.c   13 Jun 2016 21:24:43 -  1.152
> +++ kern//uipc_socket.c   12 Aug 2016 14:07:36 -
> @@ -373,6 +373,8 @@ bad:
>   return (error);
>  }
>
> +int m_getuio(struct mbuf **, int, long, struct uio *);
> +
>  #define  SBLOCKWAIT(f)   (((f) & MSG_DONTWAIT) ? M_NOWAIT : M_WAITOK)
>  /*
>   * Send on a socket.
> @@ -395,10 +397,7 @@ int
>  sosend(struct socket *so, struct mbuf *addr, struct uio *uio, struct mbuf
> *top,
>  struct mbuf *control, int flags)
>  {
> - struct mbuf **mp;
> - struct mbuf *m;
>   long space, clen = 0;
> - u_long len, mlen;
>   size_t resid;
>   int error, s;
>   int atomic = sosendallatonce(so) || top;
> @@ -475,7 +474,6 @@ restart:
>   goto restart;
>   }
>   splx(s);
> - mp = 
>   space -= clen;
>   do {
>   if (uio == NULL) {
> @@ -485,52 +483,14 @@ restart:
>   resid = 0;
>   if (flags & MSG_EOR)
>   top->m_flags |= M_EOR;
> - } else do {
> - if (top == 0) {
> - MGETHDR(m, M_WAIT, MT_DATA);
> - mlen = MHLEN;
> - m->m_pkthdr.len = 0;
> - m->m_pkthdr.ph_ifidx = 0;
> - } else {
> - MGET(m, M_WAIT, MT_DATA);
> - mlen = MLEN;
> - }
> - if (resid >= MINCLSIZE && space >= MCLBYTES) {
> - MCLGET(m, M_NOWAIT);
> - if ((m->m_flags & M_EXT) == 0)
> - goto nopages;
> - if (atomic && top == 0) {
> - len = ulmin(MCLBYTES - max_hdr,
> - resid);
> - 

Fwd: [PATCH] let the mbufs use more than 4gb of memory

2016-08-01 Thread Simon Mages
I sent this message to dlg@ directly to discuss my modification of his
diff to make the bigger mbuf clusters work. i got no response so far,
that's why i decided to post it on tech@ directly. Maybe this way i get
some feedback faster :)

BR
Simon

### Original Mail:

-- Forwarded message --
From: Simon Mages <mages.si...@googlemail.com>
Date: Fri, 22 Jul 2016 13:24:24 +0200
Subject: Re: [PATCH] let the mbufs use more than 4gb of memory
To: David Gwynne <da...@gwynne.id.au>

Hi,

I think i found the problem with your diff regarding the bigger mbuf clusters.

You choose a buffer size based on space and resid, but what happens when resid
is larger than space and space is, for example, 2050? The cluster chosen then
has size 4096. But this size is too large for the socket buffer. In the
past this was never a problem because you only allocated external clusters
of size MCLBYTES, and this was only done when space was larger than MCLBYTES.
diff:
Index: kern/uipc_socket.c
===
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.152
diff -u -p -u -p -r1.152 uipc_socket.c
--- kern/uipc_socket.c  13 Jun 2016 21:24:43 -  1.152
+++ kern/uipc_socket.c  22 Jul 2016 10:56:02 -
@@ -496,15 +496,18 @@ restart:
mlen = MLEN;
}
if (resid >= MINCLSIZE && space >= MCLBYTES) {
-   MCLGET(m, M_NOWAIT);
+   MCLGETI(m, M_NOWAIT, NULL, lmin(resid,
+   lmin(space, MAXMCLBYTES)));
if ((m->m_flags & M_EXT) == 0)
goto nopages;
if (atomic && top == 0) {
-   len = ulmin(MCLBYTES - max_hdr,
-   resid);
+   len = lmin(lmin(resid, space),
+   m->m_ext.ext_size -
+   max_hdr);
m->m_data += max_hdr;
} else
-   len = ulmin(MCLBYTES, resid);
+   len = lmin(lmin(resid, space),
+   m->m_ext.ext_size);
space -= len;
} else {
 nopages:

I'm using this diff now for a while on my notebook and everything works as
expected. But i had no time to really test it or test the performance. This
will be my next step.

I reproduced the unix socket problem you mentioned with the following little
program:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 
#include 
#include 

#define FILE "/tmp/afile"

int senddesc(int fd, int so);
int recvdesc(int so);

int
main(void)
{
struct stat sb;
int sockpair[2];
pid_t pid = 0;
int status;
int newfile;

if (unlink(FILE) < 0)
warn("unlink: %s", FILE);

int file = open(FILE, O_RDWR|O_CREAT|O_TRUNC);

if (socketpair(AF_UNIX, SOCK_STREAM|SOCK_NONBLOCK, 0, sockpair) < 0)
err(1, "socketpair");

if ((pid =fork())) {
senddesc(file, sockpair[0]);
if (waitpid(pid, , 0) < 0)
err(1, "waitpid");
} else {
newfile = recvdesc(sockpair[1]);
if (fstat(newfile, ) < 0)
err(1, "fstat");
}

return 0;
}

int
senddesc(int fd, int so)
{
struct msghdr msg;
struct cmsghdr *cmsg;
union {
struct cmsghdr  hdr;
unsigned char   buf[CMSG_SPACE(sizeof(int))];
} cmsgbuf;

char *cbuf = calloc(6392, sizeof(char));
memset(cbuf, 'K', 6392);
struct iovec iov = {
.iov_base = cbuf,
.iov_len = 6392,
};

memset(, 0, sizeof(struct msghdr));
msg.msg_iov = 
msg.msg_iovlen = 1;
msg.msg_control = 
msg.msg_controllen = sizeof(cmsgbuf.buf);

cmsg = CMSG_FIRSTHDR();
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
*(int *)CMSG_DATA(cmsg) = fd;

struct pollfd pfd[1];
int nready;
int wrote = 0;
int wrote_total = 0;
pfd[0].fd = so;
pfd[0].events = POLLOUT;

while (1) {
nready = poll(pfd, 1, -1);

Re: [PATCH] don't increase the size of socket buffers in low memory situations

2016-07-05 Thread Simon Mages
2016-07-05 15:36 GMT+02:00, Claudio Jeker <cje...@diehard.n-r-g.com>:
> On Tue, Jul 05, 2016 at 07:22:27AM -0600, Bob Beck wrote:
>> Makes sense to me.  Others?
>>
>>
>> On Tue, Jul 5, 2016 at 4:08 AM, Simon Mages <mages.si...@googlemail.com>
>> wrote:
>> > At the moment the buffersize will be set to the default even if the
>> > current value
>> > is smaller.
>> >
>> > The following diff fixes this problem.
>> >
>> > Index: netinet/tcp_usrreq.c
>> > ===
>> > RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
>> > retrieving revision 1.131
>> > diff -u -p -u -p -r1.131 tcp_usrreq.c
>> > --- netinet/tcp_usrreq.c18 Jun 2016 10:36:13 -  1.131
>> > +++ netinet/tcp_usrreq.c5 Jul 2016 09:26:24 -
>> > @@ -979,10 +979,11 @@ tcp_update_sndspace(struct tcpcb *tp)
>> > struct socket *so = tp->t_inpcb->inp_socket;
>> > u_long nmax;
>> >
>> > -   if (sbchecklowmem())
>> > +   if (sbchecklowmem()) {
>> > /* low on memory try to get rid of some */
>> > -   nmax = tcp_sendspace;
>> > -   else if (so->so_snd.sb_wat != tcp_sendspace)
>> > +   if (so->so_snd.sb_hiwat < nmax)
>> > +   nmax = tcp_sendspace;
>> > +   } else if (so->so_snd.sb_wat != tcp_sendspace)
>> > /* user requested buffer size, auto-scaling disabled */
>> > nmax = so->so_snd.sb_wat;
>> > else
>
> Here, nmax can be used uninitialized now.
> It needs be initialized to something maybe sb_hiwat?

That's true, i also found another bug in this diff; the new one follows.

>
>> > @@ -1017,10 +1018,11 @@ tcp_update_rcvspace(struct tcpcb *tp)
>> > struct socket *so = tp->t_inpcb->inp_socket;
>> > u_long nmax = so->so_rcv.sb_hiwat;
>> >
>> > -   if (sbchecklowmem())
>> > +   if (sbchecklowmem()) {
>> > /* low on memory try to get rid of some */
>> > -   nmax = tcp_recvspace;
>> > -   else if (so->so_rcv.sb_wat != tcp_recvspace)
>> > +   if (tcp_recvspace < nmax)
>> > +   nmax = tcp_recvspace;
>> > +   } else if (so->so_rcv.sb_wat != tcp_recvspace)
>> > /* user requested buffer size, auto-scaling disabled */
>> > nmax = so->so_rcv.sb_wat;
>> > else {
>> >
>
> Here there is no issue.
>
> --
> :wq Claudio
>

Index: netinet/tcp_usrreq.c
===
RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.131
diff -u -p -u -p -r1.131 tcp_usrreq.c
--- netinet/tcp_usrreq.c18 Jun 2016 10:36:13 -  1.131
+++ netinet/tcp_usrreq.c5 Jul 2016 13:41:49 -
@@ -977,12 +977,13 @@ void
 tcp_update_sndspace(struct tcpcb *tp)
 {
struct socket *so = tp->t_inpcb->inp_socket;
-   u_long nmax;
+   u_long nmax = so->so_snd.sb_hiwat;

-   if (sbchecklowmem())
+   if (sbchecklowmem()) {
/* low on memory try to get rid of some */
-   nmax = tcp_sendspace;
-   else if (so->so_snd.sb_wat != tcp_sendspace)
+   if (tcp_sendspace < nmax)
+   nmax = tcp_sendspace;
+   } else if (so->so_snd.sb_wat != tcp_sendspace)
/* user requested buffer size, auto-scaling disabled */
nmax = so->so_snd.sb_wat;
else
@@ -1017,10 +1018,11 @@ tcp_update_rcvspace(struct tcpcb *tp)
struct socket *so = tp->t_inpcb->inp_socket;
u_long nmax = so->so_rcv.sb_hiwat;

-   if (sbchecklowmem())
+   if (sbchecklowmem()) {
/* low on memory try to get rid of some */
-   nmax = tcp_recvspace;
-   else if (so->so_rcv.sb_wat != tcp_recvspace)
+   if (tcp_recvspace < nmax)
+   nmax = tcp_recvspace;
+   } else if (so->so_rcv.sb_wat != tcp_recvspace)
/* user requested buffer size, auto-scaling disabled */
nmax = so->so_rcv.sb_wat;
else {



[PATCH] don't increase the size of socket buffers in low memory situations

2016-07-05 Thread Simon Mages
At the moment the buffer size will be set to the default even if the
current value is smaller.
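(For illustration, with assumed numbers not taken from this mail: if
auto-scaling has left so_snd.sb_hiwat at, say, 8 kbyte and the system then
hits the low-memory path, the old code unconditionally sets nmax back to
tcp_sendspace, so it can grow the buffer exactly when memory is scarce;
with the diff, that path only ever shrinks the buffer.)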

The following diff fixes this problem.

Index: netinet/tcp_usrreq.c
===
RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.131
diff -u -p -u -p -r1.131 tcp_usrreq.c
--- netinet/tcp_usrreq.c18 Jun 2016 10:36:13 -  1.131
+++ netinet/tcp_usrreq.c5 Jul 2016 09:26:24 -
@@ -979,10 +979,11 @@ tcp_update_sndspace(struct tcpcb *tp)
struct socket *so = tp->t_inpcb->inp_socket;
u_long nmax;

-   if (sbchecklowmem())
+   if (sbchecklowmem()) {
/* low on memory try to get rid of some */
-   nmax = tcp_sendspace;
-   else if (so->so_snd.sb_wat != tcp_sendspace)
+   if (so->so_snd.sb_hiwat < nmax)
+   nmax = tcp_sendspace;
+   } else if (so->so_snd.sb_wat != tcp_sendspace)
/* user requested buffer size, auto-scaling disabled */
nmax = so->so_snd.sb_wat;
else
@@ -1017,10 +1018,11 @@ tcp_update_rcvspace(struct tcpcb *tp)
struct socket *so = tp->t_inpcb->inp_socket;
u_long nmax = so->so_rcv.sb_hiwat;

-   if (sbchecklowmem())
+   if (sbchecklowmem()) {
/* low on memory try to get rid of some */
-   nmax = tcp_recvspace;
-   else if (so->so_rcv.sb_wat != tcp_recvspace)
+   if (tcp_recvspace < nmax)
+   nmax = tcp_recvspace;
+   } else if (so->so_rcv.sb_wat != tcp_recvspace)
/* user requested buffer size, auto-scaling disabled */
nmax = so->so_rcv.sb_wat;
else {



[PATCH] let the mbufs use more than 4gb of memory

2016-06-22 Thread Simon Mages
On a system where you use the maximum socket buffer size of 256 kbyte you
can run out of memory after fewer than 9k open sockets.
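(Rough arithmetic behind that figure, added as an illustration and not
taken from the original mail: mbuf clusters currently come from the 4 GB
dma_constraint region, and one TCP connection with both buffers maxed out
can pin 2 * 256 kbyte = 512 kbyte of cluster memory, so about
4 GB / 512 kbyte ~= 8192 sockets exhaust it, ignoring everything else that
needs DMA-reachable memory.)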

My patch adds a new uvm_constraint for the mbufs with a bigger memory area.
I chose this area after reading the comments in sys/arch/amd64/include/pmap.h.
This patch further changes the maximum socket buffer size from 256k to 1 GB,
as described in RFC 1323 S2.3.

I tested this diff with the ix, em and urndis driver. I know that this
diff only works for amd64 right now, but i wanted to send this diff as a
proposal for what could be done. Maybe somebody has a different solution
for this problem or can tell me why this is a bad idea.


Index: arch/amd64/amd64/bus_dma.c
===
RCS file: /openbsd/src/sys/arch/amd64/amd64/bus_dma.c,v
retrieving revision 1.49
diff -u -p -u -p -r1.49 bus_dma.c
--- arch/amd64/amd64/bus_dma.c  17 Dec 2015 17:16:04 -  1.49
+++ arch/amd64/amd64/bus_dma.c  22 Jun 2016 11:33:17 -
@@ -584,7 +584,7 @@ _bus_dmamap_load_buffer(bus_dma_tag_t t,
 */
pmap_extract(pmap, vaddr, (paddr_t *));

-   if (curaddr > dma_constraint.ucr_high)
+   if (curaddr > mbuf_constraint.ucr_high)
panic("Non dma-reachable buffer at curaddr %#lx(raw)",
curaddr);

Index: arch/amd64/amd64/machdep.c
===
RCS file: /openbsd/src/sys/arch/amd64/amd64/machdep.c,v
retrieving revision 1.221
diff -u -p -u -p -r1.221 machdep.c
--- arch/amd64/amd64/machdep.c  21 May 2016 00:56:43 -  1.221
+++ arch/amd64/amd64/machdep.c  22 Jun 2016 11:33:17 -
@@ -202,9 +202,11 @@ struct vm_map *phys_map = NULL;
 /* UVM constraint ranges. */
 struct uvm_constraint_range  isa_constraint = { 0x0, 0x00ffUL };
 struct uvm_constraint_range  dma_constraint = { 0x0, 0xUL };
+struct uvm_constraint_range  mbuf_constraint = { 0x0, 0xfUL };
 struct uvm_constraint_range *uvm_md_constraints[] = {
 _constraint,
 _constraint,
+_constraint,
 NULL,
 };

Index: kern/uipc_mbuf.c
===
RCS file: /openbsd/src/sys/kern/uipc_mbuf.c,v
retrieving revision 1.226
diff -u -p -u -p -r1.226 uipc_mbuf.c
--- kern/uipc_mbuf.c13 Jun 2016 21:24:43 -  1.226
+++ kern/uipc_mbuf.c22 Jun 2016 11:33:18 -
@@ -153,7 +153,7 @@ mbinit(void)

pool_init(&mbpool, MSIZE, 0, 0, 0, "mbufpl", NULL);
pool_setipl(&mbpool, IPL_NET);
-   pool_set_constraints(&mbpool, &kp_dma_contig);
+   pool_set_constraints(&mbpool, &kp_mbuf_contig);
pool_setlowat(&mbpool, mblowat);

pool_init(&mtagpool, PACKET_TAG_MAXSIZE + sizeof(struct m_tag),
@@ -166,7 +166,7 @@ mbinit(void)
pool_init(&mclpools[i], mclsizes[i], 0, 0, 0,
mclnames[i], NULL);
pool_setipl(&mclpools[i], IPL_NET);
-   pool_set_constraints(&mclpools[i], &kp_dma_contig);
+   pool_set_constraints(&mclpools[i], &kp_mbuf_contig);
pool_setlowat(&mclpools[i], mcllowat);
}

Index: sys/socketvar.h
===
RCS file: /openbsd/src/sys/sys/socketvar.h,v
retrieving revision 1.60
diff -u -p -u -p -r1.60 socketvar.h
--- sys/socketvar.h 25 Feb 2016 07:39:09 -  1.60
+++ sys/socketvar.h 22 Jun 2016 11:33:18 -
@@ -112,7 +112,7 @@ struct socket {
short   sb_flags;   /* flags, see below */
u_short sb_timeo;   /* timeout for read/write */
} so_rcv, so_snd;
-#defineSB_MAX  (256*1024)  /* default for max chars in sockbuf */
+#defineSB_MAX  (1024*1024*1024)/* default for max chars in sockbuf */
 #defineSB_LOCK 0x01/* lock on data queue */
 #defineSB_WANT 0x02/* someone is waiting to lock */
 #defineSB_WAIT 0x04/* someone is waiting for data/space */
Index: uvm/uvm_extern.h
===
RCS file: /openbsd/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.139
diff -u -p -u -p -r1.139 uvm_extern.h
--- uvm/uvm_extern.h5 Jun 2016 08:35:57 -   1.139
+++ uvm/uvm_extern.h22 Jun 2016 11:33:18 -
@@ -234,6 +234,7 @@ extern struct uvmexp uvmexp;
 /* Constraint ranges, set by MD code. */
 extern struct uvm_constraint_range  isa_constraint;
 extern struct uvm_constraint_range  dma_constraint;
+extern struct uvm_constraint_range  mbuf_constraint;
 extern struct uvm_constraint_range  no_constraint;
 extern struct uvm_constraint_range *uvm_md_constraints[];

@@ -398,6 +399,7 @@ extern const struct kmem_pa_mode kp_zero
 extern const struct kmem_pa_mode kp_dma;
 extern const struct kmem_pa_mode kp_dma_contig;
 extern const struct kmem_pa_mode kp_dma_zero;
+extern const struct kmem_pa_mode kp_mbuf_contig;
 extern const struct kmem_pa_mode 

Supermicro X9SCM without ipmi panics while trying to attach ipmi0

2016-05-12 Thread Simon Mages
It looks like the Supermicro X9SCM BIOS lies about the presence of a BMC.

This board does not have a BMC, but OpenBSD 5.9 tries to attach ipmi0 anyway
and fails with the following panic:
...
acpibtn0 at acpi0: SLPB
acpibtn1 at acpi0: PWRB
panic: ipmi0: sendcmd fails
Starting stack trace...
panic() at panic+0x10b
ipmi_cmd_poll() at ipmi_cmd_poll+0x5c
ipmi_match() at ipmi_match+0x11f
config_scan() at config_scan+0x133
config_search() at config_search+0x129
config_found_sm() at config_found_sm+0x2b
mainbus_attach() at mainbus_attach+0x224
config_attach() at config_attach+0x1bc
cpu_configure() at cpu_configure+0x1b
main() at main+0x40d
end trace frame: 0x0, count: 4

OpenBSD 5.8 works fine, though. I think the behaviour changed with rev 1.84.
I also attached a workaround which works for me.


Index: dev/ipmi.c
===
RCS file: /cvs/src/sys/dev/ipmi.c,v
retrieving revision 1.95
diff -u -p -u -p -r1.95 ipmi.c
--- dev/ipmi.c  11 Feb 2016 04:02:22 -  1.95
+++ dev/ipmi.c  12 May 2016 14:01:05 -
@@ -150,8 +150,8 @@ int get_sdr(struct ipmi_softc *, u_int16

 intipmi_sendcmd(struct ipmi_cmd *);
 intipmi_recvcmd(struct ipmi_cmd *);
-void   ipmi_cmd(struct ipmi_cmd *);
-void   ipmi_cmd_poll(struct ipmi_cmd *);
+intipmi_cmd(struct ipmi_cmd *);
+intipmi_cmd_poll(struct ipmi_cmd *);
 void   ipmi_cmd_wait(struct ipmi_cmd *);
 void   ipmi_cmd_wait_cb(void *);

@@ -1026,26 +1026,33 @@ ipmi_recvcmd(struct ipmi_cmd *c)
return (rc);
 }

-void
+int
 ipmi_cmd(struct ipmi_cmd *c)
 {
+   int rv = 1;
+
if (cold || panicstr != NULL)
-   ipmi_cmd_poll(c);
+   rv = ipmi_cmd_poll(c);
else
ipmi_cmd_wait(c);
+
+   return rv;
 }

-void
+int
 ipmi_cmd_poll(struct ipmi_cmd *c)
 {
mtx_enter(&c->c_sc->sc_cmd_mtx);

if (ipmi_sendcmd(c)) {
-   panic("%s: sendcmd fails", DEVNAME(c->c_sc));
+   mtx_leave(&c->c_sc->sc_cmd_mtx);
+   return 0; /* BIOS is lying, there is no BMC */
}
c->c_ccode = ipmi_recvcmd(c);

mtx_leave(&c->c_sc->sc_cmd_mtx);
+
+   return 1;
 }

 void
@@ -1671,10 +1678,11 @@ ipmi_match(struct device *parent, void *
c.c_maxrxlen = sizeof(cmd);
c.c_rxlen = 0;
c.c_data = cmd;
-   ipmi_cmd(&c);
+   rv = ipmi_cmd(&c);
+
+   if (rv == 1) /* GETID worked, we got IPMI */
+   dbg_dump(1, "bmc data", c.c_rxlen, cmd);

-   dbg_dump(1, "bmc data", c.c_rxlen, cmd);
-   rv = 1; /* GETID worked, we got IPMI */
ipmi_unmap_regs(sc);
}



redundant code in reboot/halt and init?

2015-03-03 Thread Simon Mages
Hi there,

I read the code of init.c and reboot.c and asked myself why reboot does not
just send SIGINT to init.

The whole reboot code seems to be redundant, or am I missing something here?

Why not just determine whether we are running as halt or reboot, send the
correct signal to init, and let init handle everything else?
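
Something like the following is what I have in mind. This is only a rough,
untested sketch, and the signal mapping is an assumption on my side (SIGINT
for reboot and SIGUSR1 for halt is what init understands on some BSDs; our
init would have to be taught the same):

/*
 * Hypothetical minimal halt/reboot: decide by program name and let
 * init(8) do the real work.  Not a drop-in replacement for reboot(8).
 */
#include <err.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	int sig;

	if (strcmp(getprogname(), "halt") == 0)
		sig = SIGUSR1;		/* assumed: init halts */
	else
		sig = SIGINT;		/* assumed: init reboots */

	if (kill(1, sig) == -1)
		err(1, "kill");
	return 0;
}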

But maybe there is something I don't see at the moment.

BR

Simon



Re: [PATCH] bpf is now blocking again with and without timeout

2015-01-23 Thread Simon Mages
Is everything right with my regression test?

2015-01-21 15:28 GMT+01:00, Simon Mages mages.si...@googlemail.com:
 btw. here is my regression test for bpf:

 Index: regress/sys/net/Makefile
 ===
 RCS file: /home/cvs/src/regress/sys/net/Makefile,v
 retrieving revision 1.6
 diff -u -p -r1.6 Makefile
 --- regress/sys/net/Makefile  12 Jul 2014 21:41:49 -  1.6
 +++ regress/sys/net/Makefile  21 Jan 2015 13:54:05 -
 @@ -1,6 +1,6 @@
  #$OpenBSD: Makefile,v 1.6 2014/07/12 21:41:49 bluhm Exp $

 -SUBDIR +=pf_divert pf_forward pf_fragment
 +SUBDIR +=pf_divert pf_forward pf_fragment bpf_features

  .MAIN: regress

 Index: regress/sys/net/bpf_features/Makefile
 ===
 RCS file: regress/sys/net/bpf_features/Makefile
 diff -N regress/sys/net/bpf_features/Makefile
 --- /dev/null 1 Jan 1970 00:00:00 -
 +++ regress/sys/net/bpf_features/Makefile 21 Jan 2015 14:21:19 -
 @@ -0,0 +1,46 @@
 +.MAIN: all
 +
 +CDIAGFLAGS = -std=c99 -Werror -Wall -Wstrict-prototypes \
 + -Wmissing-prototypes -Wno-main -Wno-uninitialized \
 + -Wbad-function-cast -Wcast-align -Wcast-qual \
 + -Wextra -Wmissing-declarations -Wpointer-arith -Wshadow \
 + -Wsign-compare -Wuninitialized -Wunused -Wno-unused-parameter \
 + -Wnested-externs -Wunreachable-code -Winline \
 + -Wdisabled-optimization -Wconversion -Wfloat-equal -Wswitch \
 + -Wswitch-default -Wtrigraphs -Wsequence-point -Wimplicit \
 +WARNINGS =   yes
 +depend: bpf_read_blocking bpf_read_timeout bpf_read_async bpf_read_timeout_loop
 +
 +bpf_read_blocking: bpf_read_blocking.c
 + ${CC} ${CFLAGS} -o bpf_read_blocking bpf_read_blocking.c
 +
 +bpf_read_timeout: bpf_read_timeout.c
 + ${CC} ${CFLAGS} -o bpf_read_timeout bpf_read_timeout.c
 +
 +bpf_read_async: bpf_read_async.c
 + ${CC} ${CFLAGS} -o bpf_read_async bpf_read_async.c
 +
 +bpf_read_timeout_loop: bpf_read_timeout_loop.c
 + ${CC} ${CFLAGS} -o bpf_read_timeout_loop bpf_read_timeout_loop.c
 +
 +TARGETS += blocking timeout async timeoutloop
 +
 +run-regress-blocking: bpf_read_blocking
 + ./bpf_read_blocking
 +
 +run-regress-timeout: bpf_read_timeout
 + ./bpf_read_timeout
 +
 +run-regress-async: bpf_read_async
 + @/sbin/ping -c 10 127.0.0.1 > /dev/null 2>&1 &
 + ./bpf_read_async
 +
 +run-regress-timeoutloop: bpf_read_timeout_loop
 + ./bpf_read_timeout_loop
 +
 +REGRESS_TARGETS =${TARGETS:S/^/run-regress-/}
 +
 +CLEANFILES +=bpf_read_timeout bpf_read_blocking bpf_read_async \
 + bpf_read_timeout_loop
 +
 +.include <bsd.regress.mk>
 Index: regress/sys/net/bpf_features/bpf_read_async.c
 ===
 RCS file: regress/sys/net/bpf_features/bpf_read_async.c
 diff -N regress/sys/net/bpf_features/bpf_read_async.c
 --- /dev/null 1 Jan 1970 00:00:00 -
 +++ regress/sys/net/bpf_features/bpf_read_async.c 21 Jan 2015 12:32:01
 -
 @@ -0,0 +1,86 @@
 +#include <stdlib.h>
 +#include <stdio.h>
 +#include <fcntl.h>
 +#include <unistd.h>
 +#include <signal.h>
 +#include <string.h>
 +#include <errno.h>
 +#include <err.h>
 +
 +#include <sys/types.h>
 +#include <sys/time.h>
 +#include <sys/ioctl.h>
 +#include <sys/socket.h>
 +#include <net/bpf.h>
 +#include <net/ethertypes.h>
 +#include <netinet/in.h>
 +#include <net/if.h>
 +#include <sys/time.h>
 +
 +#define BUFFFER_SIZE 32786
 +u_char buffer[BUFFFER_SIZE];
 +u_int buf_size = BUFFFER_SIZE;
 +
 +struct bpf_program bpf_machine = {
 + 2,
 + (struct bpf_insn []){
 + BPF_STMT(BPF_LD+BPF_W+BPF_LEN, 8),
 + BPF_STMT(BPF_RET+BPF_K, 8),
 + },
 +};
 +struct ifreq interface;
 +struct sigaction sigact;
 +int fd, out, pid, flag;
 +int async = 1;
 +struct sigaction action;
 +
 +void handler(int);
 +
 +int
 +main(void)
 +{
 + action.sa_handler = handler;
 + sigemptyset(&action.sa_mask);
 + action.sa_flags = 0;
 + if (sigaction(SIGIO, &action, NULL) < 0)
 + err(1, "sigaction");
 +
 + pid = getpid();
 + if (setpgrp(0, 0) < 0)
 +   err(1, "setpgrp");
 +
 + sigemptyset(&action.sa_mask);
 + if (sigprocmask(SIG_SETMASK, &action.sa_mask, NULL) < 0)
 + err(1, "sigprocmask");
 +
 + if ((fd = open("/dev/bpf9", O_RDONLY)) < 0)
 + err(1, "open");
 +
 + if (ioctl(fd, BIOCSBLEN, &buf_size) < 0)
 + err(1, "BIOCSBLEN");
 +
 + strlcpy(interface.ifr_name, "lo0", sizeof(interface.ifr_name));
 + if (ioctl(fd, BIOCSETIF, &interface) < 0)
 + err(1, "BIOCSETIF");
 +
 + if (ioctl(fd, FIOSETOWN, &pid) < 0)
 + err(1, "FIOSETOWN");
 + if (ioctl(fd, FIOASYNC, &async) < 0)
 + err(1, "FIOASYNC");
 +
 + if (ioctl(fd, BIOCSETF, &bpf_machine) < 0)
 + err(1, "BIOCSETF");
 +
 + out = read(fd, buffer, sizeof(buffer));
 + if (out < 0)
 + err(1, read

Re: [PATCH] bpf is now blocking again with and without timeout

2015-01-07 Thread Simon Mages
I tested the patch and it's working.

I have a small test program already. I'll create a regression test from it
and post the diff here.
 On 06.01.2015 04:19, Philip Guenther guent...@gmail.com wrote:

 [(@#*$(*# control-enter keybinding]

 On Mon, Jan 5, 2015 at 7:15 PM, Philip Guenther guent...@gmail.com
 wrote:
  On Mon, Jan 5, 2015 at 11:01 AM, Ted Unangst t...@tedunangst.com
 wrote:
  ...
  In the regular timeout case, I'm not sure what you're changing. There
  is a problem here though. If we're already close to the timeout
  expiring, we shouldn't sleep the full timeout, only the time left
  since we began the read.

 Yes, that was what I was trying to convey in my reply to Mages's
 earlier post on this bpf issue.

 Your diff looks correct to me, though untested.

 Mages, do you have code this can be tested against?  Is there
 something you could contribute to form a regress test we could place
 under /usr/src/regress/net/ to verify that we got this right and to
 catch breakage in the future?


 Philip Guenther



tcpdump non-blocking/immediate mode patch

2014-12-14 Thread Simon Mages
Hi,

tcpdump feels a bit laggy or slow sometimes when I use it for live
debugging.

The following patch adds a new flag, '-b', to tcpdump. With this flag,
tcpdump sets BIOCIMMEDIATE on the bpf(4) device, so reads return as soon as
packets arrive instead of waiting for the store buffer to fill, and the
output keeps up with the traffic.
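
As a rough standalone illustration of what immediate mode does (not part of
the patch; the device node and interface name below are just examples):

/*
 * Open a bpf device, bind it to an interface and turn on immediate
 * mode: every read(2) then completes as soon as at least one packet
 * has been captured, instead of waiting for the buffer to fill or a
 * read timeout to expire.
 */
#include <sys/types.h>
#include <sys/time.h>
#include <sys/ioctl.h>
#include <net/bpf.h>
#include <net/if.h>
#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct ifreq ifr;
	u_int imm = 1, buflen;
	ssize_t n;
	char *buf;
	int fd;

	if ((fd = open("/dev/bpf0", O_RDONLY)) < 0)
		err(1, "open");
	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, "em0", sizeof(ifr.ifr_name));
	if (ioctl(fd, BIOCSETIF, &ifr) < 0)
		err(1, "BIOCSETIF");
	if (ioctl(fd, BIOCIMMEDIATE, &imm) < 0)
		err(1, "BIOCIMMEDIATE");
	if (ioctl(fd, BIOCGBLEN, &buflen) < 0)
		err(1, "BIOCGBLEN");
	if ((buf = malloc(buflen)) == NULL)
		err(1, "malloc");
	/* each read now returns as soon as a packet is seen */
	while ((n = read(fd, buf, buflen)) > 0)
		printf("read %zd bytes of capture data\n", n);
	return 0;
}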

Index: usr.sbin/tcpdump/privsep.c
===
RCS file: /cvs/src/usr.sbin/tcpdump/privsep.c,v
retrieving revision 1.30
diff -u -p -r1.30 privsep.c
--- usr.sbin/tcpdump/privsep.c 22 Sep 2011 09:12:30 - 1.30
+++ usr.sbin/tcpdump/privsep.c 14 Dec 2014 22:40:14 -
@@ -318,7 +318,7 @@ priv_init(int argc, char **argv)
 static void
 impl_open_bpf(int fd, int *bpfd)
 {
- int snaplen, promisc, err;
+ int snaplen, promisc, immediate, err;
  u_int dlt, dirfilt;
  char device[IFNAMSIZ];
  size_t iflen;
@@ -327,12 +327,13 @@ impl_open_bpf(int fd, int *bpfd)

  must_read(fd, &snaplen, sizeof(int));
  must_read(fd, &promisc, sizeof(int));
 + must_read(fd, &immediate, sizeof(int));
  must_read(fd, &dlt, sizeof(u_int));
  must_read(fd, &dirfilt, sizeof(u_int));
  iflen = read_string(fd, device, sizeof(device), __func__);
  if (iflen == 0)
  errx(1, Invalid interface size specified);
- *bpfd = pcap_live(device, snaplen, promisc, dlt, dirfilt);
+ *bpfd = pcap_live(device, snaplen, promisc, immediate, dlt, dirfilt);
  err = errno;
  if (*bpfd  0)
  logmsg(LOG_DEBUG,
Index: usr.sbin/tcpdump/privsep.h
===
RCS file: /cvs/src/usr.sbin/tcpdump/privsep.h,v
retrieving revision 1.7
diff -u -p -r1.7 privsep.h
--- usr.sbin/tcpdump/privsep.h 25 Aug 2009 06:59:17 - 1.7
+++ usr.sbin/tcpdump/privsep.h 14 Dec 2014 22:40:14 -
@@ -47,10 +47,10 @@ int priv_init(int, char **);
 voidpriv_init_done(void);

 int setfilter(int, int, char *);
-int pcap_live(const char *, int, int, u_int, u_int);
+int pcap_live(const char *, int, int, int, u_int, u_int);

 struct bpf_program *priv_pcap_setfilter(pcap_t *, int, u_int32_t);
-pcap_t *priv_pcap_live(const char *, int, int, int, char *, u_int,
+pcap_t *priv_pcap_live(const char *, int, int, int, int, char *, u_int,
 u_int);
 pcap_t *priv_pcap_offline(const char *, char *);

Index: usr.sbin/tcpdump/privsep_pcap.c
===
RCS file: /cvs/src/usr.sbin/tcpdump/privsep_pcap.c,v
retrieving revision 1.17
diff -u -p -r1.17 privsep_pcap.c
--- usr.sbin/tcpdump/privsep_pcap.c 14 Nov 2012 03:33:04 - 1.17
+++ usr.sbin/tcpdump/privsep_pcap.c 14 Dec 2014 22:40:14 -
@@ -172,8 +172,8 @@ priv_pcap_setfilter(pcap_t *hpcap, int o

 /* privileged part of priv_pcap_live */
 int
-pcap_live(const char *device, int snaplen, int promisc, u_int dlt,
-u_int dirfilt)
+pcap_live(const char *device, int snaplen, int promisc, int immediate,
+u_int dlt, u_int dirfilt)
 {
  char bpf[sizeof "/dev/bpf00"];
  int fd, n = 0;
@@ -204,6 +204,10 @@ pcap_live(const char *device, int snaple
  if (promisc)
  /* this is allowed to fail */
  ioctl(fd, BIOCPROMISC, NULL);
+
 + if (immediate && ioctl(fd, BIOCIMMEDIATE, &immediate) < 0)
+ goto error;
+
  if (ioctl(fd, BIOCSDIRFILT, &dirfilt) < 0)
  goto error;

@@ -223,7 +227,7 @@ pcap_live(const char *device, int snaple
  * unprivileged part.
  */
 pcap_t *
-priv_pcap_live(const char *dev, int slen, int prom, int to_ms,
+priv_pcap_live(const char *dev, int slen, int prom, int imme, int to_ms,
 char *ebuf, u_int dlt, u_int dirfilt)
 {
  int fd, err;
@@ -251,6 +255,7 @@ priv_pcap_live(const char *dev, int slen
  write_command(priv_fd, PRIV_OPEN_BPF);
  must_write(priv_fd, &slen, sizeof(int));
  must_write(priv_fd, &prom, sizeof(int));
 + must_write(priv_fd, &imme, sizeof(int));
  must_write(priv_fd, &dlt, sizeof(u_int));
  must_write(priv_fd, &dirfilt, sizeof(u_int));
  write_string(priv_fd, dev);
Index: usr.sbin/tcpdump/tcpdump.8
===
RCS file: /cvs/src/usr.sbin/tcpdump/tcpdump.8,v
retrieving revision 1.83
diff -u -p -r1.83 tcpdump.8
--- usr.sbin/tcpdump/tcpdump.8 3 Jun 2014 02:57:29 - 1.83
+++ usr.sbin/tcpdump/tcpdump.8 14 Dec 2014 22:40:15 -
@@ -28,7 +28,7 @@
 .Sh SYNOPSIS
 .Nm tcpdump
 .Bk -words
-.Op Fl AadefILlNnOopqStvXx
+.Op Fl AabdefILlNnOopqStvXx
 .Op Fl c Ar count
 .Op Fl D Ar direction
 .Oo Fl E Oo Ar espalg : Oc Ns
@@ -61,6 +61,9 @@ The smaller of the entire packet or
 bytes will be printed.
 .It Fl a
 Attempt to convert network and broadcast addresses to names.
+.It Fl b
+Disable read blocking on the bpf(4) buffer.  In this so-called
+``immediate mode'', reads return immediately upon packet reception.
 .It Fl c Ar count
 Exit after receiving
 .Ar count
Index: usr.sbin/tcpdump/tcpdump.c
===
RCS file: /cvs/src/usr.sbin/tcpdump/tcpdump.c,v
retrieving revision 1.66
diff -u -p -r1.66 tcpdump.c
--- usr.sbin/tcpdump/tcpdump.c 30 Jun 2014 04:25:11 - 1.66
+++ usr.sbin/tcpdump/tcpdump.c 14 Dec 2014 22:40:15 

Re: patch: Intel CPU sensor readout correction

2014-12-03 Thread Simon Mages
Hi,

I would like to ask if everything is OK with my patch.

If something is wrong, tell me and I'll fix it.

I think this patch is a nice correction of the temperature readout
of Intel CPUs.

I tested this patch on several CPUs and everything looks fine so far.

BR

Simon

2014-11-27 12:44 GMT+01:00, Mages, Simon simon_ma...@genua.de:
 Hi there,

 the temperatures 'sysctl hw.sensors' displays for each CPU
 are wrong for most modern Intel CPUs.

 OpenBSD uses only 100 or 85 degC as TJmax for Intel CPUs, but in reality
 the TJmax value is only somewhere around those specified values: Intel
 defines a TJmax for every production batch individually and burns it into
 the die.  Since WESTMERE we can officially read this value in supervisor
 mode.

 I have a patch which would fix this for CPUs since WESTMERE.
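
To make the arithmetic concrete, here is a small userland illustration of
the computation only (rdmsr itself is kernel-only); the bit positions are
taken from my reading of the Intel SDM and the example values are made up:

/*
 * Derive the temperature the sensor should report from the raw
 * MSR_TEMPERATURE_TARGET and IA32_THERM_STATUS values: Tj(Max) lives
 * in bits 23:16 of the former, the digital readout (degrees below
 * Tj(Max)) in bits 22:16 of the latter, and bit 31 of the status
 * word says whether the readout is valid.
 */
#include <stdint.h>
#include <stdio.h>

static int
dts_temp(uint64_t temperature_target, uint64_t therm_status)
{
	int tjmax = (temperature_target >> 16) & 0xff;
	int below = (therm_status >> 16) & 0x7f;

	if ((therm_status & (1ULL << 31)) == 0)
		return -1;		/* reading not valid */
	return tjmax - below;
}

int
main(void)
{
	/* example: Tj(Max) of 98 degC, readout 40 below it -> 58 degC */
	printf("%d degC\n",
	    dts_temp(98ULL << 16, (1ULL << 31) | (40ULL << 16)));
	return 0;
}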

 Index: sys/arch/amd64//amd64/identcpu.c
 ===
 RCS file: /cvs/src/sys/arch/amd64/amd64/identcpu.c,v
 retrieving revision 1.56
 diff -u -p -u -p -r1.56 identcpu.c
 --- sys/arch/amd64//amd64/identcpu.c  17 Oct 2014 18:15:48 -  1.56
 +++ sys/arch/amd64//amd64/identcpu.c  27 Nov 2014 10:14:05 -
 @@ -179,24 +179,75 @@ voidintelcore_update_sensor(void *args)
  /*
   * Temperature read on the CPU is relative to the maximum
   * temperature supported by the CPU, Tj(Max).
 - * Poorly documented, refer to:
 - * http://softwarecommunity.intel.com/isn/Community/
 - * en-US/forums/thread/30228638.aspx
 - * Basically, depending on a bit in one msr, the max is either 85 or 100.
 - * Then we subtract the temperature portion of thermal status from
 - * max to get current temperature.
 + * Refer to:
 + * 64-ia-32-architectures-software-developer-vol-3c-part-3-manual.pdf
 + * Section 35 and
 + * http://www.intel.com/content/dam/www/public/us/en/documents/
 + * white-papers/cpu-monitoring-dts-peci-paper.pdf
 + *
 + * The TJmax on Intel CPUs can be between 70 and 105 degC; since
 + * WESTMERE we can read the TJmax from the die. For older CPUs we have
 + * to guess or use undocumented MSRs. Then we subtract the temperature
 + * portion of thermal status from max to get current temperature.
   */
  void
  intelcore_update_sensor(void *args)
  {
   struct cpu_info *ci = (struct cpu_info *) args;
   u_int64_t msr;
 - int max = 100;
 + int max;

 - /* Only some Core family chips have MSR_TEMPERATURE_TARGET. */
 - if (ci->ci_model == 0xe &&
 -     (rdmsr(MSR_TEMPERATURE_TARGET) & MSR_TEMPERATURE_TARGET_LOW_BIT))
 - max = 85;
 + switch (ci->ci_model) {
 + case INTEL_FUTURE_MODEL_4E:
 + case INTEL_FUTURE_MODEL_56:
 + case INTEL_BROADWELL_MODEL_3D:
 + case INTEL_HASWELL_MODEL_3C:
 + case INTEL_HASWELL_MODEL_3F:
 + case INTEL_HASWELL_MODEL_45:
 + case INTEL_HASWELL_MODEL_46:
 + case INTEL_IVYBRIDGE_MODEL_3A:
 + case INTEL_IVYBRIDGE_MODEL_3E:
 + case INTEL_NEHALEM_MODEL_1A:
 + case INTEL_NEHALEM_MODEL_1E:
 + case INTEL_NEHALEM_MODEL_1F:
 + case INTEL_NEHALEM_MODEL_2E:
 + case INTEL_SANDYBRIDGE_MODEL_2A:
 + case INTEL_SANDYBRIDGE_MODEL_2D:
 + case INTEL_SILVERMONT_MODEL_37:
 + case INTEL_SILVERMONT_MODEL_4A:
 + case INTEL_SILVERMONT_MODEL_4D:
 + case INTEL_SILVERMONT_MODEL_5A:
 + case INTEL_SILVERMONT_MODEL_5D:
 + case INTEL_WESTMERE_MODEL_25:
 + case INTEL_WESTMERE_MODEL_2C:
 + case INTEL_WESTMERE_MODEL_2F:
 + /*
 +  * Newer CPUs can tell you what their max temperature is.
 +  * See: '64-ia-32-architectures-software-developer-
 +  * vol-3c-part-3-manual.pdf'
 +  */
 + max = MSR_TEMPERATURE_TARGET_TJMAX(
 + rdmsr(MSR_TEMPERATURE_TARGET));
 + break;
 + case INTEL_YONAH_MODEL_0E:
 + /*
 +  * Only Core Duo and Core Solo family chips have
 +  * this undocumented MSR_TEMPERATURE_TARGET.
 +  */
 + if (rdmsr(MSR_TEMPERATURE_TARGET_UNDOCUMENTED) &
 +     MSR_TEMPERATURE_TARGET_LOW_BIT_UNDOCUMENTED) {
 + max = 85;
 + break;
 + }
 + /* FALLTHROUGH */
 + default:
 + /*
 +  * XXX: 100 degC is not the max for every CPU not
 +  * covered here. But newer CPUs, since Nehalem,
 +  * have MSR_TEMPERATURE_TARGET anyway.
 +  */
 + max = 100;
 + }

   msr = rdmsr(MSR_THERM_STATUS);
   if (msr & MSR_THERM_STATUS_VALID_BIT) {
 Index: sys/arch/amd64//include/specialreg.h
 ===
 RCS file: /cvs/src/sys/arch/amd64/include/specialreg.h,v
 retrieving revision 1.28
 diff -u -p -u -p -r1.28 specialreg.h
 --- sys/arch/amd64//include/specialreg.h  3 Jul 2014 21:15:28 -   
 1.28
 +++ sys/arch/amd64//include/specialreg.h  27 Nov 2014 10:14:05 -
 @@ -278,9