Adding a new member to m_pkthdr

2022-05-27 Thread John Baldwin

For NIC offload of kernel TLS on the receive side, the kernel needs to know
the "leaf" interface that packets arrive on up in the socket buffer layer
when appending received packet data to a socket using KTLS.  rcvif does not
fully work since connections that transit a virtual interface like if_vlan or
if_lagg rewrite m_rcvif to be the virtual interface.  For KTLS transmit we
are able to follow the transmit path down to configure KTLS on the leaf
interface.  However, while the receive path is usually a mirror of the
transmit path, it is not always.  In particular, when using a lagg(4) with
LACP, the other end of the lagg is free to use whatever hash it chooses to
distribute traffic across the lagg ports such that the receive and transmit
sides of a connection may transit different ports within a lagg.

To provide a leaf interface, I have a patch that adds a new field to m_pkthdr
to track the leaf receive interface.  It is optional and the only use
currently anticipated is KTLS.  In the current KTLS patches it is set
on received packets by the mlx5 driver.  Possibly it could be set more
generically in ether_input instead of in individual drivers.  It is
serialized to an index and generation count while packets are deferred to
a netisr similar to rcvif except that it is non-fatal if the ifp cannot
be re-materialized when the mbuf is dequeued.  Instead, the pointer is
simply left as NULL.

However, using more space in m_pkthdr is a non-trivial thing, so it's worth
raising the conversation more broadly.  The change to add this field is in
https://reviews.freebsd.org/D35339.  Drew has tested this isolated change
under load at Netflix and found no impact on performance.

--
John Baldwin



RFC: KTLS RX software support

2020-04-29 Thread John Baldwin
I'd like some feedback on a patch I have to add software KTLS RX support.
There's more detail on some of the unusual-ness (especially in regards to
socket buffers) and the reasoning behind it in the review, so the best 
place to talk about it is probably the review itself which you can find at
https://reviews.freebsd.org/D24628

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: panic on invalid ifp pointer in iflib drivers

2019-10-17 Thread John Baldwin
On 10/17/19 4:22 PM, Keller, Jacob E wrote:
> Hmm.. now that I look at that more closely I think it's a separate issue.

This may just be the rcvif issue where we don't reference count the ifp we
store in rcvif in mbufs?  That was my reaction to your first e-mail except
that you said it wasn't reproducible on some other drivers.  I wonder if
other drivers would also provoke this if you just ran them in a detach/attach
loop long enough.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: panic on invalid ifp pointer in iflib drivers

2019-10-17 Thread John Baldwin
On 10/16/19 2:16 PM, Keller, Jacob E wrote:
> Hi,
> 
> I'm investigating an issue on the iflib ixl driver in 11.3-RELEASE as well as 
> 12-RELEASE. We found a panic in that occurs if SCTP/IPv6 traffic is being 
> transmitted while the device is detached:
> 
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0xfe411e38
> fault code  = supervisor read data, page not present
> instruction pointer = 0x20:0x80c84700
> stack pointer   = 0x28:0xfe2f4351b600
> frame pointer   = 0x28:0xfe2f4351b650
> code segment= base 0x0, limit 0xf, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags= interrupt enabled, resume, IOPL = 0
> current process = 12 (swi4: clock (0))
> trap number = 12
> panic: page fault
> cpuid = 0
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfe2f4351b2c0
> vpanic() at vpanic+0x17e/frame 0xfe2f4351b320
> panic() at panic+0x43/frame 0xfe2f4351b380
> trap_fatal() at trap_fatal+0x369/frame 0xfe2f4351b3d0
> trap_pfault() at trap_pfault+0x62/frame 0xfe2f4351b420
> trap() at trap+0x2b3/frame 0xfe2f4351b530
> calltrap() at calltrap+0x8/frame 0xfe2f4351b530
> --- trap 0xc, rip = 0x80c84700, rsp = 0xfe2f4351b600, rbp = 
> 0xfe2f4351b650 ---
> in6_selecthlim() at in6_selecthlim+0x20/frame 0xfe2f4351b650
> sctp_lowlevel_chunk_output() at sctp_lowlevel_chunk_output+0xeb2/frame 
> 0xfe2f4351b790
> sctp_chunk_output() at sctp_chunk_output+0x68c/frame 0xfe2f4351c110
> sctp_timeout_handler() at sctp_timeout_handler+0x2d8/frame 0xfe2f4351c180
> softclock_call_cc() at softclock_call_cc+0x15b/frame 0xfe2f4351c230
> softclock() at softclock+0x7c/frame 0xfe2f4351c260
> intr_event_execute_handlers() at intr_event_execute_handlers+0x9a/frame 
> 0xfe2f4351c2a0
> ithread_loop() at ithread_loop+0xb7/frame 0xfe2f4351c2f0
> fork_exit() at fork_exit+0x84/frame 0xfe2f4351c330
> fork_trampoline() at fork_trampoline+0xe/frame 0xfe2f4351c330
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> KDB: enter: panic
> 
> 
> From what I've gathered so far, it appears that the issue is a use-after-free 
> where the SCTP stack gets an ifp pointer that's no longer valid. We've 
> reproduced this issue on multiple iflib-based drivers, including ixl and the 
> recently published ice driver code (available on phabricator).
> 
> Additionally, we cannot reproduce it on legacy-stack drivers for ixl, or a 
> mellanox 100G board we have. This leads me to believe that it's an issue in 
> iflib rather than in the specific device drivers.
> 
> I am not sure exactly what's going wrong here... anyone have suggestions? I 
> thought it might be an issue of when ether_ifdetach is called. That function 
> is supposed to clear all of the pre-existing routes from the route entry 
> list. I'm thinking maybe somehow a route gets added after ether_ifdetach is 
> called.
> 
> In the iflib_device_deregister function, ether_ifdetach is called just after 
> iflib_stop, (which would call a device's if_stop routine), and then the task 
> queues are shutdown, a driver's ifdi_detach handler is called, and the ifp is 
> free'd at the end. In the ixl legacy driver, ether_ifdetach is called prior 
> to the stop routine. However, in the mlx5 driver, it's called after a call to 
> close_locked()...
> 
> So I'm really not sure exactly what could cause a stale ifp pointer to get 
> into the route entry list.

Nominally, ifnet drivers should call ether_ifdetach first to remove public
references to the ifnet and only call their stop routine after that has 
returned.
This ensures any open if_ioctl invocations have completed, etc. before the
stop routine is invoked.  Otherwise you are open to a race where the inteface
can be upped via an ioctl after you have stopped the hardware.

Any other references to the ifnet via eventhandlers, etc. should also be
deregistered before calling the stop routine.  

After the hardware is stopped, interrupt handlers should be torn down and 
callouts
and tasks drained to ensure there are no other references to the ifp outside of
the thread running detach.

After that you can release device resources, destroy mutexes, free the ifp, etc.
Note that drivers have to be prepared for ether_ifdetach to invoke if_ioctl 
(e.g.
when detaching bpf), but of the drivers I've looked at this has generally been a
non-issue.

It sounds like iflib should be doing the detach before calling iflib_stop.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: mmap kernel chunk into user space

2019-07-29 Thread John Baldwin
On 7/21/19 6:21 PM, Laurie Jennings wrote:
> 
> Im wondering if there have been changes to the api since FreeBSD 9 as I can't 
> get some code I'm porting to work.
> I have a block of kernel memory wired down and I want to map it to user 
> space. Its just a big structure that has stats and other volatile info. In 
> 9.x I was able to simply do:
> // kptr has the kernel address obtained from an ioctl call
> 
> fd=open("/dev/kmem",O_RDWR);memp=mmap(0,size,PROT_READ|PROT_WRITE,MAP_SHARED,fd,(off_t)kptr);
> 
> And it just worked. In 11.3 it fails, and I havent been able to get ANYTHING 
> to work with this method. I'm open to another method; I read something about 
> mmap no longer supportinjgkmem. In which case, what can I do?

Supporting arbitrary mmap of /dev/kmem is not safe as the user mappings
weren't updated if the kernel mappings changed.  To do what you want,
you would create a dedicated device in /dev that you can mmap to access
your data structure.  The simplest way is to just have your device's
d_mmap callback use the offset as an offset into your in-kernel pointer
and then use pmap_kextract() to get the physical address to return.

A fancier version would be to build an sglist describing your buffer and
create an OBJT_SG VM object that you returned from the d_mmap_single
callback, but if you only ever have a single object that is never freed,
the simple version will work fine.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] D19422: if_vxlan(4) Allow set MTU more than 1500 bytes.

2019-07-15 Thread jhb (John Baldwin)
jhb accepted this revision.
jhb added a comment.
This revision is now accepted and ready to land.


  Thanks!

CHANGES SINCE LAST ACTION
  https://reviews.freebsd.org/D19422/new/

REVISION DETAIL
  https://reviews.freebsd.org/D19422

EMAIL PREFERENCES
  https://reviews.freebsd.org/settings/panel/emailpreferences/

To: aleksandr.fedorov_itglobal.com, bryanv, hrs, #network, rgrimes, krion, jhb
Cc: evgueni.gavrilov_itglobal.com, olevole_olevole.ru, ae, freebsd-net-list, 
krzysztof.galazka_intel.com
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] D19422: if_vxlan(4) Allow set MTU more than 1500 bytes.

2019-07-12 Thread jhb (John Baldwin)
jhb added a comment.


  I agree that we can't handle this in ether_ioctl as it varies too much by 
real hardware.

INLINE COMMENTS

> if_vxlan.c:90
> + */
> +#define VXLAN_MAX_MTU65435
> +

If it was possible to make this derived from other constants that would be 
ideal.   has ETHER_MAX_LEN_JUMBO of only 9018, but it seems 
some other drivers have higher maximums (9600 for cxgbe and 9728 for ixl(4)).  
So sadly there is no good constant for the 64k max.   Is there an expression 
that would give you the max overhead?  If so you could use something like:

  '(65535 - sizeof(struct vxlan_header) - ETHER_HDR_LEN - ETHER_CRC_LEN - 
ETHER_VLAN_VCAP_LEN)

if_ixl.c has something like that for setting the MTU.  What I don't know is 
what the possible encapsulation overhead is for this interface.

CHANGES SINCE LAST ACTION
  https://reviews.freebsd.org/D19422/new/

REVISION DETAIL
  https://reviews.freebsd.org/D19422

EMAIL PREFERENCES
  https://reviews.freebsd.org/settings/panel/emailpreferences/

To: aleksandr.fedorov_itglobal.com, bryanv, hrs, #network, rgrimes, krion, jhb
Cc: evgueni.gavrilov_itglobal.com, olevole_olevole.ru, ae, freebsd-net-list, 
krzysztof.galazka_intel.com
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] D9058: alc: Add Killer E2500 support.

2017-01-06 Thread jhb (John Baldwin)
jhb accepted this revision.
This revision has a positive review.

REVISION DETAIL
  https://reviews.freebsd.org/D9058

EMAIL PREFERENCES
  https://reviews.freebsd.org/settings/panel/emailpreferences/

To: sepherosa_gmail.com, decui_microsoft.com, jhb
Cc: freebsd-net-list
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] D8905: if: Defer the if_up until the ifnet.if_ioctl is called.

2016-12-29 Thread jhb (John Baldwin)
jhb added a comment.


  I would suggest rewording the message a bit to something like:
  
Defer if_up() until after the interface's if_ioctl method is called.

This ensures the interface is initialized by the interface driver before
it can be used by the rest of the system.

REVISION DETAIL
  https://reviews.freebsd.org/D8905

EMAIL PREFERENCES
  https://reviews.freebsd.org/settings/panel/emailpreferences/

To: sepherosa_gmail.com, delphij, royger, decui_microsoft.com, 
honzhan_microsoft.com, howard0su_gmail.com, adrian, hiren, bz, gnn, glebius, 
karels
Cc: jhb, freebsd-net-list
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: cxgbe's native netmap support broken since r307394

2016-12-29 Thread John Baldwin
On Thursday, December 29, 2016 10:02:32 AM Vincenzo Maffione wrote:
> Ok, thanks for the clarification. I change the lock type to MTX_DEF
> (and did a test). I attached the new patch.

Looks good to me, thanks!

> Cheers,
>   Vincenzo
> 
> 2016-12-29 2:06 GMT+01:00 John Baldwin :
> > On Wednesday, December 28, 2016 07:25:22 PM Vincenzo Maffione wrote:
> >> Hi,
> >>   The "worker_lock" is taken by nm_os_kthread_wakeup_worker(), which
> >> in turn is called by nm_pt_host_notify(). The latter function is a
> >> callback that may be called by a driver interrupt service routine;
> >> more precisely this happens when the driver calls netmap_tx_irq() or
> >> netmap_rx_irq(). As far as I know in FreeBSD it is not possible to
> >> lock a MTX_DEF mtx inside an ISR. Am I wrong on this?
> >
> > It depends.  Most interrupt handlers run in ithreads and can use MTX_DEF.
> > Only interrupt handlers that run from a filter are restricted to using
> > MTX_SPIN.  Looking at all the calls to netmap_[tr]x_irq() in the tree,
> > they are all done from contexts that are safe to use MTX_DEF (and in
> > general they have to be as the equivalent code for the non-netmap case
> > is calling routines like if_input or m_freem that can't be invoked from
> > a filter either).
> >
> >>
> >> Cheers,
> >>   Vincenzo
> >>
> >> 2016-12-28 19:06 GMT+01:00 John Baldwin :
> >> > Why are you using MTX_SPIN?  Changing the lock type to MTX_DEF would 
> >> > seem to
> >> > be a smaller patch and probably more correct for FreeBSD.
> >> >
> >> > On Thursday, December 22, 2016 05:29:41 PM Luigi Rizzo wrote:
> >> >> sure go ahead and thank you!
> >> >>
> >> >> On Thu, Dec 22, 2016 at 5:15 PM, Adrian Chadd  
> >> >> wrote:
> >> >> > ok, does anyone mind if I commit it as-is?
> >> >> >
> >> >> >
> >> >> > -a
> >> >> >
> >> >> >
> >> >> > On 21 December 2016 at 13:37, Vincenzo Maffione 
> >> >> >  wrote:
> >> >> >> Hi Luigi,
> >> >> >>   I attached a minimal change containing two fixes:
> >> >> >>
> >> >> >> - change IFNET_WLOCK into IFNET_RLOCK, to fix the cxgbe issue related
> >> >> >> to this thread
> >> >> >> - use the proper locking functions for the "worker_lock", unrelated
> >> >> >> but needed to avoid the O.S. to trap because of a mismatch between
> >> >> >> MTX_SPIN and MTX_DEF.
> >> >> >>
> >> >> >> Cheers,
> >> >> >>   Vincenzo
> >> >> >>
> >> >> >> 2016-12-21 20:30 GMT+01:00 Luigi Rizzo :
> >> >> >>> On Wed, Dec 21, 2016 at 11:15 AM, Vincenzo Maffione
> >> >> >>>  wrote:
> >> >> >>>> Hi,
> >> >> >>>>   There is no commit related to that in the FreeBSD svn or git.
> >> >> >>>>
> >> >> >>>> The fix has been published to the github netmap repository here
> >> >> >>>> (branch master): https://github.com/luigirizzo/netmap
> >> >> >>>>
> >> >> >>>> What we should do is to import all the recent updates from the 
> >> >> >>>> github
> >> >> >>>> into HEAD. I can prepare a patch for HEAD, if you wish. Just let me
> >> >> >>>> know.
> >> >> >>>
> >> >> >>> I just checked and the diff between FreeBSD head and netmap head
> >> >> >>> in github is almost 3k lines due to a lot of recent refactoring.
> >> >> >>> So, if there is an easy way to extract just the locking change that 
> >> >> >>> would
> >> >> >>> be preferable as an interim solution.
> >> >> >>>
> >> >> >>> cheers
> >> >> >>> luigi
> >> >> >>>
> >> >> >>>>
> >> >> >>>> Cheers,
> >> >> >>>>   Vincenzo
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> 2016-12-20 21:45 GMT+01:00 Adrian Chadd :
> >> >> >>>>> hi,
> >&

Re: cxgbe's native netmap support broken since r307394

2016-12-28 Thread John Baldwin
On Wednesday, December 28, 2016 07:25:22 PM Vincenzo Maffione wrote:
> Hi,
>   The "worker_lock" is taken by nm_os_kthread_wakeup_worker(), which
> in turn is called by nm_pt_host_notify(). The latter function is a
> callback that may be called by a driver interrupt service routine;
> more precisely this happens when the driver calls netmap_tx_irq() or
> netmap_rx_irq(). As far as I know in FreeBSD it is not possible to
> lock a MTX_DEF mtx inside an ISR. Am I wrong on this?

It depends.  Most interrupt handlers run in ithreads and can use MTX_DEF.
Only interrupt handlers that run from a filter are restricted to using
MTX_SPIN.  Looking at all the calls to netmap_[tr]x_irq() in the tree,
they are all done from contexts that are safe to use MTX_DEF (and in
general they have to be as the equivalent code for the non-netmap case
is calling routines like if_input or m_freem that can't be invoked from
a filter either).

> 
> Cheers,
>   Vincenzo
> 
> 2016-12-28 19:06 GMT+01:00 John Baldwin :
> > Why are you using MTX_SPIN?  Changing the lock type to MTX_DEF would seem to
> > be a smaller patch and probably more correct for FreeBSD.
> >
> > On Thursday, December 22, 2016 05:29:41 PM Luigi Rizzo wrote:
> >> sure go ahead and thank you!
> >>
> >> On Thu, Dec 22, 2016 at 5:15 PM, Adrian Chadd  
> >> wrote:
> >> > ok, does anyone mind if I commit it as-is?
> >> >
> >> >
> >> > -a
> >> >
> >> >
> >> > On 21 December 2016 at 13:37, Vincenzo Maffione  
> >> > wrote:
> >> >> Hi Luigi,
> >> >>   I attached a minimal change containing two fixes:
> >> >>
> >> >> - change IFNET_WLOCK into IFNET_RLOCK, to fix the cxgbe issue related
> >> >> to this thread
> >> >> - use the proper locking functions for the "worker_lock", unrelated
> >> >> but needed to avoid the O.S. to trap because of a mismatch between
> >> >> MTX_SPIN and MTX_DEF.
> >> >>
> >> >> Cheers,
> >> >>   Vincenzo
> >> >>
> >> >> 2016-12-21 20:30 GMT+01:00 Luigi Rizzo :
> >> >>> On Wed, Dec 21, 2016 at 11:15 AM, Vincenzo Maffione
> >> >>>  wrote:
> >> >>>> Hi,
> >> >>>>   There is no commit related to that in the FreeBSD svn or git.
> >> >>>>
> >> >>>> The fix has been published to the github netmap repository here
> >> >>>> (branch master): https://github.com/luigirizzo/netmap
> >> >>>>
> >> >>>> What we should do is to import all the recent updates from the github
> >> >>>> into HEAD. I can prepare a patch for HEAD, if you wish. Just let me
> >> >>>> know.
> >> >>>
> >> >>> I just checked and the diff between FreeBSD head and netmap head
> >> >>> in github is almost 3k lines due to a lot of recent refactoring.
> >> >>> So, if there is an easy way to extract just the locking change that 
> >> >>> would
> >> >>> be preferable as an interim solution.
> >> >>>
> >> >>> cheers
> >> >>> luigi
> >> >>>
> >> >>>>
> >> >>>> Cheers,
> >> >>>>   Vincenzo
> >> >>>>
> >> >>>>
> >> >>>> 2016-12-20 21:45 GMT+01:00 Adrian Chadd :
> >> >>>>> hi,
> >> >>>>>
> >> >>>>> What's the commit? We should get it into -HEAD asap.
> >> >>>>>
> >> >>>>>
> >> >>>>> -adrian
> >> >>>>>
> >> >>>>>
> >> >>>>> On 20 December 2016 at 01:25, Vincenzo Maffione 
> >> >>>>>  wrote:
> >> >>>>>> Ok, applied to the netmap github repo.
> >> >>>>>> This fix will be published when Luigi does the next commit on 
> >> >>>>>> FreeBSD.
> >> >>>>>>
> >> >>>>>> Cheers,
> >> >>>>>>   Vincenzo
> >> >>>>>>
> >> >>>>>> 2016-12-19 20:05 GMT+01:00 Navdeep Parhar :
> >> >>>>>>> IFNET_RLOCK will work, thanks.
> >> >>>>>>>
> >> >>>>>>> Navdeep
> >>

aio_write and PRUS_MORETOCOME

2016-12-28 Thread John Baldwin
For folks who are interested, I have uploaded a patch to phabricator that
will set PRUS_MORETOCOME when there are multiple writes queued up to a
socket via aio_write().  This mostly matters for TCP.  Silly arc wouldn't
let me add 'network' to the review, but you can find the revision at

https://reviews.freebsd.org/D8955

if you'd like to comment/review.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: cxgbe's native netmap support broken since r307394

2016-12-28 Thread John Baldwin
t;>>> The last major update to netmap (r307394 and followups) broke 
> >>>>>>>>> cxgbe's
> >>>>>>>>> native netmap support.  The problem is that netmap_hw_reg now holds 
> >>>>>>>>> an
> >>>>>>>>> rw_lock around the driver's netmap_on/off routines.  It has always 
> >>>>>>>>> been
> >>>>>>>>> safe for the driver to sleep during these operations but now it 
> >>>>>>>>> panics
> >>>>>>>>> instead.
> >>>>>>>>>
> >>>>>>>>> Why is IFNET_WLOCK needed here?  It seems like a regression to 
> >>>>>>>>> disallow
> >>>>>>>>> sleep on the control path.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Navdeep
> >>>>>>>>>
> >>>>>>>>> begin_synchronized_op with the following non-sleepable locks held:
> >>>>>>>>> exclusive rw ifnet_rw (ifnet_rw) r = 0 (0x8271d680) locked @
> >>>>>>>>> /root/ws/head/sys/dev/netmap/netmap_freebsd.c:95
> >>>>>>>>> stack backtrace:
> >>>>>>>>> #0 0x810837a5 at witness_debugger+0xe5
> >>>>>>>>> #1 0x81084d88 at witness_warn+0x3b8
> >>>>>>>>> #2 0x83ef2bcc at begin_synchronized_op+0x6c
> >>>>>>>>> #3 0x83f14beb at cxgbe_netmap_reg+0x5b
> >>>>>>>>> #4 0x809846f1 at netmap_hw_reg+0x81
> >>>>>>>>> #5 0x809806de at netmap_do_regif+0x19e
> >>>>>>>>> #6 0x8098121d at netmap_ioctl+0x7ad
> >>>>>>>>> #7 0x8098682f at freebsd_netmap_ioctl+0x5f
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Vincenzo Maffione
> >>>>>>>> ___
> >>>>>>>> freebsd-net@freebsd.org mailing list
> >>>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> >>>>>>>> To unsubscribe, send any mail to 
> >>>>>>>> "freebsd-net-unsubscr...@freebsd.org"
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Vincenzo Maffione
> >>>>>> ___
> >>>>>> freebsd-net@freebsd.org mailing list
> >>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> >>>>>> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Vincenzo Maffione
> >>>
> >>>
> >>>
> >>> --
> >>> -+---
> >>>  Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
> >>>  http://www.iet.unipi.it/~luigi/. Universita` di Pisa
> >>>  TEL  +39-050-2217533   . via Diotisalvi 2
> >>>  Mobile   +39-338-6809875   . 56122 PISA (Italy)
> >>> -+---
> >>
> >>
> >>
> >> --
> >> Vincenzo Maffione
> 
> 
> 
> 


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] [Accepted] D5853: dhclient: Log a warning instead of bailing upon "illegal" options

2016-04-18 Thread jhb (John Baldwin)
jhb accepted this revision.

REVISION DETAIL
  https://reviews.freebsd.org/D5853

EMAIL PREFERENCES
  https://reviews.freebsd.org/settings/panel/emailpreferences/

To: sepherosa_gmail.com, network, secteam, delphij, glebius, adrian, 
honzhan_microsoft.com, howard0su_gmail.com, decui_microsoft.com, 
freebsd-net-list, pkelsey, gnn, jhb
Cc: sbruno
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Some MSI are not routed correctly

2015-10-21 Thread John Baldwin
On Wednesday, October 21, 2015 11:29:17 AM Maxim Sobolev wrote:
> Yes, I do. However, please note that for some reason they are not using
> nearly as much CPU time as the other 4 for some reason.
> 
>11 root  -92- 0K  1104K WAIT3  95.3H  28.96%
> intr{irq267: igb0:que}
>11 root  -92- 0K  1104K WAIT1  95.5H  24.41%
> intr{irq265: igb0:que}
>11 root  -92- 0K  1104K CPU22  95.2H  23.73%
> intr{irq266: igb0:que}
>11 root  -92- 0K  1104K WAIT0  95.2H  23.05%
> intr{irq264: igb0:que}
>11 root  -92- 0K  1104K WAIT6 286:37   1.12%
> intr{irq271: igb1:que}
>11 root  -92- 0K  1104K WAIT7 278:05   1.12%
> intr{irq272: igb1:que}
>11 root  -92- 0K  1104K WAIT5 284:26   1.07%
> intr{irq270: igb1:que}
>11 root  -92- 0K  1104K WAIT4 290:41   0.98%
> intr{irq269: igb1:que}
> 
> CPU 0:   0.0% user,  0.0% nice,  0.9% system, 24.9% interrupt, 74.2% idle
> CPU 1:   0.5% user,  0.0% nice,  0.0% system, 26.3% interrupt, 73.2% idle
> CPU 2:   0.0% user,  0.0% nice,  1.4% system, 25.4% interrupt, 73.2% idle
> CPU 3:   0.0% user,  0.0% nice,  0.5% system, 23.9% interrupt, 75.6% idle
> CPU 4:   0.9% user,  0.0% nice,  2.3% system,  2.3% interrupt, 94.4% idle
> CPU 5:   1.4% user,  0.0% nice,  4.2% system,  4.2% interrupt, 90.1% idle
> CPU 6:   1.4% user,  0.0% nice,  3.8% system,  1.4% interrupt, 93.4% idle
> CPU 7:   2.8% user,  0.0% nice,  0.0% system,  3.8% interrupt, 93.4% idle
> 
> 34263 igb0:que 0
> 32308 igb0:que 1
> 35022 igb0:que 2
> 34593 igb0:que 3
> 14931 igb1:que 0
> 13059 igb1:que 1
> 12971 igb1:que 2
> 13032 igb1:que 3
> 
> So I guess interrupts are routed correctly after all, but for some reason
> driver takes some 5 times less time to process it on cpus 4-7
> (per-interrupt). Weird.

Are the pps rates the same?  It seems like the interrupt rates on igb0
are double those of igb1?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Some MSI are not routed correctly

2015-10-21 Thread John Baldwin
On Tuesday, October 20, 2015 06:31:47 PM Maxim Sobolev wrote:
> Here you go:
> 
> $ sudo procstat -S 11
>   PIDTID COMM TDNAME   CPU CSID CPU MASK
>11 100082 intr irq269: igb1:que   41 4
>11 100084 intr irq270: igb1:que   51 5
>11 100086 intr irq271: igb1:que   61 6
>11 100088 intr irq272: igb1:que   71 7

These are clearly what you want, and you can see that the last CPU they
ran on is the CPU you want as well.  If you run 'top -SHz' do you see
the threads running on other CPUs?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Question on mbufs

2015-10-20 Thread John Baldwin
On Friday, October 16, 2015 05:56:33 PM David Somayajulu wrote:
> Hi All,
> When indicating a chain of mbufs to the network via ifp->if_input(), what are 
> the repercussions of setting M_PKTHDR bit in all the mbufs in the chain, 
> instead of just the first mbuf ?
> Thanks
> David S.

Right now the various input routines do not expect a chain of packets and
probably assume that the chain is a single packet.  I assume you mean that
the chain here represents a single mbuf still and they are chained via m_next
and not m_nextpkt?  I suspect that this is probably fine and that most things
will only check for M_PKTHDR and look for the fields in the first mbuf in
the chain.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Some MSI are not routed correctly

2015-10-19 Thread John Baldwin
On Thursday, October 08, 2015 07:33:27 AM Maxim Sobolev wrote:
> Hi John & others,
> 
> We've came across a weird MSI routing issue on one of our newest dual
> E5-2690v3 (haswell) Supermicro X10DRL-i boxes running latest 10.2-p4. It is
> fitted with dual port Intel I350 card, in addition to the built-in I210
> chip that is not used. The hw.igb.num_queues is set to 4, and the driver
> reports binding to the CPUs 0-3 for the first port and CPUs 4-7 for the
> second, however when verified with top -P under the load, interrupts are
> only delivered to the CPUs 0-3, no interrupt time is recorded on the CPUs
> 4-7. systat -vm shows that all 8 queues are firing interrupts, so my guess
> that for whatever reason bus_bind_intr() is not doing what's expected to do
> for half of those interrupts.
> 
> What's interesting is that on a similar box (same chassis/mobo/cpu) but
> equipped with the quad-port X540-AT2 10Gig card, interrupts are routed
> properly. The latter is running with hw.ix.num_queues="3".
> 
> pcib2:  port 0xcf8-0xcff on acpi0
> pci0:  on pcib2
> pcib3:  irq 26 at device 1.0 on pci0
> pci1:  on pcib3
> igb0:  mem
> 0xc720-0xc72f,0xc7304000-0xc7307fff irq 26 at device 0.0 on pci1
> igb0: Using MSIX interrupts with 5 vectors
> igb0: Ethernet address: a0:36:9f:76:af:20
> igb0: Bound queue 0 to cpu0
> igb0: Bound queue 1 to cpu1
> igb0: Bound queue 2 to cpu2
> igb0: Bound queue 3 to cpu3
> igb0: netmap queues/slots: TX 4/4096, RX 4/4096
> igb1:  mem
> 0xc710-0xc71f,0xc730-0xc7303fff irq 28 at device 0.1 on pci1
> igb1: Using MSIX interrupts with 5 vectors
> igb1: Ethernet address: a0:36:9f:76:af:21
> igb1: Bound queue 0 to cpu4
> igb1: Bound queue 1 to cpu5
> igb1: Bound queue 2 to cpu6
> igb1: Bound queue 3 to cpu7
> igb1: netmap queues/slots: TX 4/4096, RX 4/4096
> 
> pcib2:  port 0xcf8-0xcff on acpi0
> pci0:  on pcib2
> pcib3:  irq 26 at device 1.0 on pci0
> pci1:  on pcib3
> pcib4:  irq 32 at device 2.0 on pci0
> pci2:  on pcib4
> pcib5:  irq 40 at device 3.0 on pci0
> pci3:  on pcib5
> ix0:  port
> 0x6020-0x603f mem 0xc7c0-0xc7df,0xc7e04000-0xc7e07fff irq 40 at
> device 0.0 on pci3
> ix0: Using MSIX interrupts with 4 vectors
> ix0: Bound queue 0 to cpu 0
> ix0: Bound queue 1 to cpu 1
> ix0: Bound queue 2 to cpu 2
> ix0: Ethernet address: 0c:c4:7a:5e:be:64
> ix0: PCI Express Bus: Speed 5.0GT/s Width x8
> ix0: netmap queues/slots: TX 3/4096, RX 3/4096
> ix1:  port
> 0x6000-0x601f mem 0xc7a0-0xc7bf,0xc7e0-0xc7e03fff irq 44 at
> device 0.1 on pci3
> ix1: Using MSIX interrupts with 4 vectors
> ix1: Bound queue 0 to cpu 3
> ix1: Bound queue 1 to cpu 4
> ix1: Bound queue 2 to cpu 5
> ix1: Ethernet address: 0c:c4:7a:5e:be:65
> ix1: PCI Express Bus: Speed 5.0GT/s Width x8
> ix1: netmap queues/slots: TX 3/4096, RX 3/4096
> 
> Some extra debug is here:
> 
> http://sobomax.sippysoft.com/haswell_bug/bad.dmesg
> http://sobomax.sippysoft.com/haswell_bug/lstopo_bad.png
> http://sobomax.sippysoft.com/haswell_bug/systat_vm_bad.png
> http://sobomax.sippysoft.com/haswell_bug/top_P_bad.png
> 
> http://sobomax.sippysoft.com/haswell_bug/good.dmesg
> http://sobomax.sippysoft.com/haswell_bug/lstopo_good.png
> http://sobomax.sippysoft.com/haswell_bug/systat_vm_good.png
> http://sobomax.sippysoft.com/haswell_bug/top_P_good.png
> 
> Any ideas on how to debug that further are welcome. The box in the
> production, but we can remove traffic during off-peak to run some
> test/debug code on.

Can you get procstat -S output for the interrupt threads?  (Usually interrupt
threads are in pid 12, so 'procstat -S 12' would suffice.)

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Can tcp_close() be called in INP_TIMEWAIT case

2015-09-25 Thread John Baldwin
On Thursday, September 24, 2015 02:13:24 PM Julien Charbon wrote:
>  So the issue is:
> 
>  - tcp_close() is called for some reasons with an inp in INP_TIMEWAIT
> state and sets the INP_DROPPED flag,
>  - tcp_detach() is called when the last reference on socket is dropped
> 
>  then now in_pcbfree() can be called twice instead of once:
> 
>  1. First in tcp_detach():
> 
> static void
> tcp_detach(struct socket *so, struct inpcb *inp)
> {
> struct tcpcb *tp;
> tp = intotcpcb(inp);
> 
> if (inp->inp_flags & INP_TIMEWAIT) {
> if (inp->inp_flags & INP_DROPPED) {
> in_pcbdetach(inp);
> in_pcbfree(inp); <--
> }
> 
>  2. Second when tcptw expires here:
> 
> void
> tcp_twclose(struct tcptw *tw, int reuse)
> {
> struct socket *so;
> struct inpcb *inp;
> 
> inp = tw->tw_inpcb;
> 
> tcp_tw_2msl_stop(tw, reuse);
> inp->inp_ppcb = NULL;
> in_pcbdrop(inp);
> 
> so = inp->inp_socket;
> if (so != NULL) {
> ...
> } else {
> in_pcbfree(inp); <--
> }
> 
>  This behavior is backed by Palle kernel panic backstraces and coredumps.
> 
>  o Solutions:
> 
>  Long:  Forbid to call tcp_close() when inp is in INP_TIMEWAIT state,
> the TCP stack rule being:
> 
>  - if !INP_TIMEWAIT: Call tcp_close()
>  - if INP_TIMEWAIT: Call tcp_twclose() (or call nothing, the tcptw will
> expire/be recycled anyway)
> 
>  Short:
>   if INP_TIMEWAIT & INP_DROPPED:
> Do not call in_pcbfree(inp) in tcp_detach() unless tcptw is already
> discarded.
> 
>  The long solution seems cleaner, backed by tcp_detach() old comments
> and most of current tcp_close() calls but I won't take that longer path
> without -net approval first.

I prefer the longer solution if it keeps tcp_detach() simpler by avoiding
an extra condition.  Please just document it via assertions in tcp_close()
(or is this the assertion that fired and triggered the reported panic?  If so,
then you obviously don't need to add it. :-P)

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Locking Memory Question

2015-07-30 Thread John Baldwin
On Wednesday, July 29, 2015 11:28:03 PM K. Macy wrote:
> >>
> >> Im not clear how I'd do that. the data being passed up from the kernel is 
> >> a variable size. To use copyout I'd have to pass a
> >> pointer with a static buffer, right?
> >
> > Correct, you can pass along the size, and if it's not large enough
> > try again... Less than ideal...
> >
> >> Is there a way to malloc user space memory from within an ioctl call?
> >
> > Well, it is possible that you could do the equivalent of mmap, and pass
> > the address back along w/ a length, and leave it up to the user to
> > munmap it...  This is probably the nicest method if you the size is
> > really largely variable, and it's expensive if the userland process
> > allocated too much memory...  The down side is that this is more
> > complex to code...
> >
> 
> 
> Mach has the ability to send large "out of line messages". For smaller
> messages where it doesn't do VM tricks to avoid copying it does
> exactly this. In the receive path the kernel calls vm_allocate (which
> is essentially just a wrapper for mmap) then copies the buffer in to
> the newly allocated address space. The message itself contains the
> allocated address and size of the allocation. The receiver is expected
> to call vm_deallocate (munmap) when it's done with the data.
> 
> The implementation is mixed in with enough other code that it may not
> be a useful reference. Nonetheless, I wanted to point at that this
> isn't as strange as it might sound.

You can do this in FreeBSD by calling vm_mmap() with a NULL handle pointer
to simulate a MAP_ANON mapping.  Something like this:

vm_mmap(&curproc->p_vmspace->vm_map, &addr, , VM_PROT_READ |
VM_PROT_WRITE, VM_PROT_READ | VM_PROT_WRITE, MAP_SHARED, 
OBJT_DEFAULT,
NULL, 0);

It's not great for a true shared memory buffer, but is fine for a one-time
copy.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Locking Memory Question

2015-07-29 Thread John Baldwin
okup_userbuf(*offset);

/*
 * Clear the offset to zero as it will now be relative to
 * the object we are returning.
 */
*offset = 0;
vm_object_reference(ub->obj);
*object = ub->obj;
}

A second option you can use is to instead have userland allocate a
VM object via shm_open() and then map that into the kernel in your
driver.  There are helper routines in uipc_shm.c to facilitate this
that I've used in some out-of-tree code before.  Something like this:

Header:

struct foo_mapbuf {
int fd;
size_t len;
};

User code:

struct foo_mapbuf mb;
int devfd;

devfd = open("/dev/foo", );
ioctl(devfd, BUFFER_SIZE, &mb.len);

mb.fd = shm_open(SHM_ANON, O_RDWR, 0600);
ftruncate(mb.fd, mb.len);

ioctl(devfd, MAP_BUFFER, &mb);

p = mmap(..., mb.fd, ...);

Driver code:

struct foo_userbuf {
struct file *fp;
void *kmem;
size_t size;
};

int
foo_ioctl(...)
{
struct foo_mapbuf *mb;
struct foo_userbuf *ub;

switch (cmd) {
case BUFFER_SIZE:
/* return desired size */
case MAP_BUFFER:
mb = (struct foo_mapbuf *)data;

ub = ;

/* fget takes a few more parameters you'll need to work out */
ub->fp = fget(mb->fd);
ub->size = mb->size;

/* This assumes a starting offset of 0 */
shm_map(ub->fp, ub->size, 0, &ub->mem);

/*
 * Can now access buffer in kernel via 'ub->mem'
 * pointer, and pages are wired until released
 * by a call to shm_unmap().
 */
}

You will want some way to handle unmapping the buffer either via devfs_priv
dtor method, d_close or something else to avoid leaking the kernel mappings.
This has the advantage over the first approach that it will keep the pages
around until all mappings are gone, though once all the kernel mappings are
gone the pages will no longer be wired (though they will be swap-backed).

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Adding new media types to if_media.h

2015-03-12 Thread John Baldwin
On Thursday, March 12, 2015 04:07:54 PM Hans Petter Selasky wrote:
> On 02/28/15 13:28, John Baldwin wrote:
> > On Friday, February 27, 2015 10:23:10 PM Gleb Smirnoff wrote:
> >> On Thu, Feb 26, 2015 at 08:25:59PM -0800, Adrian Chadd wrote:
> >> A> [snip]
> >> A>
> >> A> I think Mike's approach is good - it makes it easy to MFC to 10.2
> >> A> since there's extended lifecycle stuff to do there - and then we can
> >> A> plan out how do the "betterer" fix after it's landed and churned
> >> A> things.
> >>
> >> ... and we will be ought to support the "betterer" fix along with
> >> the "not so betterer" for a very long time.
> >>
> >> The rock on which we split in this argument is that some developers
> >> write their code for stable/x and then forward-port it to head,
> >> focused on quality of result for stable/x; while other developers
> >> do the opposite: write code to head, then consider or not consider
> >> merging it stable/x.
> >
> > No, this is not quite true.  Some folks have to write drivers on HEAD but 
> > also
> > support running those drivers on older branches.  The MFC's get harder when
> > you have very different APIs on the different branches.  It's already harder
> > to test stat changes now since it requires completely different patches for
> > <= 10 (the only thing people are supposed to use in production) vs head due 
> > to
> > if_getcounter() and friends.  Also, since 11 won't be out until 2016, that 
> > is
> > far, far too long to wait for more media types.  The stuff we need to 
> > support
> > is already shipping in products today.  We can't not support these in 10 
> > (and
> > possibly 9).
> >
> 
> Any news on this issue? Is anyone working on a solution for -head ?

I believe a variant of Mike's patch is in phabricator now?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Adding new media types to if_media.h

2015-02-28 Thread John Baldwin
On Friday, February 27, 2015 10:23:10 PM Gleb Smirnoff wrote:
> On Thu, Feb 26, 2015 at 08:25:59PM -0800, Adrian Chadd wrote:
> A> [snip]
> A>
> A> I think Mike's approach is good - it makes it easy to MFC to 10.2
> A> since there's extended lifecycle stuff to do there - and then we can
> A> plan out how do the "betterer" fix after it's landed and churned
> A> things.
> 
> ... and we will be ought to support the "betterer" fix along with
> the "not so betterer" for a very long time.
> 
> The rock on which we split in this argument is that some developers
> write their code for stable/x and then forward-port it to head,
> focused on quality of result for stable/x; while other developers
> do the opposite: write code to head, then consider or not consider
> merging it stable/x.

No, this is not quite true.  Some folks have to write drivers on HEAD but also
support running those drivers on older branches.  The MFC's get harder when
you have very different APIs on the different branches.  It's already harder
to test stat changes now since it requires completely different patches for
<= 10 (the only thing people are supposed to use in production) vs head due to
if_getcounter() and friends.  Also, since 11 won't be out until 2016, that is 
far, far too long to wait for more media types.  The stuff we need to support 
is already shipping in products today.  We can't not support these in 10 (and 
possibly 9).

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Accessing socket APIs soon after accept

2015-02-27 Thread John Baldwin
On Friday, February 27, 2015 10:32:17 AM Adrian Chadd wrote:
> On 27 February 2015 at 10:07, John Baldwin  wrote:
> > On Friday, February 27, 2015 10:03:33 AM Adrian Chadd wrote:
> >> Is this also a bug on -9 and -10?
> > 
> > Yes.  I may merge just the tcp_syncache.c part of this change down to
> > stable branches.
> 
> Cool, thanks.
> 
> Placing half-completed connections on the queue always looked a bit odd to
> me..

So this appears stranger.  Supposedly, the tcbinfo global lock should have 
fixed this race.

In particular, in 8.x, tcp_intput holds a write lock on the tcbinfo lock 
around all of syncache_expand() including all of syncache_socket() from 
sonewconn() on down to the end of the function not releasing it until after 
the addresses are all set, etc.  tcp_usr_accept() on 8.x acquires a read lock 
on the tcbinfo global lock, so if accept() races with syncache_socket(), even 
though accept() might dequeue the socket from sq_comp before the socket is 
fully constructed, the call to soaccept() inside of accept() should call 
tcp_usr_accept() which will try to read-lock the tcbinfo lock and will thus 
block until syncache_socket() has completed.  Thus, you shouldn't be able to 
have accept() return before syncache_socket() has finished.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Accessing socket APIs soon after accept

2015-02-27 Thread John Baldwin
On Friday, February 27, 2015 10:03:33 AM Adrian Chadd wrote:
> Is this also a bug on -9 and -10?

Yes.  I may merge just the tcp_syncache.c part of this change down to stable 
branches.

> 
> -a
> 
> 
> On 27 February 2015 at 07:22, Quattlebaum, Ryan
> 
>  wrote:
> > Thanks, John. That's almost exactly the approach we were considering.
> > 
> > - Ryan Q
> > ________
> > From: John Baldwin 
> > Sent: Friday, February 27, 2015 10:20 AM
> > To: freebsd-net@freebsd.org
> > Cc: Quattlebaum, Ryan; Adrian Chadd
> > Subject: Re: Accessing socket APIs soon after accept
> > 
> > On Friday, January 16, 2015 05:07:28 PM Quattlebaum, Ryan wrote:
> >> Hi, Adrian. Thanks for taking a look at this.
> >> 
> >> We're using FreeBSD 8.2 and httpd-2.4.10 with arp-1.5.1 and
> >> apr-util-1.5.4.
> >> 
> >> The problem we're seeing is pretty intermittent, so I hope this test case
> >> can  shine a little bit of light on
> >> the
> >> problem. We tried debugging this on our own by adding calls to
> >> getsockname() right after the accept call (in
> >> srclib/apr/network_io/unix/sockets.c: apr_socket_accept()) and logging
> >> the
> >> output. That's where we saw invalid data.
> >> 
> >> I took a look at the source code for the TCP syncache module and the
> >> accept
> >> syscall. It looks like the new child socket is available for the
> >> application to accept after the call to sonewconn returns, but the
> >> address
> >> information isn't set until further down in the function. Wouldn't this
> >> open a window where an application could accept on a socket that the
> >> syncache code isn't done configuring?
> > 
> > This is a bug in 8.x it seems.  It was fixed in HEAD in this commit:
> > 
> > 
> > r261242 | gnn | 2014-01-28 15:28:32 -0500 (Tue, 28 Jan 2014) | 10 lines
> > 
> > Decrease lock contention within the TCP accept case by removing
> > the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
> > this was following the advice already in the code.
> > See the PR below for a full disucssion of this change and its
> > measured effects.
> > 
> > PR:     183659
> > Submitted by:   Julian Charbon
> > Reviewed by:jhb
> > 
> > In particular, that commit changed the syncache code to not place the
> > socket in the queue until the end of the function via soisconnected().
> > 
> > You can probably merge the tcp_syncache.c portion of that change back to
> > 8.x without any ill effects and it should fix your problem.
> > 
> > --
> > John Baldwin


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Accessing socket APIs soon after accept

2015-02-27 Thread John Baldwin
On Friday, January 16, 2015 05:07:28 PM Quattlebaum, Ryan wrote:
> Hi, Adrian. Thanks for taking a look at this.
> 
> We're using FreeBSD 8.2 and httpd-2.4.10 with arp-1.5.1 and apr-util-1.5.4.
> 
> The problem we're seeing is pretty intermittent, so I hope this test case
> can  shine a little bit of light on the
> problem. We tried debugging this on our own by adding calls to
> getsockname() right after the accept call (in
> srclib/apr/network_io/unix/sockets.c: apr_socket_accept()) and logging the
> output. That's where we saw invalid data.
> 
> I took a look at the source code for the TCP syncache module and the accept
> syscall. It looks like the new child socket is available for the
> application to accept after the call to sonewconn returns, but the address
> information isn't set until further down in the function. Wouldn't this
> open a window where an application could accept on a socket that the
> syncache code isn't done configuring?

This is a bug in 8.x it seems.  It was fixed in HEAD in this commit:


r261242 | gnn | 2014-01-28 15:28:32 -0500 (Tue, 28 Jan 2014) | 10 lines

Decrease lock contention within the TCP accept case by removing
the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
this was following the advice already in the code.
See the PR below for a full disucssion of this change and its
measured effects.

PR: 183659
Submitted by:   Julian Charbon
Reviewed by:jhb

In particular, that commit changed the syncache code to not place the socket 
in the queue until the end of the function via soisconnected().

You can probably merge the tcp_syncache.c portion of that change back to 8.x 
without any ill effects and it should fix your problem.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Adding new media types to if_media.h

2015-02-17 Thread John Baldwin
it is preferable to preserve a
> > > compatible driver KPI, which means using a scalar value encoding what 
> > > is
> > > necessary.  Although that interface is rather Ethernet-centric, that 
> > > is
> > > really what it is used for.
> > >
> > > An additional, selfish goal is to make it easy to back-port drivers 
> > > using
> > > the new interface to older versions (which I am quite likely to do).
> > > Preserving the KPI and general user API will be highly useful there.
> > > I'd be likely to do a 11-style version of ifconfig personally, but it
> > > might not be difficult to do in a more general way.
> > >
> > > I am willing to do a prototype for -current for evaluation.
> > >
> > > Comments, alternatives, ?
> 
> > I agree with your statements above and I'd like to see the prototype.
> 
> Well, I developed the prototype as I had planned, using a 64-bit media
> word, and found that I got about 100 files in GENERIC that didn't compile;
> they attempted to store "media words" in an int.  My kingdom for a typedef.
> That didn't meet my goal of KPI compatibility, so I went to Plan B.
> 
> Plan B is to steal an unused bit (RFU) to indicate an "extended" media
> type.  I then used the variant/subtype field to store the extended type.
> Effectively, the previously unused bit doubles the effective size of the
> subtype  field.  Given that the previous 5-bit field lasted us 18 years,
> I figured that doubling it would last a while.  I also changed the
> SIOGGIFMEDIA ioctl, splitting it for binary compatibility; extended
> types are all mapped to IFM_OTHER (31) using the old interface, but
> are visible using the new one.
> 
> With these changes, I modified one driver (vtnet) to use an extended type,
> and the rest of GENERIC is happy.  The changes to ifconfig are also fairly
> small.  The patch is appended, where email programs will screw it up,
> or at ftp://ftp.karels.net/outgoing/if_media.patch.
> 
> The VFAST subtype is a throw-away for testing.
> 
> This seems like a reasonably pragmatic change to support the new 40 Gb/s
> media types until someone wants to design an improved but non-backward-
> compatible interface.  I think it meets the goal of suitability for
> back-porting; it could be MFCed.

Seems like a reasonable next step to me.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] [Updated] D1777: Associated fix for arp/nd6 timer usage.

2015-02-04 Thread jhb (John Baldwin)
jhb added a comment.

This is just "How It Works".  You are always supposed to do a callout_drain() 
before freeing the storage belonging to a callout.  I don't understand how you 
are preventing the callout/lock being freed out from under the callout routine 
in this version either.  Now you can have this sequence:

a) softclock dequeues callout to run

b) other thread grabs lle_lock

c) softclock blocks on lle_wlock above

d) other thread tears down structure, unlocks lock, zeros memory, 0xdeadc0de, 
etc.

e) softclock wakes up in mutex code and panics becuase the mutex is destroyed 
and it either triggers an assertion, follows a bad pointer trying to propagate 
priority or see if the "owner" is running, etc.

You have to drain the callout somehow.  Hans other solution is to arrange to 
have a callback function do the free for you if you can't block in the context 
where you are trying to free the structure.

REVISION DETAIL
  https://reviews.freebsd.org/D1777

To: rrs, imp, sbruno, gnn, rwatson, lstewart, kostikbel, adrian, bz, jhb
Cc: bz, emaste, hiren, julian, hselasky, freebsd-net
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] [Commented On] D1711: Changes to the callout code to restore active semantics and also add a test-framework and test to validate thecallout code (and potentially for use by other tests)

2015-01-29 Thread jhb (John Baldwin)
jhb added a comment.

To be clear, I'm fine with updating my tests to a different framework, but I 
think it's worth discussing what that looked like.

I also had to explicitly drop Giant in my test module handler.

I do think it's probably better to explicitly ask tests to run instead of 
having kldload do them, that just seemed like less hassle to implement.

REVISION DETAIL
  https://reviews.freebsd.org/D1711

To: rrs, gnn, rwatson, adrian, sbruno, lstewart, imp, hselasky
Cc: jhb, kostikbel, emaste, delphij, neel, erj, freebsd-net
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[Differential] [Changed Subscribers] D1711: Changes to the callout code to restore active semantics and also add a test-framework and test to validate thecallout code (and potentially for use by other

2015-01-29 Thread jhb (John Baldwin)
jhb added a subscriber: jhb.
jhb added a comment.

Hmm, I do think the idea of a kernel test framework should be discussed in its 
own right.  I have implemented a much simpler one on my own for unit tests of 
locking primitives that you can see here.  These worked by declaring tests in 
linker sets in a kernel module.  Loading the module implicitly runs the tests.  
I have other changes in this branch to allow "catching" panics like exceptions 
and similar changes for witness as well.

The test framework (much simpler) can be found at:

http://p4db.freebsd.org/fileDownLoad.cgi?FSPC=//depot/user/jhb/lock/kern/subr_test.c&REV=6
http://p4db.freebsd.org/fileDownLoad.cgi?FSPC=//depot/user/jhb/lock/sys/unittest.h&REV=2

Sample test modules can be found here:

http://p4db.freebsd.org/depotTreeBrowser.cgi?FSPC=//depot/user/jhb/lock/modules/test&HIDEDEL=NO

I think "test" is probably a better name for a subdir of test modules as it is 
more consistent with what we are doing in userland with kyua.

REVISION DETAIL
  https://reviews.freebsd.org/D1711

To: rrs, gnn, rwatson, adrian, sbruno, lstewart, imp, hselasky
Cc: jhb, kostikbel, emaste, delphij, neel, erj, freebsd-net
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Issue with forwarding when creates new interface [was USB Tethering and forwarding]

2015-01-13 Thread John Baldwin
On 1/3/15 1:06 PM, Mike Tancsa wrote:
> On 1/3/2015 9:19 AM, Paul Thornton wrote:
>> Hi,
>>
>> I can also replicate this behaviour on 10.1-RELEASE by simply creating
>> an additional vlan interface.  It affects IPv4 and IPv6 forwarding.
> 
> Strange, I dont see that on RELENG_10
> 
> 0{marble}# ifconfig em2 up
> 0{marble}# ifconfig em2.3 create 1.1.1.2/24
> 0{marble}# sysctl -a | grep forwarding
> net.inet.ip.forwarding: 1
> net.inet.ip.fastforwarding: 0
> net.inet6.ip6.forwarding: 0
> 0{marble}# ifconfig vlan4 create 2.2.2.2 vlan 4 vlandev em2
> 0{marble}# sysctl -a | grep forwarding
> net.inet.ip.forwarding: 1
> net.inet.ip.fastforwarding: 0
> net.inet6.ip6.forwarding: 0
> 0{marble}#
> 
> do you set forwarding via just /etc/sysctl.conf or in /etc/rc.conf via
> ipv6_gateway_enable and gateway_enable. I seem to recall some discussion
> about there being a difference.  Perhaps devd is calling something that
> then fiddles with the setting ignoring whats in sysctl.conf ?

Yes, devd is running /etc/rc.d/netif start  which probably checks
gateway_enable and sets the sysctl based on that overriding what it in
sysctl.conf.  Just set gateway_enable=YES in rc.conf instead.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: compiling on nfs directories

2014-12-16 Thread John Baldwin
On Monday, December 15, 2014 3:59:29 pm Rick Macklem wrote:
> Mehmet Erol Sanliturk wrote:
> > On Mon, Dec 15, 2014 at 1:24 AM, Gerrit KĂĽhn
> > 
> > wrote:
> > >
> > > Hi all,
> > >
> > > I ran into some weird issue here last week:
> > > I have an NFS-Server for storage and diskless booting (pxe / nfs
> > > root)
> > > running under FreeBSD. The clients are running Gentoo Linux. Some
> > > time
> > > ago, I replaced the server, going from a HDD-based storage array
> > > (ZFS)
> > > under FreeBSD 8.3 to an SSD-based array under FreeBSD 10-stable (as
> > > of
> > > February this year - I know this needs updates).
> > >
> > > Only now I recognized that this somehow appears to have broken some
> > > of my
> > > Gentoo ebuilds that do not install cleanly anymore. They complain
> > > about
> > > "soiled libtool library files found" and "insecure RUNPATHs" in the
> > > installation stage of shared libs.
> > >
> > > I was not able to find any useful solution for this in the Net so
> > > far.
> > > However, I was able to verify that this is somehow an issue with
> > > the nfs
> > > server by plugging in a USB-drive into the diskless clients and
> > > mounting
> > > this as /var/tmp/portage (the directory structure where Gentoo's
> > > ebuilds
> > > are compiled). This makes the error messages go away, and
> > > everything works
> > > again (like it did before the server update).
> > >
> > > Are there any suggestions what might be causing this and how to fix
> > > it?
> > >
> > >
> > > cu
> > >   Gerrit
> > >
> > >
> > 
> > 
> > With respect to information given in your message , may pure guess is
> > the
> > following :
> > 
> > 
> > When a client generates a file in NFS server , it assumes that
> > everything
> > is written into the file .
> > The next step ( reading the generated file ) starts , BUT the file is
> > NOT
> > completely written into disk yet ,
> > therefore it reads an incomplete file which causes errors in the
> > client .
> > 
> Well, not exactly. The NFS client chooses whether or not the written
> data must be committed to stable storage (disk) right away via a flag
> argument on the write. It is up to the client to keep track of what has
> been written and if the FILE_STABLE flag wasn't set, must do a separate
> Commit RPC to force the data to stable storage on the server.
> It is also up to the NFS client to keep track of the file's size while
> it is being grown, since the NFS server's size may be smaller until
> the data gets written to the server.
> Also, note that he didn't see the problem with FreeBSD8.3, which would
> have been following the same rules on the server as 10.1.
> 
> What I suspect might cause this is one of two things:
> 1 - The modify time of the file is now changing at a time the Linux
>     client doesn't expect, due to changes in ZFS or maybe TOD clock
> resolution. (At one time, the TOD clock was only at a resolution
> of 1sec, so the client wouldn't see the modify time change often.
> I think it is now at a much higher resolution, but would have to
> look at the code/test to be sure.)

No, it's still only a second resolution on FreeBSD by default.  You can
make this precise on the NFS server by setting the vfs.timestamp_precision
sysctl to 3.  We should probably be using that by default for at least
server-class systems.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: Adding new media types to if_media.h

2014-12-12 Thread John Baldwin
On Friday, December 12, 2014 12:26:24 PM Jack Vogel wrote:
> I think I'd go along with Mike, keeping it simpler seems like a good idea.
> 
> Jack

If the userland ABI impact isn't too broad I think this is fine.  Mike, do you
know off hand how many user-facing things would be affected?

> On Fri, Dec 12, 2014 at 6:39 AM, Mike Karels  wrote:
> > > Any other thoughts, or should I start looking at seeing what I can take
> > > copy from net80211?
> > > 
> > > Also adding -net, since this is pretty relevant.
> > 
> > I had the same reaction as Adrian initially, as an int with numerous
> > fields
> > seems really messy.  However, I don't think we have the challenges of
> > 802.11,
> > and the only real problem is that the subtype field has run out of bits.
> > And both ifconfig and the drivers are cast in the form of a scalar "media
> > word", with parameters to ifmedia_add like IFM_ETHER | IFM_1000_T |
> > IFM_FDX
> > (assuming that all the bits are in a scalar).
> > 
> > Instead, I would propose that we simply change the media word from 32 bits
> > to 64, and move the subtype from its current location to a new field (e.g.
> > 16 bits) in the new space.  I believe this can be KPI compatible, and it
> > is relatively easy to provide a backward-compatible ABI.  There should be
> > a reserved subtype for "other" that can be encoded in the existing field
> > for use when the subtype can't fit in the old field.  This seems much
> > easier,
> > can be KPI compatible, and will make it much easier to backport drivers.
> > A backport could simply define the new subtypes as "other", and the KPI
> > would still be compatible.
> > 
> > Thoughts?
> > 
> > Mike

-- 
John Baldwin

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: ixgbe(4) spin lock held too long

2014-10-24 Thread John Baldwin
On Thursday, October 23, 2014 02:12:44 PM Jason Wolfe wrote:
> On Sat, Oct 18, 2014 at 4:42 AM, John Baldwin  wrote:
> > On Friday, October 17, 2014 11:32:13 PM Jason Wolfe wrote:
> >> Producing 10G of random traffic against a server with this assertion
> >> added took about 2 hours to panic, so if it turns out we need anything
> >> further it should be pretty quick.
> >> 
> >> #4 list
> >> 2816 * timer and remember to restart (more output or
> >> persist). 2817 * If there is more data to be acked,
> >> restart retransmit 2818 * timer, using current
> >> (possibly backed-off) value. 2819 */
> >> 2820if (th->th_ack == tp->snd_max) {
> >> 2821tcp_timer_activate(tp, TT_REXMT, 0);
> >> 2822needoutput = 1;
> >> 2823} else if (!tcp_timer_active(tp, TT_PERSIST))
> >> 2824tcp_timer_activate(tp, TT_REXMT,
> >> tp->t_rxtcur);> 
> > Bah, this is just a bug in my assertion.  Rather than having a separate
> > tcp_timer_deactivate() routine, a delta of 0 passed to
> > tcp_timer_activate()
> > means "stop the timer".  My assertions were incorrect and need to exclude
> > the stop case.  Here is an updated patch (or you can just fix yours
> > locally):
> > 
> > Index: tcp_timer.c
> > ===
> > --- tcp_timer.c (revision 273219)
> > +++ tcp_timer.c (working copy)
> > @@ -869,10 +869,16 @@ tcp_timer_activate(struct tcpcb *tp, int timer_typ
> > 
> > case TT_REXMT:
> > t_callout = &tp->t_timers->tt_rexmt;
> > f_callout = tcp_timer_rexmt;
> > 
> > +   if (callout_active(&tp->t_timers->tt_persist) &&
> > +   delta != 0)
> > +   panic("scheduling retransmit with persist
> > active");> 
> > break;
> > 
> > case TT_PERSIST:
> > t_callout = &tp->t_timers->tt_persist;
> > f_callout = tcp_timer_persist;
> > 
> > +   if (callout_active(&tp->t_timers->tt_rexmt) &&
> > +   delta != 0)
> > +   panic("scheduling persist with retransmit
> > active");> 
> > break;
> > 
> > case TT_KEEP:
> > t_callout = &tp->t_timers->tt_keep;
> > 
> > --
> > John Baldwin
> 
> John,
> 
> panic: tcp_setpersist: retransmit pending
> 
> (kgdb) bt
> #0  doadump (textdump=1) at pcpu.h:219
> #1  0x806facb1 in kern_reboot (howto=260) at
> /usr/src/sys/kern/kern_shutdown.c:452
> #2  0x806fb014 in panic (fmt=) at
> /usr/src/sys/kern/kern_shutdown.c:759
> #3  0x808467d3 in tcp_setpersist (tp=) at
> /usr/src/sys/netinet/tcp_output.c:1619
> #4  0x8084e7b6 in tcp_timer_persist (xtp=0xf804ec124c00)
> at /usr/src/sys/netinet/tcp_timer.c:467
> #5  0x8070d95e in softclock_call_cc (c=0xf804ec124ec0,
> cc=0x81263380, direct=0)
> at /usr/src/sys/kern/kern_timeout.c:687
> #6  0x8070dce4 in softclock (arg=) at
> /usr/src/sys/kern/kern_timeout.c:816
> #7  0x806d16f3 in intr_event_execute_handlers (p= optimized out>, ie=0xf80015214400)
> at /usr/src/sys/kern/kern_intr.c:1263
> #8  0x806d2056 in ithread_loop (arg=0xf800151f7ee0) at
> /usr/src/sys/kern/kern_intr.c:1276
> #9  0x806cf481 in fork_exit (callout=0x806d1fc0
> , arg=0xf800151f7ee0,
> frame=0xfe1f9e9b0ac0) at /usr/src/sys/kern/kern_fork.c:996
> #10 0x80a67c0e in fork_trampoline () at
> /usr/src/sys/amd64/amd64/exception.S:606
> 
> (kgdb) frame 3
> #3  0x808467d3 in tcp_setpersist (tp=) at
> /usr/src/sys/netinet/tcp_output.c:1619
> 1619 panic("tcp_setpersist: retransmit pending");
> (kgdb) list
> 1614int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
> 1615int tt;
> 1616
> 1617tp->t_flags &= ~TF_PREVVALID;
> 1618if (tcp_timer_active(tp, TT_REXMT))
> 1619 panic("tcp_setpersist: retransmit pending");
> 1620/*

Re: ixgbe(4) spin lock held too long

2014-10-18 Thread John Baldwin
On Friday, October 17, 2014 11:43:26 PM Adrian Chadd wrote:
> Hm, is this the bug that was just fixed in -HEAD?
> 
> I saw this similar bug on -HEAD with lots of quick connections and
> reused ports. It ended up deferencing a NULL tcp timer pointer from
> the inpcb. Is that what the code in your tree is doing?

This is not a NULL tcp timer pointer.  Instead, the retransmit timer is being 
armed while the persist timer is still armed.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: ixgbe(4) spin lock held too long

2014-10-18 Thread John Baldwin
On Friday, October 17, 2014 11:32:13 PM Jason Wolfe wrote:
> Producing 10G of random traffic against a server with this assertion
> added took about 2 hours to panic, so if it turns out we need anything
> further it should be pretty quick.
> 
> #4 list
> 2816 * timer and remember to restart (more output or 
> persist).
> 2817 * If there is more data to be acked, restart 
> retransmit
> 2818 * timer, using current (possibly backed-off) value.
> 2819 */
> 2820if (th->th_ack == tp->snd_max) {
> 2821tcp_timer_activate(tp, TT_REXMT, 0);
> 2822needoutput = 1;
> 2823} else if (!tcp_timer_active(tp, TT_PERSIST))
> 2824tcp_timer_activate(tp, TT_REXMT, 
> tp->t_rxtcur);

Bah, this is just a bug in my assertion.  Rather than having a separate
tcp_timer_deactivate() routine, a delta of 0 passed to tcp_timer_activate()
means "stop the timer".  My assertions were incorrect and need to exclude the
stop case.  Here is an updated patch (or you can just fix yours locally):

Index: tcp_timer.c
===
--- tcp_timer.c (revision 273219)
+++ tcp_timer.c (working copy)
@@ -869,10 +869,16 @@ tcp_timer_activate(struct tcpcb *tp, int timer_typ
case TT_REXMT:
t_callout = &tp->t_timers->tt_rexmt;
f_callout = tcp_timer_rexmt;
+   if (callout_active(&tp->t_timers->tt_persist) &&
+   delta != 0)
+   panic("scheduling retransmit with persist 
active");
break;
case TT_PERSIST:
t_callout = &tp->t_timers->tt_persist;
f_callout = tcp_timer_persist;
+   if (callout_active(&tp->t_timers->tt_rexmt) &&
+   delta != 0)
+   panic("scheduling persist with retransmit 
active");
break;
case TT_KEEP:
t_callout = &tp->t_timers->tt_keep;


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: ixgbe(4) spin lock held too long

2014-10-16 Thread John Baldwin
On Saturday, October 11, 2014 2:19:19 am Jason Wolfe wrote:
> On Fri, Oct 10, 2014 at 8:53 AM, John Baldwin  wrote:
> 
> > On Thursday, October 09, 2014 02:31:32 PM Jason Wolfe wrote:
> > > On Wed, Oct 8, 2014 at 12:29 PM, John Baldwin  wrote:
> > > > My only other thought is if a direct timeout routine ran for a long
> > time.
> > > >
> > > > I just committed a change to current that can let you capture KTR
> > traces
> > > > of
> > > > callout routines for use with schedgraph (r272757).  Unfortunately,
> > > > enabling KTR_SCHED can be slow.  If you are up for trying it, I do
> > think
> > > > that
> > > > building a kernel with KTR and KTR_SCHED enabled 
(KTR_COMPILE=KTR_SCHED
> > > > and
> > > > KTR_MASK=KTR_SCHED) with the kernel part of the commit I referenced
> > above
> > > > applied is the best bet to figure out why it is spinning so long.  If
> > you
> > > > can
> > > > try that (and if it doesn't slow things down too much) and get a panic
> > > > with
> > > > those options enabled, then capture the output of
> > > > 'ktrdump -e /path/to/kernel -m /var/crash/vmcore.X -ct', we can use
> > > > Src/tools/sched/schedgraph.py to look at that output to get a picture
> > of
> > > > what
> > > > was going on just prior to the crash.
> > > >
> > > > --
> > > > John Baldwin
> > >
> > > As luck would have it.. caught one of the boxes with the new KTR
> > > code/options after only an hour.  Crashed in the same way w tid 13
> > and
> > > looking the same in callout_process
> > >
> > > spin lock 0x81262d00 (callout) held by 0xf800151fe000 (tid
> > > 13) too long
> > > spin lock 0x81262d00 (callout) held by 0xf800151fe000 (tid
> > > 13) too long
> > >
> > > #4 0x8070d6fa in callout_process (now=7915682202423) at
> > > /usr/src/sys/kern/kern_
> > > timeout.c:490
> > >
> > > The ktrdump oddly only seems to have the last 702, though
> > debug.ktr.entries
> > > is set to 1024.  It appears the buffer may also start after 13 had
> > > already hung?  I've bumped debug.ktr.entries up in case we don't have
> > > enough history here.
> > >
> > > http://nitrology.com/ktrdump-spinlock.txt
> >
> > Hmm, schedgraph.py chokes on this, but it's a bit interesting.  It seems
> > that
> > in this time sample, CPUs 1, 2, 4, and 5 were constantly running ixgbe
> > interrupt handlers.  No actual thread state changes are logged (which is
> > why
> > schedgraph choked).
> >
> > Try setting the sysctl 'net.inet.tcp.per_cpu_timers' to 1 and see if that
> > makes a difference.  I'm guessing they are all contending on the default
> > callout lock which is causing your headache.
> >
> > --
> > John Baldwin
> >
> 
> net.inet.tcp.per_cpu_timers=1 triggered some other fun :)  Saw this same
> panic across a handful of machines:
> 
> panic: tcp_setpersist: retransmit pending
> cpuid = 3
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame
> 0xfe1f9e9ab800
> panic() at panic+0x155/frame 0xfe1f9e9ab880
> tcp_setpersist() at tcp_setpersist+0xa3/frame 0xfe1f9e9ab8b0
> tcp_timer_persist() at tcp_timer_persist+0x176/frame 0xfe1f9e9ab8e0
> softclock_call_cc() at softclock_call_cc+0x1ce/frame 0xfe1f9e9ab9e0
> softclock() at softclock+0x44/frame 0xfe1f9e9aba00
> intr_event_execute_handlers() at intr_event_execute_handlers+0x83/frame
> 0xfe1f9e9aba30
> ithread_loop() at ithread_loop+0x96/frame 0xfe1f9e9aba70
> fork_exit() at fork_exit+0x71/frame 0xfe1f9e9abab0
> fork_trampoline() at fork_trampoline+0xe/frame 0xfe1f9e9abab0
> --- trap 0, rip = 0, rsp = 0xfe1f9e9abb70, rbp = 0 ---
> 
> (kgdb) up 3
> #3  0x808467d3 in tcp_setpersist (tp=) at
> /usr/src/sys/netinet/tcp_output.c:1619
> 1619panic("tcp_setpersist: retransmit pending");
> (kgdb) list
> 1614int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
> 1615int tt;
> 1616
> 1617tp->t_flags &= ~TF_PREVVALID;
> 1618if (tcp_timer_active(tp, TT_REXMT))
> 1619panic("tcp_setpersist: retransmit pending");
> 1620/*
> 1621 * Start/restart persistance timer.
> 1622 */
> 1623   

Re: ixgbe(4) spin lock held too long

2014-10-10 Thread John Baldwin
On Thursday, October 09, 2014 02:31:32 PM Jason Wolfe wrote:
> On Wed, Oct 8, 2014 at 12:29 PM, John Baldwin  wrote:
> > My only other thought is if a direct timeout routine ran for a long time.
> > 
> > I just committed a change to current that can let you capture KTR traces
> > of
> > callout routines for use with schedgraph (r272757).  Unfortunately,
> > enabling KTR_SCHED can be slow.  If you are up for trying it, I do think
> > that
> > building a kernel with KTR and KTR_SCHED enabled (KTR_COMPILE=KTR_SCHED
> > and
> > KTR_MASK=KTR_SCHED) with the kernel part of the commit I referenced above
> > applied is the best bet to figure out why it is spinning so long.  If you
> > can
> > try that (and if it doesn't slow things down too much) and get a panic
> > with
> > those options enabled, then capture the output of
> > 'ktrdump -e /path/to/kernel -m /var/crash/vmcore.X -ct', we can use
> > Src/tools/sched/schedgraph.py to look at that output to get a picture of
> > what
> > was going on just prior to the crash.
> > 
> > --
> > John Baldwin
> 
> As luck would have it.. caught one of the boxes with the new KTR
> code/options after only an hour.  Crashed in the same way w tid 13 and
> looking the same in callout_process
> 
> spin lock 0x81262d00 (callout) held by 0xf800151fe000 (tid
> 13) too long
> spin lock 0x81262d00 (callout) held by 0xf800151fe000 (tid
> 13) too long
> 
> #4 0x8070d6fa in callout_process (now=7915682202423) at
> /usr/src/sys/kern/kern_
> timeout.c:490
> 
> The ktrdump oddly only seems to have the last 702, though debug.ktr.entries
> is set to 1024.  It appears the buffer may also start after 13 had
> already hung?  I've bumped debug.ktr.entries up in case we don't have
> enough history here.
> 
> http://nitrology.com/ktrdump-spinlock.txt

Hmm, schedgraph.py chokes on this, but it's a bit interesting.  It seems that 
in this time sample, CPUs 1, 2, 4, and 5 were constantly running ixgbe 
interrupt handlers.  No actual thread state changes are logged (which is why 
schedgraph choked).

Try setting the sysctl 'net.inet.tcp.per_cpu_timers' to 1 and see if that 
makes a difference.  I'm guessing they are all contending on the default
callout lock which is causing your headache.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: ixgbe(4) spin lock held too long

2014-10-08 Thread John Baldwin
On Wednesday, October 08, 2014 10:56:56 AM Jason Wolfe wrote:
> On Tue, Oct 7, 2014 at 11:28 AM, John Baldwin  wrote:
> > On Tuesday, October 07, 2014 2:06:42 pm Jason Wolfe wrote:
> > > Hey John,
> > > 
> > > Happy to do this, but the pool of boxes is about 500 large, which is the
> > > reason I'm able to see a crash every day or so.  I've pulled a portion
> > > of them out of service to switch them over to a a kernel with WITNESS 
> > > options,
> > > might that assist in the same manner if we catch a crash there?  If you're
> > > thinking the KTR traces are required to diagnose this I'll get started on
> > > that though.  I've also cut 10.1-RC1 with some ixgbe fixes for future use,
> > > but noticed kern_timeout.c has been static for the past 8 months. 
> > > Thanks again for your help.
> > 
> > I don't think WITNESS will really help in this case.  ixgbe just happens to
> > be trying to schedule a timer and running into a problem that the timer is
> > spinning on another CPU.  While the kern_timeout code hasn't changed in a
> > while, it may be an edge case provoked by another bug elsewhere.  It's also
> > possible I suppose that a lot of callouts are scheduled for some reason.
> > 
> > Hmm, maybe we can still figure this out from your dumps.  We can still get 
> > to
> > '*cc' via an alternate route.  First, IIRC, the thread is an idle thread,
> > so its name should indicate which CPU it is for:
> > 
> > % sudo procstat -at  | grep 13
> > 
> >11 13 idle idle: cpu0 0  255 run -
> > 
> > (Hmm, I bet that is always CPU 0).  That means we can find 'cc' by doing
> > 'p cc_cpu[0]' in kgdb.  Can you do that from the last core dump along with
> > 'p last' in frame 4?
> > 
> > Grrr.  Looks like cc->cc_lastscan is already overwritten by the time you've
> > crashed, so we can't work back to the original value of 'firstb'.
> > 
> > However, there is a sysctl you can invoke to get callout stats.  Maybe run
> > 'sysctl kern.callout_stats=1' on one of your existing servers and then
> > check dmesg.  On an idle machine sitting here running stable/10 I have 86
> > callout scheduled currently:
> 
> Correct 13 is always CPU0, on our end at least.  Jackpot on the
> cc_cpu[0] also, here are cores across 3 different 'spin lock callout held
> by tid 13 too long' machines.  Callouts on an idle machine are in line
> with yours at 90, and a somewhat idle box pushing ~500Mb/s has in the
> 3000-4000 range.

Unfortunately, the thing I wanted out of cc_cpu[0] is the previous value of
cc_lastscan so I could work back to what 'firstb' is and see if it sticks in
the loop forever for some reason.  However, cc_lastscan is overwritten by the
time this cores.  Still, we can still tell something perhaps:

> (kgdb) tid 13
> [Switching to thread 40 (Thread 13)]#0  0x80ac39b8 in
> cpustop_handler ()
> at /usr/src/sys/amd64/amd64/mp_machdep.c:1432
> 1432savectx(&stoppcbs[cpu]);
> (kgdb) up 4
> #4  0x80733f79 in callout_process (now=3549882673091690) at
> /usr/src/sys/kern/kern_timeout.c:443
> 443 tmp = LIST_FIRST(sc);
> (kgdb) list
> 438 }
> 439
> 440 /* Iterate callwheel from firstb to nowb and then up to
> lastb. */
> 441 do {
> 442 sc = &cc->cc_callwheel[firstb & callwheelmask];
> 443 tmp = LIST_FIRST(sc);
> 444 while (tmp != NULL) {
> 445 /* Run the callout if present time within
> allowed. */
> 446 if (tmp->c_time <= now) {
> 447 /*
> (kgdb) p cc_cpu[0]
> $1 = {cc_lock = {lock_object = {lo_name = 0x80ccbec7 "callout",
> lo_flags = 720896, lo_data = 0, lo_witness = 0x0},
> mtx_lock = 4}, cc_exec_entity = {{cc_next = 0x0, cc_curr = 0x0,
> ce_migration_func = 0, ce_migration_arg = 0x0,
>   ce_migration_cpu = 64, ce_migration_time = 0, ce_migration_prec = 0,
> cc_cancel = false, cc_waiting = false}, {
>   cc_next = 0x812a7af0, cc_curr = 0x0, ce_migration_func = 0,
> ce_migration_arg = 0x0, ce_migration_cpu = 64,
>   ce_migration_time = 0, ce_migration_prec = 0, cc_cancel = false,
> cc_waiting = false}},
>   cc_callout = 0xfe69f000, cc_callwheel = 0xfe7c1000,
> cc_expireq = {tqh_first = 0x0,
> tqh_last = 0x812a0510}, cc_callfree = {slh_first =
&g

Re: [PATCH] Only lock send buffer in sopoll() if needed

2014-10-06 Thread John Baldwin
On Sunday, October 05, 2014 08:08:17 AM Robert N. M. Watson wrote:
> I'm not convinced that the race with SBS_CANTSENDMORE is OK, even though I
> really want to write that it is. The risk here is that we miss an
> asynchronous disconnect event, and that the thread then blocks even though
> an event is pending, which is a nasty turn of events. We might want to dig
> about a bit and see whether this case 'matters' -- e.g., that there are
> important cases where a test for writability might need to catch
> SBS_CANTRCVMORE but SBS_CANTSENDMORE is not simultaneously set.

I thought about this a bit more and this would be ok so long as the code that 
does a selwakeup() on so_rcv does so after setting SBS_CANTSENDMORE.  However, 
checking both soisdisconnecting() and soisdisconnected() shows that they call 
sorwakeup() and unlock the sb_rcv lock before setting the flag, so the race is 
not ok.

> Could you say a bit more about the performance benefits of this change -- is
> it substantial? If so, we might need to add a new socket-buffer flag along
> the lines of SB[RS]_DISCONNECTED that is 'broadcast' to both socket buffers
> on certain events. Doing so might require rejiggering elsewhere by causing
> additional locks to need to be held, possibly out of order.

I had posted this to an older thread on current@ where Luigi asked about the 
overhead of locking both for sopoll().  I never got a reply from Luigi (and no 
one else responded on that thread).  I found this again recently in an old 
tree and was curious what other folks thought of the idea.  I do not have any 
workloads I am working with where this is a factor.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: ixgbe(4) spin lock held too long

2014-10-03 Thread John Baldwin
On Thursday, October 02, 2014 06:40:21 PM Jason Wolfe wrote:
> On Wed, Sep 10, 2014 at 8:24 AM, John Baldwin  wrote:
> > On Monday, September 08, 2014 03:34:02 PM Eric van Gyzen wrote:
> > > On 09/08/2014 15:19, Sean Bruno wrote:
> > > > On Mon, 2014-09-08 at 12:09 -0700, Sean Bruno wrote:
> > > >> This sort of looks like the hardware failed to respond to us in time?
> > > >> Too busy?
> > > >> 
> > > >> sean
> > > > 
> > > > This seems to be affecting my 10/stable machines from 15Aug2014.
> > > > 
> > > > Not a lot of churn in the code so I don't think this is new.  The
> > > > afflicted machines, quite a few by my count, appear to have not been
> > > > super busy (pushing about 200 Mb/s).
> > > > 
> > > > sean
> > > > 
> > > >> panic: spin lock held too long
> > > >> 
> > > >> GNU gdb 6.1.1 [FreeBSD]
> > > >> Copyright 2004 Free Software Foundation, Inc.
> > > >> GDB is free software, covered by the GNU General Public License, and
> > 
> > you
> > 
> > > >> are
> > > >> welcome to change it and/or distribute copies of it under certain
> > > >> conditions.
> > > >> Type "show copying" to see the conditions.
> > > >> There is absolutely no warranty for GDB.  Type "show warranty" for
> > > >> details.
> > > >> This GDB was configured as "amd64-marcel-freebsd"...
> > > >> 
> > > >> Unread portion of the kernel message buffer:
> > > >> spin lock 0x812a0400 (callout) held by 0xf800151fe000
> > > >> (tid
> > > >> 13) too long
> > > 
> > > TID 13 is usually a kernel idle thread, which would seem to indicate
> > > a dangling lock.  Can you enable WITNESS (without WITNESS_SKIPSPIN) on
> > > this box?
> > 
> > Also, do 'tid 13' and 'bt' in kgdb to see what the thread holding the
> > lock
> > was doing.
> > 
> > --
> > John Baldwin
> 
> Sorry for the delay, I've been hoping to catch a crash on one of our
> machines running the WITNESS kernel.  Our luck seems to be in short supply,
> the machines running sans WITNESS crash in the same manner at a rate of 2/3
> a day.  I may have to grow the pool to catch this, but in the meantime here
> is the bt/tid.
> 
> (kgdb) bt 103
> #0  0x80ac39b8 in cpustop_handler () at
> /usr/src/sys/amd64/amd64/mp_machdep.c:1432
> #1  0x80ac397f in ipi_nmi_handler () at
> /usr/src/sys/amd64/amd64/mp_machdep.c:1417
> #2  0x80ad2d5a in trap (frame=0x81242830) at
> /usr/src/sys/amd64/amd64/trap.c:190
> #3  0x80ab93c3 in nmi_calltrap () at
> /usr/src/sys/amd64/amd64/exception.S:505
> #4  0x80734066 in callout_process (now=3278964590047193) at
> /usr/src/sys/kern/kern_timeout.c:487
> (kgdb) tid 13
> [Switching to thread 40 (Thread 13)]#0  0x80ac39b8 in
> cpustop_handler () at /usr/src/sys/amd64/amd64/mp_machdep.c:1432
> 1432savectx(&stoppcbs[cpu]);

Ok, so it is processing C_DIRECT callouts.  Can you go to frame 4 and see 
where it is at?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[PATCH] Only lock send buffer in sopoll() if needed

2014-09-30 Thread John Baldwin
Right now sopoll() always locks both socket buffers.  The receive socket 
buffer lock is always needed, but the send socket buffer lock is only needed 
while polling for writing (there is a potential test of SBS_CANTSENDMORE 
without the lock, but I think this might be ok).  What do folks think?

Index: uipc_socket.c
===
--- uipc_socket.c   (revision 272305)
+++ uipc_socket.c   (working copy)
@@ -3003,7 +3003,12 @@ sopoll_generic(struct socket *so, int events, stru
 {
int revents = 0;
 
-   SOCKBUF_LOCK(&so->so_snd);
+   if (events & (POLLOUT | POLLWRNORM))
+   SOCKBUF_LOCK(&so->so_snd);
+   /*
+* The so_rcv lock doubles as SOCK_LOCK so it it is needed for
+* all requests.
+*/
SOCKBUF_LOCK(&so->so_rcv);
if (events & (POLLIN | POLLRDNORM))
if (soreadabledata(so))
@@ -3038,7 +3043,8 @@ sopoll_generic(struct socket *so, int events, stru
}
 
SOCKBUF_UNLOCK(&so->so_rcv);
-   SOCKBUF_UNLOCK(&so->so_snd);
+   if (events & (POLLOUT | POLLWRNORM))
+   SOCKBUF_UNLOCK(&so->so_snd);
return (revents);
 }

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: getting factory MAC address

2014-09-30 Thread John Baldwin
On Sunday, September 21, 2014 8:13:17 am clutton wrote:
> Hi list. I'm relatively new here. So, Hi. :)
> 
> I don't know how to read the real MAC, I mean the one which is burned in
> ROM. Is it possible from the user space? I've ported GNU macchanger and
> it's the last non ported feature.
> 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187363
> 
> In Linux it can be done like this:
> https://github.com/alobbs/macchanger/blob/master/src/netinfo.c#L118

There is currently not a way to do this, though there is a patch to add this 
functionality currently on the lists.  I believe gnn@ is looking at that 
patch?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Choice of private ioctl approach

2014-09-30 Thread John Baldwin
On Monday, September 29, 2014 8:36:22 am Andrew Rybchenko wrote:
> Hello,
> 
> we need to add private ioctl to the driver sfxge(4) to make FW update,
> do internal diagnostics commands etc.
> We see at least two approaches in other drivers:
>   1. SIOCGPRIVATE_0/ SIOCGPRIVATE_1 on net device
>   2. dedicated char device with its own ioctl's
> 
> Is there any recommendations on which way is preferred?

I would be inclined towards 2).  It is more flexible if you need to add more 
custom ioctls in the future.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


[PATCH] Convert netipsec/key from timeout(9) to callout(9)

2014-09-24 Thread John Baldwin
This patch converts a timer in the IPSEC code from using timeout() to using 
callout() instead.  Can someone test it or review it?

http://people.freebsd.org/~jhb/patches/ipsec_callout.patch

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: ixgbe(4) spin lock held too long

2014-09-10 Thread John Baldwin
On Monday, September 08, 2014 03:34:02 PM Eric van Gyzen wrote:
> On 09/08/2014 15:19, Sean Bruno wrote:
> > On Mon, 2014-09-08 at 12:09 -0700, Sean Bruno wrote:
> >> This sort of looks like the hardware failed to respond to us in time?
> >> Too busy?
> >> 
> >> sean
> > 
> > This seems to be affecting my 10/stable machines from 15Aug2014.
> > 
> > Not a lot of churn in the code so I don't think this is new.  The
> > afflicted machines, quite a few by my count, appear to have not been
> > super busy (pushing about 200 Mb/s).
> > 
> > sean
> > 
> >> panic: spin lock held too long
> >> 
> >> GNU gdb 6.1.1 [FreeBSD]
> >> Copyright 2004 Free Software Foundation, Inc.
> >> GDB is free software, covered by the GNU General Public License, and you
> >> are
> >> welcome to change it and/or distribute copies of it under certain
> >> conditions.
> >> Type "show copying" to see the conditions.
> >> There is absolutely no warranty for GDB.  Type "show warranty" for
> >> details.
> >> This GDB was configured as "amd64-marcel-freebsd"...
> >> 
> >> Unread portion of the kernel message buffer:
> >> spin lock 0x812a0400 (callout) held by 0xf800151fe000 (tid
> >> 100003) too long
> 
> TID 13 is usually a kernel idle thread, which would seem to indicate
> a dangling lock.  Can you enable WITNESS (without WITNESS_SKIPSPIN) on
> this box?

Also, do 'tid 13' and 'bt' in kgdb to see what the thread holding the lock 
was doing.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: [Bug 193246] Bug in IPv6 multicast join(), uncovered by Jenkins

2014-09-04 Thread John Baldwin
On Wednesday, September 03, 2014 06:39:53 AM Craig Rodrigues wrote:
> On Wed, Sep 3, 2014 at 5:19 AM, Andrey V. Elsukov  wrote:
> > On 03.09.2014 14:05, bugzilla-nore...@freebsd.org wrote:
> > 
> > Hi,
> > 
> > you said that this code works in linux. I looked in the linux kernel
> > source, and I think it should return EINVAL too.
> > 
> > net/ipv6/mcast.c:ipv6_sock_mc_join:
> >  154 if (!ipv6_addr_is_multicast(addr))
> >  155 return -EINVAL;
> 
> The code does work in Linux.  However, you need to look at the
> JDK source, not the Linux kernel source.
> 
> In this file:
> http://hg.openjdk.java.net/jdk7/jdk7/jdk/file/9b8c96f96a0f/src/solaris/nativ
> e/java/net/PlainDatagramSocketImpl.c
> 
> in the mcast_join_leave() function, there are two code paths: (1)
> Linux, (2) Solaris.
> 
> It looks like on Solaris, they support IPv4-mapped multicast addresses for
> IPV6, and things work when they create an IPv6 socket, and then put an
> IPv4-mapped multicast address in it.  For Linux, they have specific
> code paths in that function which seem to force creating an IPv4
> socket.

>From looking at the source, it doesn't look like the Linux workaround (using 
IP_ADD_MEMBERSHIP on an AF_INET6 socket) will work on FreeBSD.  I'm not sure 
how hard it would be to fix in6_mcast.c to support IPv4 groups.  bms@ might 
know.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: [PATCH] Packet loss when 'control' messages are present with large data (sendmsg(2))

2014-08-26 Thread John Baldwin
On Tuesday, August 26, 2014 11:05:12 am Alan Somers wrote:
> On Mon, Aug 25, 2014 at 1:52 PM, John Baldwin  wrote:
> > On Friday, August 22, 2014 01:34:28 PM Harald Schmalzbauer wrote:
> >>  BezĂĽglich Yuri's Nachricht vom 02.09.2013 06:54 (localtime):
> >> > Please check in this patch:
> >> > http://www.freebsd.org/cgi/query-pr.cgi?pr=181741
> >> > Please MFC into 9.X
> >> >
> >> > Description of the problem is within PR.
> >> >
> >> > Thanks,
> >> > Yuri
> >>
> >> Hello,
> >>
> >> I guess this fix should make it into 10.1.
> >> Can someone check please?
> >
> > A fix has to make into HEAD first.  I've cc'd Alan who responded to the bug.
> > Alan, note that glebius@ already committed the test case to HEAD a while 
> > ago.
> 
> First, Gleb's testcase needs to be converted to ATF.  This would be a
> good exercise for anybody who's new to ATF and wants some practice.
> Anybody interested?

While that would be nice, I don't know that it's quite fair to require the
test to be converted before working on a possible fix.  The existing test
doesn't seem that hard to run by hand:

% cd work/freebsd/head/tools/regression/sockets/unix_passfd/
% make
...
% > ./unix_passfd 
beginning test1-simplesendfd
test1-simplesendfd passed
...
beginning test8-rigths+creds+payload
unix_passfd: test8-rigths+creds+payload: recvmsg: 24 bytes received

I only say this because in the bug followup you seemed to have described a
possible solution so it seems that you would be able to develop a fix quicker
than other folks since you are already familiar with the issues involved.
(Also, you've fixed other related issues recently.)

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: [PATCH] Packet loss when 'control' messages are present with large data (sendmsg(2))

2014-08-25 Thread John Baldwin
On Friday, August 22, 2014 01:34:28 PM Harald Schmalzbauer wrote:
>  BezĂĽglich Yuri's Nachricht vom 02.09.2013 06:54 (localtime):
> > Please check in this patch:
> > http://www.freebsd.org/cgi/query-pr.cgi?pr=181741
> > Please MFC into 9.X
> > 
> > Description of the problem is within PR.
> > 
> > Thanks,
> > Yuri
> 
> Hello,
> 
> I guess this fix should make it into 10.1.
> Can someone check please?

A fix has to make into HEAD first.  I've cc'd Alan who responded to the bug.  
Alan, note that glebius@ already committed the test case to HEAD a while ago.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: zero window and persist timer not set

2014-08-11 Thread John Baldwin
 tcpcb*)(0xfe02ae289b70))->snd_una
> $7 = 3306392292
> (kgdb) print ((struct tcpcb*)(0xfe02ae289b70))->snd_wnd
> $8 = 0
> (kgdb) print ((struct tcpcb*)(0xfe02ae289b70))->snd_cwnd
> $9 = 4380
> (kgdb) print ((struct
> tcpcb*)(0xfe02ae289b70))->t_timers->tt_rexmt->c_flags
> $11 = 16
> (kgdb) print ((struct
> tcpcb*)(0xfe02ae289b70))->t_timers->tt_persist->c_flags
> $12 = 16
> (kgdb) print ((struct
> tcpcb*)(0xfe02ae289b70))->t_timers->tt_keep->c_flags
> $13 = 22
> (kgdb) print ((struct
> tcpcb*)(0xfe02ae289b70))->t_timers->tt_2msl->c_flags
> $14 = 16
> (kgdb) print ((struct
> tcpcb*)(0xfe02ae289b70))->t_timers->tt_delack->c_flags
> $15 = 16
> (kgdb) print ((struct
> tcpcb*)(0xfe02ae289b70))->t_inpcb->inp_socket.so_snd.sb_cc
> $16 = 1656
> 
> There is zero window, data in the socket buffer, and the persist timer is
> not set.
> 
> My proposed fix follows.  If you send a 0-length packet, but there is data
> is the socket buffer, and neither the rexmt or persist timer is already
> set, then activate the persist timer.
> 
> --- sys/netinet/tcp_output.c(revision 269644)
> +++ sys/netinet/tcp_output.c(working copy)
> @@ -1290,7 +1290,12 @@
> tp->t_rxtshift = 0;
> }
> tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur);
> -   }
> +   } else if (len == 0 && so->so_snd.sb_cc &&
> +  !tcp_timer_active(tp, TT_REXMT) &&
> +  !tcp_timer_active(tp, TT_PERSIST)) {
> +   tp->t_rxtshift = 0;
> +   tcp_setpersist(tp);
> +   }
> 
> } else {
> /*
>  * Persist case, update snd_max but since we are in
> 
> Let me know any comments.  Thanks,

I think your patch is correct, but please file this as a bug report so we can 
hopefully wrangle another person to review this.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Multicast races on vlan & lagg

2014-08-11 Thread John Baldwin
On Thursday, July 31, 2014 10:21:22 am Alexander Motin wrote:
> Hi.
> 
> Doing some tests on FreeNAS (FreeBSD 9.2+) I hit series of panic during
> active interfaces manipulation in some scenarios including multicast and
> several vlans on top of lagg. I am not ready to reproduce the full
> environment on head, but the code looks equal, so probably the bugs.
> 
> I've made a patch to improve locking in that area, that seems fixes the
> problems: http://people.freebsd.org/~mav/mcast_vlan_lagg.patch
> 
> Could somebody with more experience in the area please take a look?

Can't you use IF_ADDR_RLOCK instead of IF_ADDR_WLOCK?  Also, strictly speaking 
it might be best to use if_maddr_rlock() instead of directly using 
IF_ADDR_RLOCK().

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: NFS client READ performance on -current

2014-08-11 Thread John Baldwin
On Saturday, July 19, 2014 1:28:19 pm John Baldwin wrote:
> On Thursday 17 July 2014 19:45:56 Julian Elischer wrote:
> > On 7/15/14, 10:34 PM, John Baldwin wrote:
> > > On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
> > >> Yonghyeon Pyun wrote:
> > >>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> > >>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> > >>>>> John Baldwin wrote:
> > >>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > >>>>>>> Russell L. Carter wrote:
> > >>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
> > >>>>>>>>> Could you please post the dmesg stuff for the network
> > >>>>>>>>> interface,
> > >>>>>>>>> so I can tell what driver is being used? I'll take a look
> > >>>>>>>>> at
> > >>>>>>>>> it,
> > >>>>>>>>> in case it needs to be changed to use m_defrag().
> > >>>>>>>> 
> > >>>>>>>> em0:  port
> > >>>>>>>> 0xd020-0xd03f
> > >>>>>>>> mem
> > >>>>>>>> 0xfe4a-0xfe4b,0xfe48-0xfe49 irq 44 at
> > >>>>>>>> device 0.0
> > >>>>>>>> on
> > >>>>>>>> pci2
> > >>>>>>>> em0: Using an MSI interrupt
> > >>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
> > >>>>>>>> 001.07 [2323] netmap_attach success for em0
> > >>>>>>>> tx
> > >>>>>>>> 1/1024
> > >>>>>>>> rx
> > >>>>>>>> 1/1024 queues/slots
> > >>>>>>>> 
> > >>>>>>>> This is one of those dual nic cards, so there is em1 as
> > >>>>>>>> well...
> > >>>>>>> 
> > >>>>>>> Well, I took a quick look at the driver and it does use
> > >>>>>>> m_defrag(),
> > >>>>>>> but
> > >>>>>>> I think that the "retry:" label it does a goto after doing so
> > >>>>>>> might
> > >>>>>>> be in
> > >>>>>>> the wrong place.
> > >>>>>>> 
> > >>>>>>> The attached untested patch might fix this.
> > >>>>>>> 
> > >>>>>>> Is it convenient to build a kernel with this patch applied
> > >>>>>>> and then
> > >>>>>>> try
> > >>>>>>> it with TSO enabled?
> > >>>>>>> 
> > >>>>>>> rick
> > >>>>>>> ps: It does have the transmit segment limit set to 32. I have
> > >>>>>>> no
> > >>>>>>> idea if
> > >>>>>>> 
> > >>>>>>>  this is a hardware limitation.
> > >>>>>> 
> > >>>>>> I think the retry is not in the wrong place, but the overhead
> > >>>>>> of all
> > >>>>>> those
> > >>>>>> pullups is apparently quite severe.
> > >>>>> 
> > >>>>> The m_defrag() call after the first failure will just barely
> > >>>>> squeeze
> > >>>>> the just under 64K TSO segment into 32 mbuf clusters. Then I
> > >>>>> think any
> > >>>>> m_pullup() done during the retry will allocate an mbuf
> > >>>>> (at a glance it seems to always do this when the old mbuf is a
> > >>>>> cluster)
> > >>>>> and prepend that to the list.
> > >>>>> --> Now the list is > 32 mbufs again and the
> > >>>>> bus_dmammap_load_mbuf_sg()
> > >>>>> 
> > >>>>>  will fail again on the retry, this time fatally, I think?
> > >>>>> 
> > >>>>> I can't see any reason to re-do all the stuff using m_pullup()
> > >>>>> and Russell
> > >>>>> reported that moving the "retry:" fixed his p

Re: Question about tcp keep-alive timer

2014-08-11 Thread John Baldwin
On Sunday, August 10, 2014 4:09:55 pm David Bar wrote:
> Hi
> 
> 
> (Forgive me if this topic has been discussed before. I didn't find it in
> the archives)
> 
> In tcp_input(), when a packet is received on an established socket the code
> re-arms the keep-alive timer, for each packet.
> Here:
> 
https://svnweb.freebsd.org/base/release/10.0.0/sys/netinet/tcp_input.c?revision=260789&view=markup#l1518
> 
> Isn't this a waste to do this for each packet?
> 
> The setting of the timer when the connection becomes established should
> suffice if there was a small change in tcp_timer_keep().
> If tcp_timer_keep() would first checks if tp->t_rcvtime is recent (newer
> than the TT_KEEPIDLE time), and would just re-arm the timer to go off
> later, then we would keep the same functionality.
> 
> I can't think of any downsides to this idea. Any good reason why this
> hasn't been done before?

I think it is just a tradeoff between having the timer run at all.  However, I 
suspect that with high packet rates it probably is cheaper to have the timer 
run periodically and reschedule itself if it notices it isn't needed as you 
suggested.  Do you want to write up a patch and test it?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: NFS client READ performance on -current

2014-07-21 Thread John Baldwin
On Sunday 20 July 2014 11:56:38 Garrett Wollman wrote:
> In article <201407151034.54681@freebsd.org>, j...@freebsd.org writes:
> >Hmm, I am surprised by the m_pullup() behavior that it doesn't just
> >notice that the first mbuf with a cluster has the desired data already
> >and returns without doing anything.
> 
> The specification of m_pullup() is that it returns a *writable* mbuf
> (and thus also that the "length" provided is less than MHLEN).
> Clusters are read-only.

Well, my patch to if_em is definitely correct then as it doesn't want a 
writable mbuf, it just wants a certain portion of the mbuf chain to be stored 
in the first segment.

Also, you should fix the manpage.  It doesn't mention writable at all, just 
that the data is accessible so that mtod() will work:


   m_pullup(mbuf, len)
   Arrange that the first len bytes of an mbuf chain are contiguous
   and lay in the data area of mbuf, so they are accessible with
   mtod(mbuf, type).  It is important to remember that this may
   involve reallocating some mbufs and moving data so all pointers
   referencing data within the old mbuf chain must be recalculated or
   made invalid.  Return the new mbuf chain on success, NULL on fail-
   ure (the mbuf chain is freed in this case).  Note: It does not
   allocate any mbuf clusters, so len must be less than or equal to
   MHLEN.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: NFS client READ performance on -current

2014-07-20 Thread John Baldwin
On Thursday 17 July 2014 19:45:56 Julian Elischer wrote:
> On 7/15/14, 10:34 PM, John Baldwin wrote:
> > On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
> >> Yonghyeon Pyun wrote:
> >>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> >>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> >>>>> John Baldwin wrote:
> >>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> >>>>>>> Russell L. Carter wrote:
> >>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
> >>>>>>>>> Could you please post the dmesg stuff for the network
> >>>>>>>>> interface,
> >>>>>>>>> so I can tell what driver is being used? I'll take a look
> >>>>>>>>> at
> >>>>>>>>> it,
> >>>>>>>>> in case it needs to be changed to use m_defrag().
> >>>>>>>> 
> >>>>>>>> em0:  port
> >>>>>>>> 0xd020-0xd03f
> >>>>>>>> mem
> >>>>>>>> 0xfe4a-0xfe4b,0xfe48-0xfe49 irq 44 at
> >>>>>>>> device 0.0
> >>>>>>>> on
> >>>>>>>> pci2
> >>>>>>>> em0: Using an MSI interrupt
> >>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
> >>>>>>>> 001.07 [2323] netmap_attach success for em0
> >>>>>>>> tx
> >>>>>>>> 1/1024
> >>>>>>>> rx
> >>>>>>>> 1/1024 queues/slots
> >>>>>>>> 
> >>>>>>>> This is one of those dual nic cards, so there is em1 as
> >>>>>>>> well...
> >>>>>>> 
> >>>>>>> Well, I took a quick look at the driver and it does use
> >>>>>>> m_defrag(),
> >>>>>>> but
> >>>>>>> I think that the "retry:" label it does a goto after doing so
> >>>>>>> might
> >>>>>>> be in
> >>>>>>> the wrong place.
> >>>>>>> 
> >>>>>>> The attached untested patch might fix this.
> >>>>>>> 
> >>>>>>> Is it convenient to build a kernel with this patch applied
> >>>>>>> and then
> >>>>>>> try
> >>>>>>> it with TSO enabled?
> >>>>>>> 
> >>>>>>> rick
> >>>>>>> ps: It does have the transmit segment limit set to 32. I have
> >>>>>>> no
> >>>>>>> idea if
> >>>>>>> 
> >>>>>>>  this is a hardware limitation.
> >>>>>> 
> >>>>>> I think the retry is not in the wrong place, but the overhead
> >>>>>> of all
> >>>>>> those
> >>>>>> pullups is apparently quite severe.
> >>>>> 
> >>>>> The m_defrag() call after the first failure will just barely
> >>>>> squeeze
> >>>>> the just under 64K TSO segment into 32 mbuf clusters. Then I
> >>>>> think any
> >>>>> m_pullup() done during the retry will allocate an mbuf
> >>>>> (at a glance it seems to always do this when the old mbuf is a
> >>>>> cluster)
> >>>>> and prepend that to the list.
> >>>>> --> Now the list is > 32 mbufs again and the
> >>>>> bus_dmammap_load_mbuf_sg()
> >>>>> 
> >>>>>  will fail again on the retry, this time fatally, I think?
> >>>>> 
> >>>>> I can't see any reason to re-do all the stuff using m_pullup()
> >>>>> and Russell
> >>>>> reported that moving the "retry:" fixed his problem, from what I
> >>>>> understood.
> >>>> 
> >>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all be
> >>>> nops in this
> >>>> case.  It seems the NIC would really like to have all those things
> >>>> in a single
> >>>> segment, but it is not required, so I agree that your patch is
> >>>> 

Re: NFS client READ performance on -current

2014-07-15 Thread John Baldwin
On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
> Yonghyeon Pyun wrote:
> > On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
> > > On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> > > > John Baldwin wrote:
> > > > > On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > > > > > Russell L. Carter wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 07/02/14 19:09, Rick Macklem wrote:
> > > > > > > 
> > > > > > > > Could you please post the dmesg stuff for the network
> > > > > > > > interface,
> > > > > > > > so I can tell what driver is being used? I'll take a look
> > > > > > > > at
> > > > > > > > it,
> > > > > > > > in case it needs to be changed to use m_defrag().
> > > > > > > 
> > > > > > > em0:  port
> > > > > > > 0xd020-0xd03f
> > > > > > > mem
> > > > > > > 0xfe4a-0xfe4b,0xfe48-0xfe49 irq 44 at
> > > > > > > device 0.0
> > > > > > > on
> > > > > > > pci2
> > > > > > > em0: Using an MSI interrupt
> > > > > > > em0: Ethernet address: 00:15:17:bc:29:ba
> > > > > > > 001.07 [2323] netmap_attach success for em0
> > > > > > > tx
> > > > > > > 1/1024
> > > > > > > rx
> > > > > > > 1/1024 queues/slots
> > > > > > > 
> > > > > > > This is one of those dual nic cards, so there is em1 as
> > > > > > > well...
> > > > > > > 
> > > > > > Well, I took a quick look at the driver and it does use
> > > > > > m_defrag(),
> > > > > > but
> > > > > > I think that the "retry:" label it does a goto after doing so
> > > > > > might
> > > > > > be in
> > > > > > the wrong place.
> > > > > > 
> > > > > > The attached untested patch might fix this.
> > > > > > 
> > > > > > Is it convenient to build a kernel with this patch applied
> > > > > > and then
> > > > > > try
> > > > > > it with TSO enabled?
> > > > > > 
> > > > > > rick
> > > > > > ps: It does have the transmit segment limit set to 32. I have
> > > > > > no
> > > > > > idea if
> > > > > > this is a hardware limitation.
> > > > > 
> > > > > I think the retry is not in the wrong place, but the overhead
> > > > > of all
> > > > > those
> > > > > pullups is apparently quite severe.
> > > > The m_defrag() call after the first failure will just barely
> > > > squeeze
> > > > the just under 64K TSO segment into 32 mbuf clusters. Then I
> > > > think any
> > > > m_pullup() done during the retry will allocate an mbuf
> > > > (at a glance it seems to always do this when the old mbuf is a
> > > > cluster)
> > > > and prepend that to the list.
> > > > --> Now the list is > 32 mbufs again and the
> > > > bus_dmammap_load_mbuf_sg()
> > > > will fail again on the retry, this time fatally, I think?
> > > > 
> > > > I can't see any reason to re-do all the stuff using m_pullup()
> > > > and Russell
> > > > reported that moving the "retry:" fixed his problem, from what I
> > > > understood.
> > > 
> > > Ah, I had assumed (incorrectly) that the m_pullup()s would all be
> > > nops in this
> > > case.  It seems the NIC would really like to have all those things
> > > in a single
> > > segment, but it is not required, so I agree that your patch is
> > > fine.
> > > 
> > 
> > I recall em(4) controllers have various limitation in TSO. Driver
> > has to update IP header to make TSO work so driver has to get a
> > writable mbufs.  bpf(4) consumers will see IP packet length is 0
> > after this change.  I think tcpdump has a compile time option to
> > guess correct IP packet length.  The firmware of controller also
> > should be able to access complete IP/TCP header in a single buffer.
> > I don't remember more details in TSO limitation but I guess you may
> > be able to get more details TSO limitation from publicly available
> > Intel data sheet.
> I think that the patch should handle this ok. All of the m_pullup()
> stuff gets done the first time. Then, if the result is more than 32
> mbufs in the list, m_defrag() is called to copy the chain. This should
> result in all the header stuff in the first mbuf cluster and the map
> call is done again with this list of clusters. (Without the patch,
> m_pullup() would allocate another prepended mbuf and make the chain
> more than 32mbufs again.)

Hmm, I am surprised by the m_pullup() behavior that it doesn't just
notice that the first mbuf with a cluster has the desired data already
and returns without doing anything.  That is, I'm surprised the first
statement in m_pullup() isn't just:

if (n->m_len >= len)
return (n);

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: NFS client READ performance on -current

2014-07-11 Thread John Baldwin
On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> John Baldwin wrote:
> > On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > > Russell L. Carter wrote:
> > > > 
> > > > 
> > > > On 07/02/14 19:09, Rick Macklem wrote:
> > > > 
> > > > > Could you please post the dmesg stuff for the network
> > > > > interface,
> > > > > so I can tell what driver is being used? I'll take a look at
> > > > > it,
> > > > > in case it needs to be changed to use m_defrag().
> > > > 
> > > > em0:  port
> > > > 0xd020-0xd03f
> > > > mem
> > > > 0xfe4a-0xfe4b,0xfe48-0xfe49 irq 44 at device 0.0
> > > > on
> > > > pci2
> > > > em0: Using an MSI interrupt
> > > > em0: Ethernet address: 00:15:17:bc:29:ba
> > > > 001.07 [2323] netmap_attach success for em0 tx
> > > > 1/1024
> > > > rx
> > > > 1/1024 queues/slots
> > > > 
> > > > This is one of those dual nic cards, so there is em1 as well...
> > > > 
> > > Well, I took a quick look at the driver and it does use m_defrag(),
> > > but
> > > I think that the "retry:" label it does a goto after doing so might
> > > be in
> > > the wrong place.
> > > 
> > > The attached untested patch might fix this.
> > > 
> > > Is it convenient to build a kernel with this patch applied and then
> > > try
> > > it with TSO enabled?
> > > 
> > > rick
> > > ps: It does have the transmit segment limit set to 32. I have no
> > > idea if
> > > this is a hardware limitation.
> > 
> > I think the retry is not in the wrong place, but the overhead of all
> > those
> > pullups is apparently quite severe.
> The m_defrag() call after the first failure will just barely squeeze
> the just under 64K TSO segment into 32 mbuf clusters. Then I think any
> m_pullup() done during the retry will allocate an mbuf
> (at a glance it seems to always do this when the old mbuf is a cluster)
> and prepend that to the list.
> --> Now the list is > 32 mbufs again and the bus_dmammap_load_mbuf_sg()
> will fail again on the retry, this time fatally, I think?
> 
> I can't see any reason to re-do all the stuff using m_pullup() and Russell
> reported that moving the "retry:" fixed his problem, from what I understood.

Ah, I had assumed (incorrectly) that the m_pullup()s would all be nops in this
case.  It seems the NIC would really like to have all those things in a single
segment, but it is not required, so I agree that your patch is fine.

> >  It would be interesting to test
> > the
> > following in addition to your change to see if it improves
> > performance
> > further:
> > 
> > Index: if_em.c
> > ===
> > --- if_em.c (revision 268495)
> > +++ if_em.c (working copy)
> > @@ -1959,7 +1959,9 @@ retry:
> > if (error == EFBIG && remap) {
> > struct mbuf *m;
> >  
> > -   m = m_defrag(*m_headp, M_NOWAIT);
> > +   m = m_collapse(*m_headp, M_NOWAIT, EM_MAX_SCATTER);
> > +   if (m == NULL)
> > +   m = m_defrag(*m_headp, M_NOWAIT);
> Since a just under 64K TSO segment barely fits in 32 mbuf clusters,
> I'm at least 99% sure the m_collapse() will fail, but it can't hurt to
> try it. (If it supported 33 or 34, I think m_collapse() would have a
> reasonable chance of success.)
> 
> Right now the NFS and krpc code creates 2 small mbufs in front of the
> read/write data clusters and I think the TCP layer adds another one.
> Even if this was modified to put it all in one cluster, I don't think
> m_collapse() would succeed, since it only copies the data up and deletes
> an mbuf from the chain if it will all fit in the preceding one. Since
> the read/write data clusters are full (except the last one), they can't
> fit in the M_TRAILINGSPACE() of the preceding one unless it is empty
> from my reading of m_collapse().

Correct, ok.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: System Booting Kernel from Secondary Drive

2014-07-10 Thread John Baldwin
On Thursday, July 03, 2014 10:06:21 am Laurie Jennings via freebsd-net wrote:
> I'm having a problem with a Supermicro system running FreeBSD 9.1. Sometimes 
when I upgrade the kernel in my main drive (ada0), 
> the system boots the kernel from the 2nd drive. It only happens sometimes. 
ada0 is mounted. but the system is running the old kernel. 
> Pulling the 2nd fixed the problem.
> 
> What can cause this to happen? Is it a supermicro problem (it's a 5017R-MTF 
superserver) or is it something with FreeBSD. 

Are you using a software RAID between the two disks?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Add netbw option to systat

2014-07-10 Thread John Baldwin
On Wednesday, July 02, 2014 8:54:41 pm hiren panchasara wrote:
> On Wed, Jul 2, 2014 at 4:50 PM, Bryan Venteicher
>  wrote:
> > Awhile back, DragonlyFlyBSD added a netbw option to systat that I've 
ported
> > to FreeBSD and found handy at various times:
> >
> >netbw  Display aggregate and per-connection TCP receive and 
transmit
> >   rates.  Only active TCP connections are shown.
> >
> > Leading to output such as:
> >
> > tcp acceptsconnects rcv 1.192G snd 15.77K rexmit
> >
> >   192.168.10.80:22  192.168.10.20:23103 rcvsnd 415.7  [  NTSX 
]
> >   192.168.10.80:22  192.168.10.20:46560 rcv 19.80M snd 14.47K [  NTSX 
]
> >   192.168.10.80:22  192.168.10.20:60699 rcvsnd 886.3  [  NTSX 
]
> >   192.168.10.81:5201192.168.10.51:60844 rcv 293.2M snd[R  TSX 
]
> >   192.168.10.81:5201192.168.10.51:60845 rcv 293.5M snd[R  TSX 
]
> >   192.168.10.81:5201192.168.10.51:60846 rcv 293.2M snd[R  TSX 
]
> >   192.168.10.81:5201192.168.10.51:60847 rcv 292.9M snd[R  TSX 
]
> >
> > It uses the sequences number from the 'struct tcpcb' to derive the rates,
> > which is usually good but certainly not perfect (i.e., don't set the
> > interval too long).
> >
> > I'd like to commit this if anybody else thinks they'd find it useful.
> >
> > http://people.freebsd.org/~bryanv/patches/systat-netbw.patch
> 
> I like the idea.

I also like the idea.

> A few things about the patch:
> 1) You may want to remove the code hidden behind "#if 0" at 2 places.
> 2) I am not entirely clear on why/if we need the last column with
> flags but if we keep it (for compatibility of any other reason), It
> would be nice to have those flags explained in the manpage:
> 
> + mvwprintw(wnd, LINES-2, 0,
> +  "Rate/sec, "
> +  "R=rxpend T=txpend N=nodelay T=tstmp "
> +  "S=sack X=winscale F=fastrec");
> 3) I feel that the header line for o/p (specially 'tcp accepts and
> connects' terminology) can be improved but I do not have a better
> suggestion :-)

4) Should numtok() just be humanize_number?  Or rather, would it simplify
   the code to use humanize_number?  (It might not, but if it does, I
   think that would be preferable.)

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: r256920 missing in stable/9 and releng/9.3

2014-07-10 Thread John Baldwin
On Monday, July 07, 2014 12:58:53 pm hiren panchasara wrote:
> + freebsd-net@,
> 
> On Mon, Jul 7, 2014 at 4:37 AM, Harald Schmalzbauer
>  wrote:
> > BezĂĽglich Jan Mikkelsen's Nachricht vom 24.06.2014 04:49 (localtime):
> >> Hi,
> >>
> >> I’m bringing 9.3-RC1 into our local Perforce depot and moving our local 
patches to 9.2 forward.
> >>
> >> I noticed that r256920 (changing sys/netinet/tcp_input.c) has not been 
MFC’d. It was listed as “MFC after 3 days” back in October 2013.
> >>
> >> Is this patch missing for a reason?
> >
> > I'm wondering too if there's any good reason not to MFC?
> 
> I also don't see any obvious reason.
> 
> If nobody objects on -net@, I can do it.

I think this looks fine to merge.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: NFS client READ performance on -current

2014-07-10 Thread John Baldwin
On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> Russell L. Carter wrote:
> > 
> > 
> > On 07/02/14 19:09, Rick Macklem wrote:
> > 
> > > Could you please post the dmesg stuff for the network interface,
> > > so I can tell what driver is being used? I'll take a look at it,
> > > in case it needs to be changed to use m_defrag().
> > 
> > em0:  port 0xd020-0xd03f
> > mem
> > 0xfe4a-0xfe4b,0xfe48-0xfe49 irq 44 at device 0.0 on
> > pci2
> > em0: Using an MSI interrupt
> > em0: Ethernet address: 00:15:17:bc:29:ba
> > 001.07 [2323] netmap_attach success for em0 tx 1/1024
> > rx
> > 1/1024 queues/slots
> > 
> > This is one of those dual nic cards, so there is em1 as well...
> > 
> Well, I took a quick look at the driver and it does use m_defrag(), but
> I think that the "retry:" label it does a goto after doing so might be in
> the wrong place.
> 
> The attached untested patch might fix this.
> 
> Is it convenient to build a kernel with this patch applied and then try
> it with TSO enabled?
> 
> rick
> ps: It does have the transmit segment limit set to 32. I have no idea if
> this is a hardware limitation.

I think the retry is not in the wrong place, but the overhead of all those
pullups is apparently quite severe.  It would be interesting to test the
following in addition to your change to see if it improves performance
further:

Index: if_em.c
===
--- if_em.c (revision 268495)
+++ if_em.c (working copy)
@@ -1959,7 +1959,9 @@ retry:
if (error == EFBIG && remap) {
struct mbuf *m;
 
-   m = m_defrag(*m_headp, M_NOWAIT);
+   m = m_collapse(*m_headp, M_NOWAIT, EM_MAX_SCATTER);
+   if (m == NULL)
+       m = m_defrag(*m_headp, M_NOWAIT);
if (m == NULL) {
adapter->mbuf_alloc_failed++;
m_freem(*m_headp);


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Ordering problem in if_detach_internal regarding if_bridge

2014-06-24 Thread John Baldwin
On Monday, June 23, 2014 1:12:54 pm Roger Pau Monné wrote:
> On 23/06/14 18:49, Alexander V. Chernikov wrote:
> > On 23.06.2014 20:39, Alexander V. Chernikov wrote:
> >> On 23.06.2014 19:32, John Baldwin wrote:
> >>> On Friday, June 20, 2014 11:25:51 am Roger Pau Monné wrote:
> >>>> Hello,
> >>>>
> >>>> I've stumbled across the following panic when testing Xen netback with 
> >>>> if_bridge:
> >>>>
> >>>> Kernel page fault with the following non-sleepable locks held:
> >>>> exclusive sleep mutex if_bridge (if_bridge) r = 0 (0xf80006306c18) 
> >>> locked @ /usr/src/sys/m
> >>>> KDB: stack backtrace:
> >>>> X_db_symbol_values() at X_db_symbol_values+0x10b/frame 0xfe213490
> >>>> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfe213540
> >>>> witness_warn() at witness_warn+0x4a8/frame 0xfe213600
> >>>> trap() at trap+0xc9d/frame 0xfe2136a0
> >>>> trap() at trap+0x669/frame 0xfe2138b0
> >>>> calltrap() at calltrap+0x8/frame 0xfe2138b0
> >>>> --- trap 0xc, rip = 0x8221a0ef, rsp = 0xfe213970, rbp = 
> >>> 0xfe2139e0 ---
> >>>> bridge_input() at bridge_input+0x5ff/frame 0xfe2139e0
> >>>> ether_vlanencap() at ether_vlanencap+0x4a3/frame 0xfe213a10
> >>>> netisr_dispatch_src() at netisr_dispatch_src+0x90/frame 
> >>>> 0xfe213a80
> >>>> ether_ifattach() at ether_ifattach+0x19f/frame 0xfe213ab0
> >>>> ath_dfs_get_thresholds() at ath_dfs_get_thresholds+0x81ce/frame 
> >>> 0xfe213b30
> >>>> intr_event_execute_handlers() at intr_event_execute_handlers+0x93/frame 
> >>> 0xfe213b70
> >>>> db_dump_intr_event() at db_dump_intr_event+0x796/frame 0xfe213bb0
> >>>> fork_exit() at fork_exit+0x84/frame 0xfe213bf0
> >>>> fork_trampoline() at fork_trampoline+0xe/frame 0xfe213bf0
> >>>> --- trap 0, rip = 0, rsp = 0xfe213cb0, rbp = 0 ---
> >>>>
> >>>> I've tracked this down to if_detach_internal setting ifp->if_addr to 
> >>>> NULL before calling EVENTHANDLER_INVOKE(ifnet_departure_event..., which 
> >>>> causes a panic in GRAB_OUR_PACKETS in the if_bridge code when it tries 
> >>>> to perform IF_LLADDR on an interface that's in the process of being 
> >>>> destroyed (ifp->if_addr set to NULL, but the ifnet_departure_event event 
> >>>> has not fired yet).
> >>>>
> >>>> I have the following naive patch that moves the firing of the event 
> >>>> before if_addr is set to NULL, but I'm not familiar with the ordering 
> >>>> in if_detach_internal, so I'm not sure if this might cause problems in 
> >>>> other parts of the code, could someone familiar with the net stuff 
> >>>> comment on the best way to deal with it?
> >>
> >> We should notify kernel customers only when we are really taking this
> >> interface down and every other subsystem cannot add any new state to the
> >> interface.
> >>
> >> In this patch you're sending notification before taking ifnet down,
> >> removing its L3 addresses, routes, and so on.
> >>
> >> This can easily lead to panic in, for example, BPF subsystem (since BPF
> >> state is freed in bpf_ifdetach() handler).
> >>
> >> Addintionally, this will introduce ifaddr / iface messages reversal for
> >> rtsock.
> > Whoops. I misread the patch.
> > It should be OK.
> > 
> >>
> >> It looks like we'd better fix if_bridge (and it is still using mutexes,
> >> what a shame!).
> >>
> >> Can you send me trace with line numbers?
> > However, these two still stands.
> > (And I'm wondering how you're getting any traffic on down/dying interface).
> 
> I'm not getting the traffic from the dying interface, I'm getting the
> traffic from another interface on the bridge (a physical bce interface),
> which injects traffic into the bridge, that calls bridge_input, which
> tries to read ifp->if_addr->ifa_addr from the dying interface, and that
> leads to the panic.
> 
> Line numbers:
> 
> /usr/src/sys/modules/if_bridge/../../net/if_bridge.c:2410 (bridge_input)
> /usr/src/sys/net/if_ethersubr.c:543 (ether_input_internal)
> /usr/src/sys/net/netisr.c:972 (netisr_dispatch_src)
> /usr/src/sys/net/if_ethersubr.c:674 (ether_input)
> /usr/src/sys/dev/bce/if_bce.c:6861 (bce_rx_intr)

I think this certainly suggests moving at least the eventhandler up so that
things like vlans and bridges can detach from an interface while it is still
constructed.  I do think it would be ideal to move all three notifications
to the same place though.  (So your original patch plus moving the
routing socket message.)

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Ordering problem in if_detach_internal regarding if_bridge

2014-06-23 Thread John Baldwin
On Friday, June 20, 2014 11:25:51 am Roger Pau Monné wrote:
> Hello,
> 
> I've stumbled across the following panic when testing Xen netback with 
> if_bridge:
> 
> Kernel page fault with the following non-sleepable locks held:
> exclusive sleep mutex if_bridge (if_bridge) r = 0 (0xf80006306c18) 
locked @ /usr/src/sys/m
> KDB: stack backtrace:
> X_db_symbol_values() at X_db_symbol_values+0x10b/frame 0xfe213490
> kdb_backtrace() at kdb_backtrace+0x39/frame 0xfe213540
> witness_warn() at witness_warn+0x4a8/frame 0xfe213600
> trap() at trap+0xc9d/frame 0xfe2136a0
> trap() at trap+0x669/frame 0xfe2138b0
> calltrap() at calltrap+0x8/frame 0xfe2138b0
> --- trap 0xc, rip = 0x8221a0ef, rsp = 0xfe213970, rbp = 
0xfe2139e0 ---
> bridge_input() at bridge_input+0x5ff/frame 0xfe2139e0
> ether_vlanencap() at ether_vlanencap+0x4a3/frame 0xfe213a10
> netisr_dispatch_src() at netisr_dispatch_src+0x90/frame 0xfe213a80
> ether_ifattach() at ether_ifattach+0x19f/frame 0xfe213ab0
> ath_dfs_get_thresholds() at ath_dfs_get_thresholds+0x81ce/frame 
0xfe213b30
> intr_event_execute_handlers() at intr_event_execute_handlers+0x93/frame 
0xfe213b70
> db_dump_intr_event() at db_dump_intr_event+0x796/frame 0xfe213bb0
> fork_exit() at fork_exit+0x84/frame 0xfe213bf0
> fork_trampoline() at fork_trampoline+0xe/frame 0xfe213bf0
> --- trap 0, rip = 0, rsp = 0xfe213cb0, rbp = 0 ---
> 
> I've tracked this down to if_detach_internal setting ifp->if_addr to 
> NULL before calling EVENTHANDLER_INVOKE(ifnet_departure_event..., which 
> causes a panic in GRAB_OUR_PACKETS in the if_bridge code when it tries 
> to perform IF_LLADDR on an interface that's in the process of being 
> destroyed (ifp->if_addr set to NULL, but the ifnet_departure_event event 
> has not fired yet).
> 
> I have the following naive patch that moves the firing of the event 
> before if_addr is set to NULL, but I'm not familiar with the ordering 
> in if_detach_internal, so I'm not sure if this might cause problems in 
> other parts of the code, could someone familiar with the net stuff 
> comment on the best way to deal with it?

Hmmm, I have no idea if this is ok or not.  I do think the route message 
should go out at the same time as the devctl_notify() call however.  My guess 
is it is actually better to do this earlier so that we allow outside consumers
to detach from an interface before it is destroyed.  I'm not sure if it would
break things, but I would be tempted to move this even earlier right after it
is removed from the global ifnet list but before the taskqueue_drain, etc.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: LAN network performance issues

2014-03-07 Thread John Baldwin
On Friday, March 07, 2014 12:17:05 am jcv wrote:
> Hi - I am seeing some strange IPERF results.. Everything goes through my 
> WIFI/GIGABIT router.
> 
> For these tests everything is plugged directly into the router via 
> Ethernet cable.
> 
> My issue is the transfer rate from Windows to FreeBSD.
> 
> There are 3 different computers in this lab running 3 different OS.
> 
> Here are the results:
> 
> 
> 
> FreeBSD as server:
> 
> [vic@yeaguy ~] iperf -s
> 
> Server listening on TCP port 5001
> TCP window size: 64.0 KByte (default)
> 
> 
> 
> [  4] local 192.168.1.3 port 5001 connected with 192.168.1.8 port 52505
> [ ID] Interval   Transfer Bandwidth
> [  4]  0.0-10.1 sec   157 MBytes  131 Mbits/sec <- WINDOWS 8.1 as 
> client on same LAN/ROUTER
> 
> 
> 
> 
> [  5] local 192.168.1.3 port 5001 connected with 192.168.1.12 port 60926
> [  5]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec <-- MACBOOK PRO as 
> client on same LAN/ROUTER
> 
> 
> Windows as the server:
> 
> 
> Server listening on TCP port 5001
> TCP window size: 64.0 KByte (default)
> 
> [  4] local 192.168.1.8 port 5001 connected with 192.168.1.3 port 60529
> [ ID] Interval   Transfer Bandwidth
> [  4]  0.0-10.0 sec  1014 MBytes   850 Mbits/sec <- Freebsd 10 as 
> client on same LAN/ROUTER
> 
> 
> 
> [  4] local 192.168.1.8 port 5001 connected with 192.168.1.12 port 60933
> [  4]  0.0-10.0 sec  1.08 GBytes   931 Mbits/sec <-- MACBOOK PRO as 
> client on same LAN/ROUTER
> 
> 
> 
> Macbook Pro as the server:
> 
> [  3] local 192.168.1.8 port 52509 connected with 192.168.1.12 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0-10.0 sec   823 MBytes   690 Mbits/sec <-- WINDOWS 8.1 as 
> client on same LAN/ROUTER
> 
> [  3] local 192.168.1.3 port 23190 connected with 192.168.1.12 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0-10.0 sec  1016 MBytes   852 Mbits/sec <-- Freebsd 10 as 
> client on same LAN/ROUTER
> 
> 
> With FreeBSD being the server, Windows transfer to FreeBSD is slow, 
> compared to Macbook to FreeBSD transfer..
> With Windows as the server, FreeBSD and Macbook to Windows transfer is 
> great.
> With Macbook as server, Windows and FreeBSD transfer is good.
> 
> The only bad transfer is Windows to FreeBSD. Windows transfer to Mac is 
> good. Cant really blame Windows for the poor transfer to FreeBSD then. 
> Macbook to FreeBSD is outstanding, cant really blame FreeBSD for poor 
> receive performance.

Can you tell us more about the FreeBSD box such as the NIC being used?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: QLE3142-CU-CK driver (NetXen NX3031 chipset)

2013-09-20 Thread John Baldwin
On Tuesday, September 17, 2013 12:59:44 pm Ryan McIntosh wrote:
> This particular chipset used in this card has been brought up in the past
> under threads about HP's NC375 controller. The outcome was that it needed
> information/assistance from Qlogic to develop a driver. I have 2 of the
> cards mentioned in the title (details below) and I've finally gotten a
> useful response from Qlogic about opening channels to assist FreeBSD with
> Qlogic hardware; however they've requested a FreeBSD developer's contact
> information to get in touch with if anyone is up for the challenge?

QLogic employs one FreeBSD developer already who maintains the qlxgb(4),
qlxge(4), and qlxgbe(4) drivers: David C Somayajulu .

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Does pthread_set_name_np() work?

2013-09-03 Thread John Baldwin
On Wednesday, August 21, 2013 12:36:09 pm Laurie Jennings wrote:
> Im trying to set the names of threads so I can distinguish them in top -H, 
but it doesn't seem to 
> take the thread id as valid.
> 
> err=pthread_set_name_np(pthread_self(),"FOO");

This function returns void, not an error, so you can't trust the return value.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: [rfc] migrate lagg to an rmlock

2013-08-29 Thread John Baldwin
On Thursday, August 29, 2013 11:37:08 am Scott Long wrote:
> 
> On Aug 29, 2013, at 7:42 AM, John Baldwin  wrote:
> 
> > On Saturday, August 24, 2013 10:16:33 am Robert Watson wrote:
> >> There are a number of other places in the kernel where migration to an 
> >> rmlock 
> >> makes sense -- however, some care must be taken for four reasons: (1) 
> >> while 
> >> read locks don't experience line contention, write locking becomes 
> >> observably 
> >> e.g., rmlocks might not be suitable for tcbinfo; (2) rmlocks, unlike 
> >> rwlocks, 
> >> more expensive so is not suitable for all rwlock line contention spots -- 
> >> implement reader priority propagation, so you must reason about; and (3) 
> >> historically, rmlocks have not fully implemented WITNESS so you may get 
> >> less 
> >> good debugging output.  if_lagg is a nice place to use rmlocks, as 
> >> reconfigurations are very rare, and it's really all about long-term data 
> >> stability.
> > 
> > 3) should no longer be an issue.  rmlocks now have full WITNESS and 
> > assertion
> > support (including an rm_assert).
> > 
> > However, one thing to consider is that rmlocks pin readers to CPUs while the
> > read lock is held (which rwlocks do not do).
> 
> And this is not a problem for the application that we're giving it in the
> lagg driver.

That is likely true.  I was merely tweaking Robert's general guidelines re: 
rmlock.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: [rfc] migrate lagg to an rmlock

2013-08-29 Thread John Baldwin
On Saturday, August 24, 2013 10:16:33 am Robert Watson wrote:
> There are a number of other places in the kernel where migration to an rmlock 
> makes sense -- however, some care must be taken for four reasons: (1) while 
> read locks don't experience line contention, write locking becomes observably 
> e.g., rmlocks might not be suitable for tcbinfo; (2) rmlocks, unlike rwlocks, 
> more expensive so is not suitable for all rwlock line contention spots -- 
> implement reader priority propagation, so you must reason about; and (3) 
> historically, rmlocks have not fully implemented WITNESS so you may get less 
> good debugging output.  if_lagg is a nice place to use rmlocks, as 
> reconfigurations are very rare, and it's really all about long-term data 
> stability.

3) should no longer be an issue.  rmlocks now have full WITNESS and assertion
support (including an rm_assert).

However, one thing to consider is that rmlocks pin readers to CPUs while the
read lock is held (which rwlocks do not do).

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for fragmented packets

2013-08-07 Thread John Baldwin
On Monday, August 05, 2013 6:49:01 am Meny Yossefi wrote:
> John, 
> 
> Will this be committed to 9.2 as well ?

Yes, I committed it yesterday.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/180791: [ofed] [patch] Kernel crash on ifdown and kldunload mlxen

2013-07-25 Thread John Baldwin
The following reply was made to PR kern/180791; it has been noted by GNATS.

From: John Baldwin 
To: bug-follo...@freebsd.org,
 shah...@mellanox.com
Cc:  
Subject: Re: kern/180791: [ofed] [patch] Kernel crash on ifdown and kldunload 
mlxen
Date: Thu, 25 Jul 2013 15:18:06 -0400

 Thanks.  One note is that it seems like the state_lock should be held when
 stop is called.  Also, the callout used for stats should be drained during
 detach.  (If you use callot_init_mtx() instead of callout_init() then
 callout_stop() under the lock is race-free, though I'd still do the
 callout_drain() during detach to be safe.)
 
 Also, I had some other changes I made to this file to make the locking more
 consistent with other NIC drivers in the tree.  Can you look at this and
 test this patch please?
 
 Index: en_netdev.c
 ===
 --- en_netdev.c(revision 253547)
 +++ en_netdev.c(working copy)
 @@ -495,11 +495,6 @@ static void mlx4_en_do_get_stats(struct work_struc
  
queue_delayed_work(mdev->workqueue, &priv->stats_task, 
STATS_DELAY);
}
 -  if (mdev->mac_removed[MLX4_MAX_PORTS + 1 - priv->port]) {
 -  panic("mlx4_en_do_get_stats: Unexpected mac removed for %d\n",
 -  priv->port);
 -  mdev->mac_removed[MLX4_MAX_PORTS + 1 - priv->port] = 0;
 -  }
mutex_unlock(&mdev->state_lock);
  }
  
 @@ -688,8 +683,8 @@ int mlx4_en_start_port(struct net_device *dev)
mlx4_en_set_multicast(dev);
  
/* Enable the queues. */
 -  atomic_clear_int(&dev->if_drv_flags, IFF_DRV_OACTIVE);
 -  atomic_set_int(&dev->if_drv_flags, IFF_DRV_RUNNING);
 +  dev->if_drv_flags &= ~IFF_DRV_OACTIVE;
 +  dev->if_drv_flags |= IFF_DRV_RUNNING;
  
callout_reset(&priv->watchdog_timer, MLX4_EN_WATCHDOG_TIMEOUT,
mlx4_en_watchdog_timeout, priv);
 @@ -761,7 +756,7 @@ void mlx4_en_stop_port(struct net_device *dev)
  
callout_stop(&priv->watchdog_timer);
  
 -  atomic_clear_int(&dev->if_drv_flags, IFF_DRV_RUNNING);
 +  dev->if_drv_flags &= ~(IFF_DRV_RUNNING | IFF_DRV_OACTIVE);
  }
  
  static void mlx4_en_restart(struct work_struct *work)
 @@ -802,19 +797,30 @@ mlx4_en_init(void *arg)
  {
struct mlx4_en_priv *priv;
struct mlx4_en_dev *mdev;
 +
 +  priv = arg;
 +  mdev = priv->mdev;
 +  mutex_lock(&mdev->state_lock);
 +  mlx4_en_init_locked(priv);
 +  mutex_unlock(&mdev->state_lock);
 +}
 +
 +static void
 +mlx4_en_init_locked(struct mlx4_en_priv *priv)
 +{
 +
 +  struct mlx4_en_dev *mdev;
struct ifnet *dev;
int i;
  
 -  priv = arg;
dev = priv->dev;
mdev = priv->mdev;
 -  mutex_lock(&mdev->state_lock);
if (dev->if_drv_flags & IFF_DRV_RUNNING)
mlx4_en_stop_port(dev);
  
if (!mdev->device_up) {
en_err(priv, "Cannot open - device down/disabled\n");
 -  goto out;
 +  return;
}
  
/* Reset HW statistics and performance counters */
 @@ -835,9 +841,6 @@ mlx4_en_init(void *arg)
mlx4_en_set_default_moderation(priv);
if (mlx4_en_start_port(dev))
en_err(priv, "Failed starting port:%d\n", priv->port);
 -
 -out:
 -  mutex_unlock(&mdev->state_lock);
  }
  
  void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 @@ -927,9 +930,14 @@ void mlx4_en_destroy_netdev(struct net_device *dev
if (priv->sysctl)
sysctl_ctx_free(&priv->conf_ctx);
  
 +  mutex_lock(&mdev->state_lock);
 +  mlx4_en_stop_port(dev);
 +  mutex_unlock(&mdev->state_lock);
 +
cancel_delayed_work(&priv->stats_task);
/* flush any pending task for this netdev */
flush_workqueue(mdev->workqueue);
 +  callout_drain(&priv->watchdog_timer);
  
/* Detach the netdev so tasks would not attempt to access it */
mutex_lock(&mdev->state_lock);
 @@ -1091,25 +1099,25 @@ static int mlx4_en_ioctl(struct ifnet *dev, u_long
error = -mlx4_en_change_mtu(dev, ifr->ifr_mtu);
break;
case SIOCSIFFLAGS:
 +  mutex_lock(&mdev->state_lock);
if (dev->if_flags & IFF_UP) {
 -  if ((dev->if_drv_flags & IFF_DRV_RUNNING) == 0) {
 -  mutex_lock(&mdev->state_lock);
 +  if ((dev->if_drv_flags & IFF_DRV_RUNNING) == 0)
mlx4_en_start_port(dev);
 -  mutex_unlock(&mdev->state_lock);
 -  } else
 +  else
mlx4_en_set_multicast(dev);
  

Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for fragmented packets

2013-07-22 Thread John Baldwin
The following reply was made to PR kern/180430; it has been noted by GNATS.

From: John Baldwin 
To: Meny Yossefi 
Cc: "bug-follo...@freebsd.org" 
Subject: Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for fragmented 
packets
Date: Mon, 22 Jul 2013 11:40:08 -0400

 On Monday, July 22, 2013 10:11:51 am Meny Yossefi wrote:
 > Hi John,
 > 
 > 
 > 
 > The problem is that the HW will only calculate csum for parts of the 
 > payload, one fragment at a time,
 > 
 > while the receiver side, in our case the tcp/ip stack, will expect to 
 > validate the packet's payload as a whole.
 
 Ok.
 
 > I agree with the change you offered, though one thing bothers me.
 > 
 > This change will add two conditions to the send flow which will probably 
 > have an effect on performance.
 > 
 > Maybe 'likely' can be useful here ?
 
 FreeBSD tends to not use likely/unlikely unless there is a demonstrable gain
 via benchmark measurements.  However, if the OFED code regularly uses it and
 you feel this is a case where you would normally use it, it can be added.
 
 > BTW, I'm not entirely sure, but I think the CSUM_IP flag is always set, so 
 > maybe the first condition is not necessary.
 > 
 > What do you think ?
 
 If the user uses ifconfig to disable checksum offload and force software
 checksums the flag will not be set.
 
 > -Meny
 > 
 > 
 > 
 > 
 > 
 > -Original Message-
 > From: John Baldwin [mailto:j...@freebsd.org]
 > Sent: Friday, July 19, 2013 6:29 PM
 > To: bug-follo...@freebsd.org; Meny Yossefi
 > Subject: Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for 
 > fragmented packets
 > 
 > 
 > 
 > Oops, my previous reply didn't make it to the PR itself.
 > 
 > 
 > 
 > Is the problem that the hardware checksum overwrites arbitrary data in the 
 > packet (at the location where the UDP header would be)?
 > 
 > 
 > 
 > Also, I've looked at other drivers, and they all look at the CSUM_* flags to 
 > determine the right settings.  IP fragments will not have CSUM_UDP or 
 CSUM_TCP set, so I think the more correct fix is this:
 > 
 > 
 > 
 > Index: en_tx.c
 > 
 > ===
 > 
 > --- en_tx.c   (revision 253470)
 > 
 > +++ en_tx.c(working copy)
 > 
 > @@ -780,8 +780,12 @@ retry:
 > 
 >tx_desc->ctrl.srcrb_flags = 
 > cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 > 
 >  
 >   MLX4_WQE_CTRL_SOLICITED);
 > 
 >if (mb->m_pkthdr.csum_flags & (CSUM_IP|CSUM_TCP|CSUM_UDP)) {
 > 
 > -  tx_desc->ctrl.srcrb_flags |= 
 > cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
 > 
 > -
 >   MLX4_WQE_CTRL_TCP_UDP_CSUM);
 > 
 > + if (mb->m_pkthdr.csum_flags & CSUM_IP)
 > 
 > + tx_desc->ctrl.srcrb_flags |=
 > 
 > + 
 > cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM);
 > 
 > + if (mb->m_pkthdr.csum_flags & 
 > (CSUM_TCP|CSUM_UDP))
 > 
 > +         tx_desc->ctrl.srcrb_flags |=
 > 
 > + 
 > cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM);
 > 
 >priv->port_stats.tx_chksum_offload++;
 > 
 >}
 > 
 > 
 > 
 > --
 > 
 > John Baldwin
 > 
 
 -- 
 John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for fragmented packets

2013-07-19 Thread John Baldwin
The following reply was made to PR kern/180430; it has been noted by GNATS.

From: John Baldwin 
To: bug-follo...@freebsd.org,
 me...@mellanox.com
Cc:  
Subject: Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for fragmented 
packets
Date: Fri, 19 Jul 2013 11:13:44 -0400

 Oops, my previous reply didn't make it to the PR itself.
 
 Is the problem that the hardware checksum overwrites arbitrary data in the 
 packet (at the location where the UDP header would be)?
 
 Also, I've looked at other drivers, and they all look at the CSUM_*
 flags to determine the right settings.  IP fragments will not have
 CSUM_UDP or CSUM_TCP set, so I think the more correct fix is this:
 
 Index: en_tx.c
 ===
 --- en_tx.c(revision 253470)
 +++ en_tx.c(working copy)
 @@ -780,8 +780,12 @@ retry:
tx_desc->ctrl.srcrb_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
MLX4_WQE_CTRL_SOLICITED);
if (mb->m_pkthdr.csum_flags & (CSUM_IP|CSUM_TCP|CSUM_UDP)) {
 -  tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
 -   
MLX4_WQE_CTRL_TCP_UDP_CSUM);
 +  if (mb->m_pkthdr.csum_flags & CSUM_IP)
 +  tx_desc->ctrl.srcrb_flags |=
 +  cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM);
 +  if (mb->m_pkthdr.csum_flags & (CSUM_TCP|CSUM_UDP))
 +  tx_desc->ctrl.srcrb_flags |=
 +  cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM);
priv->port_stats.tx_chksum_offload++;
}
  
 
 -- 
 John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: bind error when using SO_REUSEPORT(implies SO_REUSEADDR)

2013-07-18 Thread John Baldwin
On Wednesday, July 17, 2013 5:23:37 pm Mikolaj Golub wrote:
> On Tue, Jul 16, 2013 at 11:12:46AM -0400, John Baldwin wrote:
> > On Thursday, March 15, 2012 8:07:46 pm Sean Bruno wrote:
> > > On Thu, 2012-03-15 at 16:59 -0700, Sean Bruno wrote:
> > > > Hey, I just found a bind bug ticket in my queue about bind.  I noted
> > > > that on stable/6 stable/7 stable/9 & head the referenced code "fails".
> > > > 
> > > > It seems that this is a problem, but I have no idea if its a real
> > > > problem or not.  Our devs think it is.  Anyway, here is a code snippet
> > > > to show the failure in bind.  On linux/solaris this does not fail.
> > > > 
> > > > http://people.freebsd.org/~sbruno/bind_test.c
> > > > 
> > > > simple compile with gcc -o test test.c and run as normal user.
> > > > 
> > > > Sean
> > > > 
> > > 
> > > this is bind() not bind ... :-)
> > 
> > Did the recent commit to HEAD fix this btw?
> 
> As for me, bind_test.c does not expose any bug in freebsd, it only
> shows different behavior for freebsd and linux.
> 
> On freebsd the test output is:
> 
> serversock addr is 127.0.0.1:27539
> dup bind: Address already in use
> This error was expected, tried to bind to used addr/port
> BUG: binding duplicate socket to server port succeeded
> dup2sock addr is 0.0.0.0:27539
> overlapping explicit bind to same port number succeeded without SO_REUSEPORT
> listen succeeded after explicitly overlapping port bind
> autosock addr is 0.0.0.0:27539
> bug triggered, port number conflict on sockets without SO_REUSEPORT
> listen succeded after implicitly overlapping port bind
> 
> So, the first socket (serversock) is bound to the loopback address,
> then it tries some combinations of binding the second socket to the
> same port but to the wildcard address. When SO_REUSEADDR socket option
> is set, binding to the wildcard address succeeds for freebsd (and
> fails for linux).
> 
> They call this a bug in freebsd, but this is well known and expected
> behavior (see e.g. Stevens' TCP/IP Illustrated Vol1). 
> 
> Or I missed the test's point?

No, that is probably true.  I wasn't sure if it was a bug or not when Sean 
originally posted it.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: bind error when using SO_REUSEPORT(implies SO_REUSEADDR)

2013-07-16 Thread John Baldwin
On Thursday, March 15, 2012 8:07:46 pm Sean Bruno wrote:
> On Thu, 2012-03-15 at 16:59 -0700, Sean Bruno wrote:
> > Hey, I just found a bind bug ticket in my queue about bind.  I noted
> > that on stable/6 stable/7 stable/9 & head the referenced code "fails".
> > 
> > It seems that this is a problem, but I have no idea if its a real
> > problem or not.  Our devs think it is.  Anyway, here is a code snippet
> > to show the failure in bind.  On linux/solaris this does not fail.
> > 
> > http://people.freebsd.org/~sbruno/bind_test.c
> > 
> > simple compile with gcc -o test test.c and run as normal user.
> > 
> > Sean
> > 
> 
> this is bind() not bind ... :-)

Did the recent commit to HEAD fix this btw?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: misc/179033: [dc] dc ethernet driver seems to have issues with some multiport card and mother board combinations

2013-07-16 Thread John Baldwin
On Wednesday, June 26, 2013 12:37:13 am Mr. Clif wrote:
> Hi John,
> 
> Thanks for working on this. I'm very interested in getting this fixed 
> for everyone that uses the Affected Atom boards and other small format 
> boards that work well in small custom routers.
> 
> However right now I have a big network upgrade I'm working on and don't 
> have time to get to it until late July, I'm hoping. So please forgive me 
> for the long delay.
> 
>  Thanks for your help,
>  Clif

I've tested your specific case more by hacking the PCI bus driver to assign
a bogus range to my NIC on my netbook and verifying it rejected the request
and allocated a new range.  I did have to fix a bug though, so once you get
a chance to test, please test

http://www.freebsd.org/~jhb/patches/pci_isa_enable2.patch instead.

I will go ahead and commit a slightly cleaned up version (with less debugging)
today, but the patch above will output enough debugging to verify it is working
without requiring a verbose boot.

> John Baldwin wrote:
> > On Monday, June 10, 2013 3:13:11 pm Mr. Clif wrote:
> >> Hi John and Pyun,
> >>
> >> Ok got the new kernel installed and tested. Yes it works! :-) Maybe that
> >> will also fix a simular problem with the sun cards (cas[03]), except I
> >> don't see a define like that in if_cas.c. Suggestions?
> > So I have a possible "real" fix for this.  However, I do not have any 
> > hardware
> > I can find that has a PCI-PCI bridge with the ISA-enable bit set.  I know it
> > compiles and boots fine on other systems.  Can you please try this and 
> > capture
> > the dmesg output?  It would also be good to capture devinfo -u output before
> > and after.
> >
> > http://www.freebsd.org/~jhb/patches/pci_isa_enable.patch
> >
> 
> 

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/180430: [ofed] [patch] Bad UDP checksum calc for fragmented packets

2013-07-11 Thread John Baldwin
On Wednesday, July 10, 2013 6:59:42 am lini...@freebsd.org wrote:
> Old Synopsis: Bad UDP checksum calc for fragmented packets
> New Synopsis: [ofed] [patch] Bad UDP checksum calc for fragmented packets
> 
> Responsible-Changed-From-To: freebsd-bugs->freebsd-net
> Responsible-Changed-By: linimon
> Responsible-Changed-When: Wed Jul 10 10:59:03 UTC 2013
> Responsible-Changed-Why: 
> Over to maintainer(s).
> 
> http://www.freebsd.org/cgi/query-pr.cgi?pr=180430

Is the problem that the hardware checksum overwrites arbitrary data in the 
packet (at the location where the UDP header would be)?

Also, I've looked at other drivers, and they all look at the CSUM_*
flags to determine the right settings.  IP fragments will not have
CSUM_UDP or CSUM_TCP set, so I think the more correct fix is this:

Index: en_tx.c
===
--- en_tx.c (revision 253202)
+++ en_tx.c (working copy)
@@ -780,8 +780,12 @@ retry:
tx_desc->ctrl.srcrb_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
MLX4_WQE_CTRL_SOLICITED);
if (mb->m_pkthdr.csum_flags & (CSUM_IP|CSUM_TCP|CSUM_UDP)) {
-   tx_desc->ctrl.srcrb_flags |= cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM |
-
MLX4_WQE_CTRL_TCP_UDP_CSUM);
+   if (mb->m_pkthdr.csum_flags & CSUM_IP)
+   tx_desc->ctrl.srcrb_flags |=
+   cpu_to_be32(MLX4_WQE_CTRL_IP_CSUM);
+   if (mb->m_pkthdr.csum_flags & (CSUM_TCP|CSUM_UDP)) {
+   tx_desc->ctrl.srcrb_flags |=
+   cpu_to_be32(MLX4_WQE_CTRL_TCP_UDP_CSUM);
    priv->port_stats.tx_chksum_offload++;
}
 

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: misc/179033: [dc] dc ethernet driver seems to have issues with some multiport card and mother board combinations

2013-06-28 Thread John Baldwin
On Wednesday, June 26, 2013 12:37:13 am Mr. Clif wrote:
> Hi John,
> 
> Thanks for working on this. I'm very interested in getting this fixed 
> for everyone that uses the Affected Atom boards and other small format 
> boards that work well in small custom routers.
> 
> However right now I have a big network upgrade I'm working on and don't 
> have time to get to it until late July, I'm hoping. So please forgive me 
> for the long delay.

That is fine.  I've been able to test this on a little netbook I have that has
bridges with the ISA enable bit set and have fixed a few bugs.  The updated
patch is at the URL below.  I wasn't able to test your specific use case yet
however (of the BIOS using an invalid range).

>  Thanks for your help,
>  Clif
> 
> 
> John Baldwin wrote:
> > On Monday, June 10, 2013 3:13:11 pm Mr. Clif wrote:
> >> Hi John and Pyun,
> >>
> >> Ok got the new kernel installed and tested. Yes it works! :-) Maybe that
> >> will also fix a simular problem with the sun cards (cas[03]), except I
> >> don't see a define like that in if_cas.c. Suggestions?
> > So I have a possible "real" fix for this.  However, I do not have any 
> > hardware
> > I can find that has a PCI-PCI bridge with the ISA-enable bit set.  I know it
> > compiles and boots fine on other systems.  Can you please try this and 
> > capture
> > the dmesg output?  It would also be good to capture devinfo -u output before
> > and after.
> >
> > http://www.freebsd.org/~jhb/patches/pci_isa_enable.patch
> >
> 
> 

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/179999: [ofed] [patch] Bug assigning HCA from IB to ETH

2013-06-27 Thread John Baldwin
The following reply was made to PR kern/17; it has been noted by GNATS.

From: John Baldwin 
To: bug-follo...@freebsd.org,
 shah...@mellanox.com
Cc:  
Subject: Re: kern/17: [ofed] [patch] Bug assigning HCA from IB to ETH
Date: Thu, 27 Jun 2013 14:10:42 -0400

 Thanks, I think the sysfs fix has a few issues though (it writes to buf[] even 
 if the copyin() fails, and it doesn't enforce a bounds check).  It does seem 
 that this should probably be using sysctl_handle_string() instead of doing it 
 by hand, but for now I've just adjusted your patch.  Can you please test this 
 version?
 
 Index: ofed/drivers/net/mlx4/main.c
 ===
 --- ofed/drivers/net/mlx4/main.c   (revision 252306)
 +++ ofed/drivers/net/mlx4/main.c   (working copy)
 @@ -479,11 +479,11 @@
int i;
int err = 0;
  
 -  if (!strcmp(buf, "ib\n"))
 +  if (!strcmp(buf, "ib"))
info->tmp_type = MLX4_PORT_TYPE_IB;
 -  else if (!strcmp(buf, "eth\n"))
 +  else if (!strcmp(buf, "eth"))
info->tmp_type = MLX4_PORT_TYPE_ETH;
 -  else if (!strcmp(buf, "auto\n"))
 +  else if (!strcmp(buf, "auto"))
info->tmp_type = MLX4_PORT_TYPE_AUTO;
else {
mlx4_err(mdev, "%s is not supported port type\n", buf);
 Index: ofed/include/linux/sysfs.h
 ===
 --- ofed/include/linux/sysfs.h (revision 252306)
 +++ ofed/include/linux/sysfs.h (working copy)
 @@ -104,10 +104,15 @@
error = SYSCTL_OUT(req, buf, len);
if (error || !req->newptr || ops->store == NULL)
goto out;
 -  error = SYSCTL_IN(req, buf, PAGE_SIZE);
 +  len = req->newlen - req->newidx;
 +  if (len >= PAGE_SIZE)
 +  error = EINVAL;
 +  else 
 +  error = SYSCTL_IN(req, buf, len);
if (error)
goto out;
 -  len = ops->store(kobj, attr, buf, req->newlen);
 +  ((char *)buf)[len] = '\0';
 +  len = ops->store(kobj, attr, buf, len);
if (len < 0)
error = -len;
  out:
 
 -- 
 John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Failed to allocate receive buffer problem

2013-06-25 Thread John Baldwin
On Wednesday, June 12, 2013 3:06:26 am Alex Liptsin wrote:
> Hi.
> 
> I have a problem that when running a ping (or any other traffic) over IPoIB 
port,
> Traffic fails after some time.
> At destination server DMESG I see that errors:
> 
> Jun 11 14:42:11 h-qa-033 kernel: ib1: failed to allocate receive buffer 253
> Jun 11 14:42:12 h-qa-033 kernel: ib1: failed to allocate receive buffer 254
> Jun 11 14:42:13 h-qa-033 kernel: ib1: failed to allocate receive buffer 255
> Jun 11 14:42:14 h-qa-033 kernel: ib1: failed to allocate receive buffer 0
> Jun 11 14:42:15 h-qa-033 kernel: ib1: failed to allocate receive buffer 1
> Jun 11 14:42:16 h-qa-033 kernel: ib1: failed to allocate receive buffer 2
> Jun 11 14:42:17 h-qa-033 kernel: ib1: failed to allocate receive buffer 3
> Jun 11 14:42:18 h-qa-033 kernel: ib1: failed to allocate receive buffer 4
> Jun 11 14:42:19 h-qa-033 kernel: ib1: failed to allocate receive buffer 5
> Jun 11 14:42:20 h-qa-033 kernel: ib1: failed to allocate receive buffer 6
> Jun 11 14:42:21 h-qa-033 kernel: ib1: failed to allocate receive buffer 7
> 
> I work with FreeBSD 9.1.
> 
> Is it a bug or some configuration issues?

Do you see memory allocation errors in netstat -m?

Specifically this line:

0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

If so, it may be that the IPoIB layer has an mbuf leak.  The rest of netstat -
m might be useful here as you can see if any of the zones are full.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: misc/179033: [dc] dc ethernet driver seems to have issues with some multiport card and mother board combinations

2013-06-24 Thread John Baldwin
On Monday, June 10, 2013 3:13:11 pm Mr. Clif wrote:
> Hi John and Pyun,
> 
> Ok got the new kernel installed and tested. Yes it works! :-) Maybe that 
> will also fix a simular problem with the sun cards (cas[03]), except I 
> don't see a define like that in if_cas.c. Suggestions?

So I have a possible "real" fix for this.  However, I do not have any hardware 
I can find that has a PCI-PCI bridge with the ISA-enable bit set.  I know it
compiles and boots fine on other systems.  Can you please try this and capture
the dmesg output?  It would also be good to capture devinfo -u output before 
and after.

http://www.freebsd.org/~jhb/patches/pci_isa_enable.patch

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: How to compile ipoib module manually?

2013-06-05 Thread John Baldwin
On Tuesday, June 04, 2013 5:18:46 am Alex Liptsin wrote:
> I commented on that lines, because I want to compile and load that modules 
manually.
> I had succeed to compile and load mlx4, mlx4ib and mlxen from /sys/modules:
> 
> [root@h-qa-033 mlxen]# kldstat
> Id Refs AddressSize Name
> 1   14 0x8020 13acbd8  kernel
> 21 0x81612000 21e5 if_mos.ko
> 33 0x81615000 124ebmlx4.ko
> 41 0x81628000 e225 mlx4ib.ko
> 51 0x81637000 ec60 mlxen.ko
> 
> The problem is that IPOIB module is missing in /sys/modules.
> 
> 1.  Where can I find it?
> 
> 2.  How can I compile ipoib support?

You will have to create one.  You should be able to use the existing module 
Makefiles as a guide.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: misc/179033: [dc] dc ethernet driver seems to have issues with some multiport card and mother board combinations

2013-05-30 Thread John Baldwin
On Thursday, May 30, 2013 1:12:14 am YongHyeon PYUN wrote:
> On Wed, May 29, 2013 at 08:58:10PM -0700, Mr. Clif wrote:
> > Sorry for the confusion Pyun,
> > 
> > I started looking at it in the context of pfsense, but they rejected my 
> > bug report which was understandable because it's an upstream issue. They 
> > suggested I resubmit it to you guys if I could reproduce it. So I booted 
> > FreeBSD and lo and behold the same two ports failed in exactly the same 
> 
> Ok, I'd like to fix that.

Hmmm, the dc(4) driver is using the I/O port BARs rather than the
memory BARs for its registers and this bug seems to be that the dc(4)
device can't properly access its registers on dc0 and dc1 on the
Atom box.  The one thing I see is that the BIOS on the Atom box assigns
addresses in the 0x1100-0x11ff range for dc0 and dc1.  Those addresses
conflict with ISA I/O aliases for the 0x100-0x1ff range.  The Dell BIOS
is more careful to avoid these ranges.

I think the fix is that I need to fix the PCI-PCI bridge to reject these
resource ranges if the ISA enable bit is set in the bridge's control
register.  However, for the time being you can change dc(4) to use the
memory BAR instead of the I/O port BAR as a workaround:

Index: if_dc.c
===
--- if_dc.c (revision 251132)
+++ if_dc.c (working copy)
@@ -128,7 +128,7 @@ __FBSDID("$FreeBSD$");
 #include 
 #include 
 
-#defineDC_USEIOSPACE
+//#define  DC_USEIOSPACE
 
 #include 
 

If this fixes it then I can take this PR as a test case for handling the ISA 
enable bit in the PCI-PCI bridge code.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Create pkey on FreeBSD 9.1

2013-05-30 Thread John Baldwin
On Thursday, May 30, 2013 3:29:46 am Alex Liptsin wrote:
> Hi John.
> 
> I did it, but there is no ping between the vlans.  Ping without VLANs on 
that ports pass.

Unfortunately I do not have an IB setup to test this.  I also don't know
how IB treats vlans (e.g. does it use an 802.1(q) type header?).  Can you
tcpdump on the ib0 interface and see if your pings on ib0.100 show up and if 
they have the appropriate headers?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: How to switch Datgram/Connected mtu modes?

2013-05-29 Thread John Baldwin
On Sunday, May 26, 2013 7:43:29 am Alex Liptsin wrote:
> Hello.
> 
> I work with FreeBSD 9.1 and Mellanox devices.
> 
> How can I configure MTU in connected mode on FreeBSD 9.1?
> In Linux to enable connected mode for interface ib0, I enter:
> 
>echo connected > /sys/class/net/ib0/mode
> 
> 
> 
> Switching between CM and UD mode can be done in run time:
> 
>echo datagram > /sys/class/net/ib0/mode sets the mode of ib0 to UD
> 
>echo connected > /sys/class/net/ib0/mode sets the mode ib0 to CM
> 
> There is no such directories at FreeBSD. Wat shall I do?

Have you tried looking for dev.ib.0 sysctls?  It looks like the OFED bits in 
FreeBSD map Linux sysfs entries to sysctl nodes, but I don't have a box with 
IB handy to see what it looks like at runtime.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Create pkey on FreeBSD 9.1

2013-05-29 Thread John Baldwin
On Thursday, May 23, 2013 2:36:25 pm Ryan Stone wrote:
> On Thu, May 23, 2013 at 4:32 AM, Alex Liptsin  wrote:
> 
> > Hello.
> >
> > I have FreeBSD 9.1 installed.
> > There is mellanox adapter inside.
> > OFED support is already installed.
> >
> > I try to add pkeys on ib0 port.
> >
> > Usually in  Linux I did:
> >
> > echo 0x800c >  /sys/class/net/ib0/create_child
> >
> > ifconfig -a
> > To Make sure you see a new interface: ib0.800c
> >
> > How can I do it on FreeBSD? There is no "/sys/class/net/ib0/create_child"
> > directory.
> >
> > Regards,
> > Alex Liptsin
> >
> > ___
> > freebsd-net@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
> >
> 
> From reading the source it looks like this is done by attaching a vlan
> interface to the interface.  So try:
> 
> ifconfig vlan create vlandev ib0 vlan 0xc
> 
> This will create a new vlanX interface (ifconfig will its precise name with
> its unit number to stdout).

Simpler though is just 'ifconfig ib0.12 create' (and how most folks
expect subinterfaces to be named).

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: RFC: removing redundant checks in ether_input_internal()

2013-05-29 Thread John Baldwin
On Wednesday, May 22, 2013 10:53:29 am Andre Oppermann wrote:
> On 22.05.2013 14:58, Luigi Rizzo wrote:
> > if_ethersubr.c :: ether_input_internal() is only called as follows:
> >
> >  static void
> >  ether_nh_input(struct mbuf *m)
> >  {
> >
> >  ether_input_internal(m->m_pkthdr.rcvif, m);
> >  }
> >
> > hence the following checks in the body are unnecessary:
> >
> >  if (m->m_pkthdr.rcvif == NULL) {
> >  if_printf(ifp, "discard frame w/o interface pointer\n");
> >  ifp->if_ierrors++;
> >  m_freem(m);
> >  return;
> >  }
> >  #ifdef DIAGNOSTIC
> >  if (m->m_pkthdr.rcvif != ifp) {
> >  if_printf(ifp, "Warning, frame marked as received on 
%s\n",
> >  m->m_pkthdr.rcvif->if_xname);
> >  }
> >  #endif
> >
> > Any objection if i remove them ?
> 
> No, but they should remain as KASSERTs.  None of these should trigger in
> production and all of them are an indication that something is very wrong
> with the packet or the caller.

Eh, but if the only caller is ether_nh_input() then by definition you know
that m->m_pkthdr.rcvif == ifp.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: bpf hold buffer in-use flag

2013-05-23 Thread John Baldwin
On Thursday, May 23, 2013 5:05:39 pm Guy Helmer wrote:
> 
> On Jan 9, 2013, at 2:35 PM, John Baldwin  wrote:
> 
> > On Tuesday, November 13, 2012 4:40:57 pm Guy Helmer wrote:
> >> To try to completely resolve the race in bpfread(), I have put together 
> > these changes to add a flag to indicate when the hold buffer cannot be 
> > modified because it is in use. Since it's my first time using mtx_sleep() 
> > and 
> > wakeup(), I wanted to run these past the list to see if I can get any 
> > feedback 
> > on the approach.
> >> 
> >> 
> >> Index: bpf.c
> >> ===
> >> --- bpf.c  (revision 242997)
> >> +++ bpf.c  (working copy)
> >> @@ -819,6 +819,7 @@ bpfopen(struct cdev *dev, int flags, int fmt, stru
> >> * particular buffer method.
> >> */
> >>bpf_buffer_init(d);
> >> +  d->bd_hbuf_in_use = 0;
> >>d->bd_bufmode = BPF_BUFMODE_BUFFER;
> >>d->bd_sig = SIGIO;
> >>d->bd_direction = BPF_D_INOUT;
> >> @@ -872,6 +873,9 @@ bpfread(struct cdev *dev, struct uio *uio, int iof
> >>callout_stop(&d->bd_callout);
> >>timed_out = (d->bd_state == BPF_TIMED_OUT);
> >>d->bd_state = BPF_IDLE;
> >> +  while (d->bd_hbuf_in_use)
> >> +  mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock,
> >> +  PRINET|PCATCH, "bd_hbuf", 0);
> > 
> > You need to check the return value here, otherwise the PCATCH is useless 
> > (you 
> > will just go back to sleep instead of failing with an error if this is 
> > interrupted by a signal). 
> 
> Thanks for the feedback (sorry it's taken so long to get to it). Would this
> change correctly handle interruptions?

Yes.

> Index: bpf.c
> ===
> --- bpf.c (revision 250941)
> +++ bpf.c (working copy)
> @@ -856,9 +856,14 @@
>   callout_stop(&d->bd_callout);
>   timed_out = (d->bd_state == BPF_TIMED_OUT);
>   d->bd_state = BPF_IDLE;
> - while (d->bd_hbuf_in_use)
> - mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock,
> + while (d->bd_hbuf_in_use) {
> + error = mtx_sleep(&d->bd_hbuf_in_use, &d->bd_lock,
>   PRINET|PCATCH, "bd_hbuf", 0);
> + if (error == EINTR || error == ERESTART) {
> + BPFD_UNLOCK(d);
> + return (error);
> + }
> + }

Maybe simplify the check to just 'if (error != 0)'?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: High CPU interrupt load on intel I350T4 with igb on 8.3

2013-05-20 Thread John Baldwin
On Friday, April 26, 2013 7:31:07 am Clément Hermann (nodens) wrote:
> Hi list,
> 
> We use pf+ALTQ for trafic shaping on some routers.
> 
> We are switching to new servers : Dell PowerEdge R620 with 2 8-cores 
> Intel Processor (E5-2650L), 8GB RAM and Intel I350T4 (quad port) using 
> igb driver. The old hardware is using em driver, the CPU load is high 
> but mostly due to kernel and a large pf ruleset.
> 
> On the new hardware, we see high CPU Interrupt load (up to 95%), even 
> though there is not much trafic currently (peaks about 150Mbps and 
> 40Kpps). All queues are used and binded to a cpu according to top, but a 
> lot of CPU time is spent on igb queues (interrupt or wait). The load is 
> fine when we stay below 20Kpps.
> 
> We see no mbuf shortage, no dropped packet, but there is little margin 
> left on CPU time (about 25% idle at best, most of CPU time is spent on 
> interrupts), which is disturbing.
> 
> We have done some tuning, but to no avail :

If you have the processing_limit set to -1, you should never see CPU time 
spent in the igb task threads (any such time means there is a bug).  One such 
bug was fixed in 8.x here (that is after 8.3):

http://svnweb.freebsd.org/base?view=revision&revision=235553

This may not help with any issues in pf(4), but we had workloads at work (not 
involving pf) where this bug could cause boxes to spend 100% CPU in igb 
threads.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: shm_map questions

2013-04-22 Thread John Baldwin
On Saturday, April 20, 2013 9:18:24 pm Laurie Jennings wrote:
> That does help. Is there a way for the kernel to access the memory map 
directlyby segment name?

There is not, no.  It wouldn't be hard to add, but the issue there is that
the existing shm_map/unmap API assumes you have an open file descriptor and
is tailored to having a userland process provide memory rather than having
the kernel provide a SHM to userland, so even if you added a shm_open() that
gave you a reference on the underlying shm object (struct shmfd *), you would
need a slightly different shm_map/unmap that took that object directly
rather than an fd.

> Laurie
> 
> --- On Thu, 4/18/13, John Baldwin  wrote:
> 
> From: John Baldwin 
> Subject: Re: shm_map questions
> To: freebsd-net@freebsd.org
> Cc: "Laurie Jennings" 
> Date: Thursday, April 18, 2013, 6:50 AM
> 
> On Thursday, April 11, 2013 10:58:14 am Laurie Jennings wrote:
> > Im working on a simple project that shares a memory segment between a user 
> processand a kernel module. I'm having some problems with shm_map and there 
> doesn't seem to be much info on it.
> > Im not sure what happened to the memory when the user process that creates 
> it terminates.  I have some questions.
> > 1) Does the kernel mapping keep the segment from being garbage collected 
> when the use process that creates it terminated? I've experienced 
shm_unmap() 
> panic when tryingto unmap a segment
> > scenario:  
> > User process Maps SegmentKernel maps it  with shm_map()User Process 
> TerminatesKernel tries to shm_unmap() and it panics.
> 
> The kernel mapping bumps the refcount on the underlying vm object, so it 
will
> not go away.  OTOH, you should be keeping your own reference count on the
> associated fd so that you can call shm_unmap().  That is, the model should 
be
> something like:
> 
> struct mydata *foo;
> 
> foo->fp = fget(fd);
> shm_map(fp, &foo->p);
> /* Don't call fdrop */
> 
> and then when unmapping:
> 
> struct mydata *foo;
> 
> shm_unmap(foo->fp, foo->p);
> fdrop(foo->fp);
> 
> > 2) Is there a way for the kernel process to know when the user process has 
> goneaway? A ref count?
> 
> You can install a process_exit EVENTHANDLER if you want to destroy this when 
a
> process goes away.  I have used shm_map/unmap for other objects that already
> had a reference count so I did my shm_unmap when that object was destroyed.
> 
> > 3) Does a SHM_ANON segment persist as long as the kernel has it mapped, or 
> doesit get garbage collected when the creating user process terminates?
> 
> It goes away when the backing 'struct file' goes away.  If you follow the 
> model above of keeping a reference count on the associated struct file then
> it won't go away until you fdrop() after the shm_unmap.
> 
> > 4) When using a named segment, can the kernel "reuse" a mapping for a new 
> userprocess?
> > Example:
> > User process creates shm segment with path /fooKernel Maps shm segment 
with 
> shm_map()User process terminates.User process runs again, opening segment 
/foo
> > Does the kernel need to re-map, or is the original mapping valid?
> 
> The mapping is not per-process, so if you have mapped a shm for /foo and
> mapped it, it will stay mapped until you call shm_unmap.  Multiple processes
> can shm_open /foo and mmap it and they will all share the same memory.
> 
> You could even share a SHM_ANON fd among multiple processes by passing it
> across a UNIX domain socket.
> 
> Hope this helps.
> 
> -- 
> John Baldwin
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
> ___
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
> 

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Network connections are lost from time to time

2013-04-19 Thread John Baldwin
On Friday, April 19, 2013 4:14:43 pm C. L. Martinez wrote:
> On Friday, April 19, 2013, John Baldwin  wrote:
> > On Friday, April 19, 2013 12:32:18 pm C. L. Martinez wrote:
> >> On Friday, April 19, 2013, John Baldwin  wrote:
> >> > On Friday, April 19, 2013 3:11:41 am C. L. Martinez wrote:
> >> >> Hi all,
> >> >>
> >> >>  I have a strange problem with my FreeBSD 9.1 (fully patched): I loose
> >> ssh
> >> >> sessions from time to time frequently.
> >> >>
> >> >>  This fbsd box is installed in an ESXi 5.1 server and I have another
> >> three
> >> >> fbsd 9.1 in the same ESXi host that do not have this problem, but
> maybe
> >> the
> >> >> problem is with my sysctl.conf and loader.conf settings:
> >> >
> >> > Which NIC driver are you using?
> >> >
> >> > --
> >> > John Baldwin
> >>
> >>
> >> e1000.
> >
> > igb?  There are some fixes to handle out of order packets on transmit that
> > could break new connections in some cases.  That is probably worth
> testing.  I
> > think you can just grab the sys/dev/e1000 from 9-stable and drop it into a
> > 9.1 tree to test.
> >
> > --
> Nop, I am using em driver ...

Ok, have you thought about running a tcpdump and then examining the dump 
around the time that a drop occurs?  (Specifically, if you know the connection 
that drops you can get wireshark to only display the packets for that dump).  
Also, have you compared netstat -s before and after drops?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Network connections are lost from time to time

2013-04-19 Thread John Baldwin
On Friday, April 19, 2013 12:32:18 pm C. L. Martinez wrote:
> On Friday, April 19, 2013, John Baldwin  wrote:
> > On Friday, April 19, 2013 3:11:41 am C. L. Martinez wrote:
> >> Hi all,
> >>
> >>  I have a strange problem with my FreeBSD 9.1 (fully patched): I loose
> ssh
> >> sessions from time to time frequently.
> >>
> >>  This fbsd box is installed in an ESXi 5.1 server and I have another
> three
> >> fbsd 9.1 in the same ESXi host that do not have this problem, but maybe
> the
> >> problem is with my sysctl.conf and loader.conf settings:
> >
> > Which NIC driver are you using?
> >
> > --
> > John Baldwin
> 
> 
> e1000.

igb?  There are some fixes to handle out of order packets on transmit that 
could break new connections in some cases.  That is probably worth testing.  I 
think you can just grab the sys/dev/e1000 from 9-stable and drop it into a
9.1 tree to test.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST

2013-04-19 Thread John Baldwin
The following reply was made to PR kern/176446; it has been noted by GNATS.

From: John Baldwin 
To: freebsd-net@freebsd.org
Cc: Jack Vogel ,
 bug-follo...@freebsd.org
Subject: Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving 
out-of-order packet process and spurious RST
Date: Fri, 19 Apr 2013 12:09:11 -0400

 I want to make some progress on this, so let's break this up into smaller 
 parts.
 
 First, I think both calls to rearm_queues() should be removed.  In the case of 
 the local timer, this can only re-enable interrupts if the interrupt handler 
 is already scheduled or running or its associated task is running.  In the 
 last case this means the ithread can run concurrently with the interrupt 
 handler causing out-of-order processing.  The rxeof case has the same issue.  
 Normally the code calling rxeof is going to re-enable the interrupt if rxeof 
 runs to completion, and if not it is going to schedule the taskqueue.  The 
 effect of the rxeof change was to always re-enable interrupts before 
 scheduling the taskqueue which can result in those running concurrently.
 
 Index: /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c
 ===
 --- /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c  (revision 
249553)
 +++ /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c  (working copy)
 @@ -1386,23 +1386,6 @@
}
  }
  
 -static inline void
 -ixgbe_rearm_queues(struct adapter *adapter, u64 queues)
 -{
 -  u32 mask;
 -
 -  if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
 -  mask = (IXGBE_EIMS_RTX_QUEUE & queues);
 -  IXGBE_WRITE_REG(&adapter->hw, IXGBE_EICS, mask);
 -  } else {
 -  mask = (queues & 0x);
 -  IXGBE_WRITE_REG(&adapter->hw, IXGBE_EICS_EX(0), mask);
 -  mask = (queues >> 32);
 -  IXGBE_WRITE_REG(&adapter->hw, IXGBE_EICS_EX(1), mask);
 -  }
 -}
 -
 -
  static void
  ixgbe_handle_que(void *context, int pending)
  {
 @@ -2069,7 +2055,6 @@
  goto watchdog;
  
  out:
 -  ixgbe_rearm_queues(adapter, adapter->que_mask);
callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
return;
  
 @@ -4596,14 +4577,8 @@
  
/*
** We still have cleaning to do?
 -  ** Schedule another interrupt if so.
*/
 -  if ((staterr & IXGBE_RXD_STAT_DD) != 0) {
 -  ixgbe_rearm_queues(adapter, (u64)(1 << que->msix));
 -  return (TRUE);
 -  }
 -
 -  return (FALSE);
 +  return ((staterr & IXGBE_RXD_STAT_DD) != 0);
  }
  
  
 -- 
 John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST

2013-04-19 Thread John Baldwin
The following reply was made to PR kern/176446; it has been noted by GNATS.

From: John Baldwin 
To: freebsd-net@freebsd.org
Cc: Jack Vogel ,
 bug-follo...@freebsd.org,
 Mike Karels 
Subject: Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving 
out-of-order packet process and spurious RST
Date: Fri, 19 Apr 2013 12:27:09 -0400

 A second patch.  This is not something I mentioned before, but I had this in 
 my checkout.  In the legacy IRQ case this could also result in out-of-order 
 processing.  It also fixes a potential OACTIVE-stuck type bug that we used to 
 have in igb.  I have no way to test this, so it would be good if some other 
 folks could test this.
 
 The patch changes ixgbe_txeof() return void and changes the few places that 
 checked its return value to ignore it.  While it is true that ixgbe has a tx 
 processing limit (which I think is dubious.. TX completion processing is very 
 cheap unlike RX processing, so it seems to me like it should always run to 
 completion as in igb), in the common case I think the result will be to do 
 what igb used to do: poll the ring at 100% CPU (either in the interrupt 
 handler or in the task it keeps rescheduling) waiting for pending TX packets 
 to be completed (which is pointless: the host CPU can't make the NIC transmit 
 packets any faster by polling).
 
 It also changes the interrupt handlers to restart packet transmission 
 synchronously rather than always deferring that to a task (the former is what 
 (nearly) all other drivers do).  It also fixes the interrupt handlers to be 
 consistent (one looped on txeof but not the others).  In the case of the
 legacy interrupt handler it is possible it could fail to restart packet
 transmission if there were no pending RX packets after rxeof returned and
 txeof fully cleaned its ring without this change.
 
 It also fixes the legacy interrupt handler to not re-enable the interrupt if 
 it schedules the task but to wait until the task completes (this could result
 in concurrent, out-of-order RX processing).
 
 Index: /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c
 ===
 --- /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c  (revision 
249553)
 +++ /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c  (working copy)
 @@ -149,7 +149,7 @@
  static void ixgbe_enable_intr(struct adapter *);
  static void ixgbe_disable_intr(struct adapter *);
  static void ixgbe_update_stats_counters(struct adapter *);
 -static bool   ixgbe_txeof(struct tx_ring *);
 +static void   ixgbe_txeof(struct tx_ring *);
  static bool   ixgbe_rxeof(struct ix_queue *);
  static void   ixgbe_rx_checksum(u32, struct mbuf *, u32);
  static void ixgbe_set_promisc(struct adapter *);
 @@ -1431,7 +1414,10 @@
}
  
/* Reenable this interrupt */
 -  ixgbe_enable_queue(adapter, que->msix);
 +  if (que->res != NULL)
 +  ixgbe_enable_queue(adapter, que->msix);
 +  else
 +  ixgbe_enable_intr(adapter);
return;
  }
  
 @@ -1449,8 +1435,9 @@
struct adapter  *adapter = que->adapter;
struct ixgbe_hw *hw = &adapter->hw;
struct  tx_ring *txr = adapter->tx_rings;
 -  boolmore_tx, more_rx;
 -  u32 reg_eicr, loop = MAX_LOOP;
 +  struct ifnet*ifp = adapter->ifp;
 +  boolmore;
 +  u32 reg_eicr;
  
  
reg_eicr = IXGBE_READ_REG(hw, IXGBE_EICR);
 @@ -1461,17 +1448,19 @@
return;
}
  
 -  more_rx = ixgbe_rxeof(que);
 +  more = ixgbe_rxeof(que);
  
IXGBE_TX_LOCK(txr);
 -  do {
 -  more_tx = ixgbe_txeof(txr);
 -  } while (loop-- && more_tx);
 +  ixgbe_txeof(txr);
 +#if __FreeBSD_version >= 80
 +  if (!drbr_empty(ifp, txr->br))
 +  ixgbe_mq_start_locked(ifp, txr, NULL);
 +#else
 +  if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd))
 +  ixgbe_start_locked(txr, ifp);
 +#endif
IXGBE_TX_UNLOCK(txr);
  
 -  if (more_rx || more_tx)
 -  taskqueue_enqueue(que->tq, &que->que_task);
 -
/* Check for fan failure */
if ((hw->phy.media_type == ixgbe_media_type_copper) &&
(reg_eicr & IXGBE_EICR_GPI_SDP1)) {
 @@ -1484,7 +1473,10 @@
if (reg_eicr & IXGBE_EICR_LSC)
taskqueue_enqueue(adapter->tq, &adapter->link_task);
  
 -  ixgbe_enable_intr(adapter);
 +  if (more)
 +  taskqueue_enqueue(que->tq, &que->que_task);
 +  else
 +  ixgbe_enable_intr(adapter);
return;
  }
  
 @@ -1501,27 +1493,24 @@
struct adapter  *adapter = que->adapter;
struct tx_ring  *txr = que->txr;
struct rx_ring  *rxr = que->rxr;
 -  boolmore_tx, more_rx;
 +  struct ifnet*ifp = adapter->if

Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST

2013-04-19 Thread John Baldwin
uled to handle it.
-   */
+   ixgbe_txeof(txr);
 #ifdef IXGBE_LEGACY_TX
if (!IFQ_DRV_IS_EMPTY(&adapter->ifp->if_snd))
+   ixgbe_start_locked(txr, ifp);
 #else
-   if (!drbr_empty(adapter->ifp, txr->br))
+   if (!drbr_empty(ifp, txr->br))
+   ixgbe_mq_start_locked(ifp, txr, NULL);
 #endif
-   more_tx = 1;
IXGBE_TX_UNLOCK(txr);
 
/* Do AIM now? */
@@ -1575,7 +1564,7 @@
 rxr->packets = 0;
 
 no_calc:
-   if (more_tx || more_rx)
+   if (more)
taskqueue_enqueue(que->tq, &que->que_task);
else /* Reenable this interrupt */
ixgbe_enable_queue(adapter, que->msix);
@@ -3557,7 +3545,7 @@
  *  tx_buffer is put back on the free queue.
  *
  **/
-static bool
+static void
 ixgbe_txeof(struct tx_ring *txr)
 {
struct adapter  *adapter = txr->adapter;
@@ -3605,13 +3593,13 @@
IXGBE_CORE_UNLOCK(adapter);
IXGBE_TX_LOCK(txr);
}
-   return FALSE;
+   return;
}
 #endif /* DEV_NETMAP */
 
if (txr->tx_avail == txr->num_desc) {
txr->queue_status = IXGBE_QUEUE_IDLE;
-   return FALSE;
+   return;
}
 
/* Get work starting point */
@@ -3705,12 +3693,8 @@
if ((!processed) && ((ticks - txr->watchdog_time) > IXGBE_WATCHDOG))
txr->queue_status = IXGBE_QUEUE_HUNG;
 
-   if (txr->tx_avail == txr->num_desc) {
+   if (txr->tx_avail == txr->num_desc)
    txr->queue_status = IXGBE_QUEUE_IDLE;
-   return (FALSE);
-   }
-
-   return TRUE;
 }
 
 /*


-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: kern/176446: [netinet] [patch] Concurrency in ixgbe driving out-of-order packet process and spurious RST

2013-04-19 Thread John Baldwin
I want to make some progress on this, so let's break this up into smaller 
parts.

First, I think both calls to rearm_queues() should be removed.  In the case of 
the local timer, this can only re-enable interrupts if the interrupt handler 
is already scheduled or running or its associated task is running.  In the 
last case this means the ithread can run concurrently with the interrupt 
handler causing out-of-order processing.  The rxeof case has the same issue.  
Normally the code calling rxeof is going to re-enable the interrupt if rxeof 
runs to completion, and if not it is going to schedule the taskqueue.  The 
effect of the rxeof change was to always re-enable interrupts before 
scheduling the taskqueue which can result in those running concurrently.

Index: /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c
===
--- /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c   (revision 
249553)
+++ /home/jhb/work/freebsd/svn/head/sys/dev/ixgbe/ixgbe.c   (working copy)
@@ -1386,23 +1386,6 @@
}
 }
 
-static inline void
-ixgbe_rearm_queues(struct adapter *adapter, u64 queues)
-{
-   u32 mask;
-
-   if (adapter->hw.mac.type == ixgbe_mac_82598EB) {
-   mask = (IXGBE_EIMS_RTX_QUEUE & queues);
-   IXGBE_WRITE_REG(&adapter->hw, IXGBE_EICS, mask);
-   } else {
-   mask = (queues & 0x);
-   IXGBE_WRITE_REG(&adapter->hw, IXGBE_EICS_EX(0), mask);
-   mask = (queues >> 32);
-   IXGBE_WRITE_REG(&adapter->hw, IXGBE_EICS_EX(1), mask);
-   }
-}
-
-
 static void
 ixgbe_handle_que(void *context, int pending)
 {
@@ -2069,7 +2055,6 @@
 goto watchdog;
 
 out:
-   ixgbe_rearm_queues(adapter, adapter->que_mask);
callout_reset(&adapter->timer, hz, ixgbe_local_timer, adapter);
return;
 
@@ -4596,14 +4577,8 @@
 
/*
** We still have cleaning to do?
-   ** Schedule another interrupt if so.
*/
-   if ((staterr & IXGBE_RXD_STAT_DD) != 0) {
-   ixgbe_rearm_queues(adapter, (u64)(1 << que->msix));
-   return (TRUE);
-   }
-
-   return (FALSE);
+   return ((staterr & IXGBE_RXD_STAT_DD) != 0);
 }
 
 
-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Network connections are lost from time to time

2013-04-19 Thread John Baldwin
On Friday, April 19, 2013 3:11:41 am C. L. Martinez wrote:
> Hi all,
> 
>  I have a strange problem with my FreeBSD 9.1 (fully patched): I loose ssh
> sessions from time to time frequently.
> 
>  This fbsd box is installed in an ESXi 5.1 server and I have another three
> fbsd 9.1 in the same ESXi host that do not have this problem, but maybe the
> problem is with my sysctl.conf and loader.conf settings:

Which NIC driver are you using?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: shm_map questions

2013-04-18 Thread John Baldwin
On Thursday, April 11, 2013 10:58:14 am Laurie Jennings wrote:
> Im working on a simple project that shares a memory segment between a user 
processand a kernel module. I'm having some problems with shm_map and there 
doesn't seem to be much info on it.
> Im not sure what happened to the memory when the user process that creates 
it terminates.  I have some questions.
> 1) Does the kernel mapping keep the segment from being garbage collected 
when the use process that creates it terminated? I've experienced shm_unmap() 
panic when tryingto unmap a segment
> scenario:  
> User process Maps SegmentKernel maps it  with shm_map()User Process 
TerminatesKernel tries to shm_unmap() and it panics.

The kernel mapping bumps the refcount on the underlying vm object, so it will
not go away.  OTOH, you should be keeping your own reference count on the
associated fd so that you can call shm_unmap().  That is, the model should be
something like:

struct mydata *foo;

foo->fp = fget(fd);
shm_map(fp, &foo->p);
/* Don't call fdrop */

and then when unmapping:

struct mydata *foo;

shm_unmap(foo->fp, foo->p);
fdrop(foo->fp);

> 2) Is there a way for the kernel process to know when the user process has 
goneaway? A ref count?

You can install a process_exit EVENTHANDLER if you want to destroy this when a
process goes away.  I have used shm_map/unmap for other objects that already
had a reference count so I did my shm_unmap when that object was destroyed.

> 3) Does a SHM_ANON segment persist as long as the kernel has it mapped, or 
doesit get garbage collected when the creating user process terminates?

It goes away when the backing 'struct file' goes away.  If you follow the 
model above of keeping a reference count on the associated struct file then
it won't go away until you fdrop() after the shm_unmap.

> 4) When using a named segment, can the kernel "reuse" a mapping for a new 
userprocess?
> Example:
> User process creates shm segment with path /fooKernel Maps shm segment with 
shm_map()User process terminates.User process runs again, opening segment /foo
> Does the kernel need to re-map, or is the original mapping valid?

The mapping is not per-process, so if you have mapped a shm for /foo and
mapped it, it will stay mapped until you call shm_unmap.  Multiple processes
can shm_open /foo and mmap it and they will all share the same memory.

You could even share a SHM_ANON fd among multiple processes by passing it
across a UNIX domain socket.

Hope this helps.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Small patch in OFED/sdp

2013-04-03 Thread John Baldwin
On Tuesday, April 02, 2013 2:33:18 pm Vijay Singh wrote:
> Hi, this is based on the the understanding that the SS_NBIO is a
> socket state, and not a state of the socket buffer.

Committed, thanks!

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: KVM with freeBSD and SR-IOV vlan doesn't working!

2013-04-01 Thread John Baldwin
On Wednesday, March 27, 2013 5:31:27 am kindule wrote:
> Recently, we use KVM and SR-IOV in our project. And we choose freeBSD10 as
> the guest os.As we use the ip address in the igb0 of our freeBSD10 guest, it
> working no problem.However when i use vlan in our igb0 of the freeBSD10
> guest,we can see the vlan assigned right and we can ping the vlan address
> without problem.But we add a gateway of the vlan area.we can't connnected to
> the gateway.
> there are some os messages:
> Host: Debian 7.0 and KVM 1.2
> Guest: FreeBSD10-current
> 
> And thanks for your help!

Hmm, does the same vlan setup work on a standalone machine?

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: close(2) while accept(2) is blocked

2013-04-01 Thread John Baldwin
On Thursday, March 28, 2013 12:54:31 pm Andriy Gapon wrote:
> 
> So, this started as a simple question, but the answer was quite unexpected to 
> me.
> 
> Let's say we have an opened and listen-ed socket and let's assume that we know
> that one thread is blocked in accept(2) and another thread is calling 
> close(2).
> What is going to happen?
> 
> Turns out that practically nothing.  For kernel the close call would be 
> almost a nop.
> My understanding is this:
> - when socket is created, its reference count is 1
> - when accept(2) is called, fget in kernel increments the reference count 
> (kept in
> an associated struct file)
> - when close(2) is called, the reference count is decremented
> 
> The reference count is still greater than zero, so fdrop does not call 
> fo_close.
> That means that in the case of a socket soclose is not called.
> 
> I am sure that the reference counting in this case is absolutely correct with
> respect to managing kernel side structures.  But I am not that it is correct 
> with
> respect to hiding the explicit close(2) call from other threads that may be
> waiting on the socket.
> In other words, I am not sure if fo_close is supposed to signify that there 
> are no
> uses of a file, or that userland close-d the file.  Or perhaps these should 
> be two
> different methods.
> 
> Additional note is that shutdown(2) doesn't wake up the thread in accept(2)
> either.  At least that's true for unix domain sockets.
> Not sure if this is a bug.
> 
> But the summary seems to be is that currently it is not possible to break a 
> thread
> out of accept(2) (at least without resorting to signals).

I think you need to split the 'struct file' reference count into two different
counts similar to the how we have vref/vrele vs vhold/vdrop for vnodes.  The
fget for accept and probably most other system calls should probably be 
equivalent
to vhold, whereas things like open/dup (and storing an fd in a cmsg) should be
more like vref.  close() should then be a vrele().

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


  1   2   3   4   >