driver never fails to load, because I am getting timeout errors. I did not get a crash or panic, but some sort of system lockup (various user-space network daemons stalled the boot process, probably because the interface bring-up was in an error state; I think some netlink socket was hanging). That is when I added the spinlocks, which appear to stop the hang. I think that once I patch the e1000e, we might learn more about why this is happening. Please note that this problem occurs on several of my Intel PC boards and is not limited to a single type of board.

On Tue, Dec 3, 2013 at 1:50 PM, Jeroen Van den Keybus <[email protected]> wrote:

> Just a thought since you mention booting: is it possible your driver is
> sometimes simply loaded before the master is (and fails to register) ? You
> mention that you crashed upon boot without the spinlocks and the only way
> to do that should be that you run as a regular netdev device (line 4170)
> incl. irqs. Could also explain why the e1000 has the problem.
>
> I suspect that adding the link status check merely causes an extra delay
> which could lead to the master being loaded earlier.
>
> J.
>
>
> 2013/12/3 Raz <[email protected]>
>
>> All i am doing is more of a trial and error. I do not know the realtek
>> driver at all.
>> The spinlock are needed because they are protected in the original driver
>> code flow . i had a boot lockup in one of my trials without them. This
>> patch does not eliminate the problem entirely, but from 10 trials with 6
>> drives with a 100% failures to 1 out of 10 I believe it important enough to
>> mail to the community. as for e1000e i do not know what the problem is, i
>> need to check it and email you.
>>
>>
>> On Tue, Dec 3, 2013 at 1:16 PM, Jeroen Van den Keybus <
>> [email protected]> wrote:
>>
>>> Why the spinlock ? This driver instance shouldn't ever be reentering.
>>>
>>> I'm a bit worried that it would complicate the use of e.g. RTAI and
>>> Xenomai.
>>>
>>> How comes the e1000 has the same issue ?
>>>
>>> J.
>>>
>>>
>>> 2013/12/3 Raz <[email protected]>
>>>
>>>> The bellow patch seemed to eliminate the problem. I believe the problem
>>>> relates to resetting some registers when link up is detected.
>>>>
>>>> diff --git a/local_src/r8169-3.2/r8169.c b/local_src/r8169-3.2/r8169.c
>>>> index 6df1793..a483fb5 100644
>>>> --- a/local_src/r8169-3.2/r8169.c
>>>> +++ b/local_src/r8169-3.2/r8169.c
>>>> @@ -1290,6 +1290,9 @@ static void __rtl8169_check_link_status(struct
>>>> net_device *dev,
>>>>
>>>>         if (tp->ecdev) {
>>>>                 ecdev_set_link(tp->ecdev, tp->link_ok(ioaddr) ? 1 : 0);
>>>> +               spin_lock_irqsave(&tp->lock, flags);
>>>> +               rtl_link_chg_patch(tp);
>>>> +               spin_unlock_irqrestore(&tp->lock, flags);
>>>>                 return;
>>>>         }
>>>>
>>>>
>>>> On Tue, Dec 3, 2013 at 11:56 AM, Jeroen Van den Keybus <
>>>> [email protected]> wrote:
>>>>
>>>>> Perhaps try hooking up a normal eth interface to the drive and see
>>>>> what the autoneg comes up with using ethtool. In the past, I have had
>>>>> trouble interfacing an FPGA IP core to a PC Ethernet card when the core
>>>>> was
>>>>> hard wired to 100M FD instead of advertising this using autoneg. The PC
>>>>> card tried to autoneg and then fell back to 100M HD.
>>>>>
>>>>> You could try testing with an EK1100 in between the PC and the drive.
>>>>>
>>>>> J.
>>>>>
>>>>>
>>>>> 2013/12/3 Raz <[email protected]>
>>>>>
>>>>>> I do not have ethtool over the ethercat device as it is removed. How
>>>>>> can I tell ? eth0 is 100Mbps but it is my public interface. eth1 is my
>>>>>> ethercat interface.
>>>>>>
>>>>>> There is always a link. the first slave is a drive, not an io device
>>>>>> . This drive is running xilinix with port stack and ip core of beckhof.
>>>>>> I am trying to debug now the realtek driver, let see...
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 3, 2013 at 11:36 AM, Jeroen Van den Keybus <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> It would be very useful to know whether e.g. the interfaces ended up
>>>>>>> in 100M half duplex or so. Is there a link in those cases ? What's the
>>>>>>> first EtherCAT station ? Maybe it doesn't handle autoneg properly during
>>>>>>> its reset phase ?
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>> 2013/12/3 Raz <[email protected]>
>>>>>>>
>>>>>>>> hey
>>>>>>>> Problem happens with intel e1000e as well as realtek. One way to
>>>>>>>> bypass it is to boot the master while the ethernet-ethercat cable is
>>>>>>>> disconnected, and once master claims the interface , connect this
>>>>>>>> cable.
>>>>>>>> This appears to work.
>>>>>>>> So , There some sort of of initialisation error.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Dec 2, 2013 at 11:32 AM, Raz <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I still do not have a scenario. it "sometimes" happens. The
>>>>>>>>> -DRTL8169_DEBUG is something i did not know, so i will check and see.
>>>>>>>>> thx
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Dec 2, 2013 at 11:27 AM, Jeroen Van den Keybus <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Is there a difference between cold and warm boot ? Does unloading
>>>>>>>>>> the ec driver, loading/unloading the stock r8169 driver and then
>>>>>>>>>> reloading
>>>>>>>>>> the ec driver work better ? Same scenario but with Realtek drivers
>>>>>>>>>> (r8168)
>>>>>>>>>> ? Also perhaps compile with -DRTL8169_DEBUG ?
>>>>>>>>>>
>>>>>>>>>> Just some thoughts.
>>>>>>>>>>
>>>>>>>>>> J.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2013/12/2 Raz <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> The timeouts happens after the system boots and not while slaves
>>>>>>>>>>> are in in OP mode. So my transmit is irrelevant here, even though a
>>>>>>>>>>> transmit happens only from a single thread of through an ioctl (
>>>>>>>>>>> SDO reads
>>>>>>>>>>> and so on..)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 2, 2013 at 11:01 AM, Jeroen Van den Keybus <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 1. why do you disable the rtl8169_phy_timer timer ?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The rtl8169_phy_timer is regularly polled in ec_poll instead.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. In rtl_hw_start_8168 : why do disable RTL_W16(IntrMask,
>>>>>>>>>>>>> tp->intr_event); ?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The drivers are all non-blocking and interrupt-free. All work
>>>>>>>>>>>> that interrupt handlers normally do is done in ec_poll instead.
>>>>>>>>>>>>
>>>>>>>>>>>> If you cannot send packets anymore, I suspect that you may have
>>>>>>>>>>>> overrun the tx queue, i.e. sent a packet before the previous one
>>>>>>>>>>>> has been
>>>>>>>>>>>> completed. You're also not calling the ethercat transmission
>>>>>>>>>>>> functions from
>>>>>>>>>>>> different threads, right ?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> thank you
>>>>>>>>>>>>> raz
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> https://sites.google.com/site/ironspeedlinux/
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> etherlab-users mailing list
>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>> http://lists.etherlab.org/mailman/listinfo/etherlab-users
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> https://sites.google.com/site/ironspeedlinux/
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> https://sites.google.com/site/ironspeedlinux/
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> https://sites.google.com/site/ironspeedlinux/
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> https://sites.google.com/site/ironspeedlinux/
>>>>>>
>>>>>
>>>>
>>>> --
>>>> https://sites.google.com/site/ironspeedlinux/
>>>>
>>>
>>
>> --
>> https://sites.google.com/site/ironspeedlinux/
>>
>

--
https://sites.google.com/site/ironspeedlinux/
