(apologies if it happens to get posted it multiple times, an earlier version of 
this message was seemingly cancelled by me while waiting for moderation)


Hi everyone,


For a few years now I have been using the Etherlab master stack with Gavin 
Lambert's patchset to drive embedded systems for my research. It's an amazing 
piece of software and I am truly impressed by this work.  For my applications I 
have a thin C++ wrapper that allows me to manage the ECRT code through classes 
and reuse it depending on the project. I coded the wrapper myself. I'm not 
fully sure if it is correct, but it has been working fine so far...


However, recently, I have installed the stack on a new computer with Ubuntu 
18.04, Kernel 4.19.177-0419177-lowlatency. I applied all the patches from GL 
and use ec_e1000e driver with a supposedly compatible Intel Card (Model 
EXPI9402PT). I use the stack to drive Technosoft Motion iPos4808 BX-CAT drives.


Once I run my code for the first time, everything seems to work fine until the 
application exits. Then one of three things can happen:


Either:

  1.  I can run the application again from start to the end, no issue.
  2.  The application stops working during either ecrt_master_activate(master) 
or ecrt_request_master(0). Neither of these functions returns and the code just 
hangs there.
  3.  The communication works fine for some time, then starts acting up. It 
stops working entirely during the operation (suddenly the data I'm receiving 
ends up corrupted) or while exiting (the code hangs, I think also on 
ecrt_master_deactivate(master) or ecrt_release_master(0)).

When the application malfunctions, I can only force quit it. If I do, I have to 
reboot the computer, as, whatever ethercat-related I do next, I receive the 
error that the module ec_e1000 is still in use.

Dmesg from a normal run:

[ 1166.127739] EtherCAT: Requesting master 0...
[ 1166.127743] EtherCAT: Successfully requested master 0.
[ 1166.127980] EtherCAT 0: Domain0: Logical address 0x00000000, 21 byte, 
expected working counter 3.
[ 1166.127982] EtherCAT 0:   Datagram domain0-0-main: Logical offset 
0x00000000, 21 byte, type LRW at 0000000098dbfebf.
[ 1166.128029] EtherCAT 0: Master thread exited.
[ 1166.128030] EtherCAT 0: Stopping EoE thread.
[ 1166.128052] EtherCAT 0: EoE thread exited.
[ 1166.128053] EtherCAT 0: Starting EoE thread.
[ 1166.128382] EtherCAT 0: Starting EtherCAT-OP thread.
[ 1166.131218] EtherCAT WARNING 0: No app_time received up to now, abort DC 
time offset calculation.
[ 1166.176215] EtherCAT 0: Slave states on main device: INIT.
[ 1166.200217] EtherCAT 0: Slave states on main device: PREOP.
[ 1166.443626] EtherCAT 0: Domain 0: Working counter changed to 2/3.
[ 1166.456205] EtherCAT 0: Slave states on main device: SAFEOP.
[ 1166.464205] EtherCAT 0: Slave states on main device: OP.
[ 1167.128196] EtherCAT WARNING 0: 2 datagrams UNMATCHED!
[ 1167.131459] EtherCAT WARNING: Datagram 0000000098dbfebf (domain0-0-main) was 
SKIPPED 2 times.
[ 1167.447503] EtherCAT 0: Domain 0: 3 working counter changes - now 3/3.
[ 1170.264093] EtherCAT 0: Domain 0: Working counter changed to 0/3.
(This continues until the end)
[ 1193.785359] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 1193.894348] EtherCAT 0: Slave states on main device: SAFEOP + ERROR.
[ 1194.793518] EtherCAT 0: Master thread exited.
[ 1194.793520] EtherCAT 0: Stopping EoE thread.
[ 1194.793538] EtherCAT 0: EoE thread exited.
[ 1194.793547] EtherCAT 0: Starting EoE thread.
[ 1194.793674] EtherCAT 0: Starting EtherCAT-IDLE thread.
[ 1194.793856] EtherCAT 0: 0 slave(s) responding on main device.
[ 1194.795324] EtherCAT ERROR 0-main-0: AL status message 0x001B: "Sync manager 
watchdog".
[ 1194.797319] EtherCAT 0: 1 slave(s) responding on main device.
[ 1194.797321] EtherCAT 0: Slave states on main device: SAFEOP + ERROR.
[ 1194.799335] EtherCAT 0-main-0: Acknowledged state SAFEOP.
[ 1194.801321] EtherCAT 0: Slave states on main device: SAFEOP.
[ 1194.805320] EtherCAT 0: Slave states on main device: INIT.
[ 1194.817333] EtherCAT 0: Slave states on main device: PREOP.
[ 1195.281308] EtherCAT WARNING: Datagram 00000000f3c9ebae (master-fsm) was 
SKIPPED 1 time.
[ 1195.793926] EtherCAT 0: Releasing master...
[ 1195.793929] EtherCAT 0: Released.
[ 1195.797305] EtherCAT 0: Stopping EoE thread.
[ 1195.797324] EtherCAT 0: EoE thread exited.
[ 1195.810428] EtherCAT 0: Scanning bus.
[ 1196.038438] EtherCAT 0: eoe0s0 MAC address derived from NIC part of ecm0 MAC 
address
[ 1196.038615] EtherCAT 0-main-0: Linked to EoE handler eoe0s0
[ 1196.038620] EtherCAT 0: Bus scanning completed in 241 ms.
[ 1196.038622] EtherCAT 0: Using slave main-0 as DC reference clock.
[ 1196.038623] EtherCAT 0: Starting EoE thread.
[ 1196.049270] EtherCAT 0: Slave states on main device: INIT.
[ 1196.050455] IPv6: ADDRCONF(NETDEV_UP): eoe0s0: link is not ready
[ 1196.061266] EtherCAT 0: Slave states on main device: PREOP.

Dmesg from a "normal" run that precedes to crash:

[ 3289.782513] EtherCAT: Requesting master 0...
[ 3289.969220] EtherCAT 0: eoe0s0 MAC address derived from NIC part of ecm0 MAC 
address
[ 3289.970078] EtherCAT 0-main-0: Linked to EoE handler eoe0s0
[ 3289.970087] EtherCAT 0: Bus scanning completed in 229 ms.
[ 3289.970089] EtherCAT 0: Using slave main-0 as DC reference clock.
[ 3289.970091] EtherCAT 0: Starting EoE thread.
[ 3289.970100] EtherCAT: Successfully requested master 0.
[ 3289.970321] EtherCAT 0: Domain0: Logical address 0x00000000, 21 byte, 
expected working counter 3.
[ 3289.970323] EtherCAT 0:   Datagram domain0-0-main: Logical offset 
0x00000000, 21 byte, type LRW at 00000000e95eb248.
[ 3289.970341] EtherCAT 0: Master thread exited.
[ 3289.970342] EtherCAT 0: Stopping EoE thread.
[ 3289.970349] EtherCAT 0: EoE thread exited.
[ 3289.970350] EtherCAT 0: Starting EoE thread.
[ 3289.970386] EtherCAT 0: Starting EtherCAT-OP thread.
[ 3289.977769] EtherCAT WARNING 0: No app_time received up to now, abort DC 
time offset calculation.
[ 3289.985470] IPv6: ADDRCONF(NETDEV_UP): eoe0s0: link is not ready
[ 3289.985571] IPv6: ADDRCONF(NETDEV_UP): eoe0s0: link is not ready
[ 3289.991646] IPv6: ADDRCONF(NETDEV_UP): eoe0s0: link is not ready
[ 3289.995472] IPv6: ADDRCONF(NETDEV_UP): eoe0s0: link is not ready
[ 3290.019763] EtherCAT 0: Slave states on main device: INIT.
[ 3290.043756] EtherCAT 0: Slave states on main device: PREOP.
[ 3290.270841] EtherCAT 0: Domain 0: Working counter changed to 3/3.
[ 3290.279764] EtherCAT 0: Slave states on main device: SAFEOP.
[ 3290.287760] EtherCAT 0: Slave states on main device: OP.
[ 3290.651875] ec_e1000e: eno1 NIC Link is Up 1000 Mbps Full Duplex, Flow 
Control: None
[ 3290.651920] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link becomes ready
[ 3290.969766] EtherCAT WARNING 0: 5 datagrams UNMATCHED!
[ 3290.974674] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 5 times.
[ 3290.976777] IPv6: ADDRCONF(NETDEV_CHANGE): eoe0s0: link becomes ready
[ 3291.274830] EtherCAT 0: Domain 0: 12 working counter changes - now 3/3.
[ 3291.972739] EtherCAT WARNING 0: 5 datagrams UNMATCHED!
[ 3291.974880] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 5 times.
[ 3292.094807] EtherCAT WARNING 0-main-0: No sending response for eoe0s0 after 
100 tries.
[ 3292.278983] EtherCAT 0: Domain 0: 2 working counter changes - now 3/3.
[ 3292.990788] EtherCAT WARNING 0-main-0: No sending response for eoe0s0 after 
100 tries.
[ 3294.103893] EtherCAT WARNING 0-main-0: No sending response for eoe0s0 after 
100 tries.
[ 3294.887689] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 3294.973683] EtherCAT WARNING 0: 2 datagrams UNMATCHED!
[ 3294.984490] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 2 times.
[ 3295.891819] EtherCAT 0: Domain 0: 15 working counter changes - now 3/3.
[ 3295.973667] EtherCAT WARNING 0: 5 datagrams UNMATCHED!
[ 3295.987844] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 5 times.
[ 3298.943805] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 3299.944793] EtherCAT 0: Domain 0: 13 working counter changes - now 3/3.
[ 3299.976592] EtherCAT WARNING 0: 6 datagrams UNMATCHED!
[ 3300.000785] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 6 times.
[ 3300.538790] EtherCAT WARNING 0-main-0: No sending response for eoe0s0 after 
100 tries.
[ 3303.009524] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 3303.980507] EtherCAT WARNING 0: 9 datagrams UNMATCHED!
[ 3304.013729] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 9 times.
[ 3304.013732] EtherCAT 0: Domain 0: 17 working counter changes - now 3/3.
[ 3304.981498] EtherCAT WARNING 0: 1 datagram UNMATCHED!
[ 3305.017954] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 1 time.
[ 3305.017956] EtherCAT 0: Domain 0: 2 working counter changes - now 3/3.
[ 3306.726450] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 3306.981453] EtherCAT WARNING 0: 6 datagrams UNMATCHED!
[ 3307.022444] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 7 times.
[ 3307.730694] EtherCAT 0: Domain 0: 19 working counter changes - now 3/3.
[ 3307.982442] EtherCAT WARNING 0: 4 datagrams UNMATCHED!
[ 3308.026763] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 3 times.
[ 3310.479387] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 3310.983378] EtherCAT WARNING 0: 5 datagrams UNMATCHED!
[ 3311.035400] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 5 times.
[ 3311.141385] EtherCAT WARNING 0-main-0: Failed to receive mbox check datagram 
for eoe0s0.
[ 3311.274500] EtherCAT WARNING 0-main-0: No sending response for eoe0s0 after 
100 tries.
[ 3311.483392] EtherCAT 0: Domain 0: 11 working counter changes - now 3/3.
[ 3314.532303] EtherCAT 0: Domain 0: Working counter changed to 0/3.
[ 3314.986298] EtherCAT WARNING 0: 3 datagrams UNMATCHED!
[ 3315.048313] EtherCAT WARNING: Datagram 00000000e95eb248 (domain0-0-main) was 
SKIPPED 5 times.
[ 3315.536416] EtherCAT 0: Domain 0: 13 working counter changes - now 3/3.
[ 3315.989282] EtherCAT WARNING 0: 2 datagrams UNMATCHED!
[ 3317.715249] EtherCAT 0: Slave states on main device: SAFEOP + ERROR.
[ 3318.613273] EtherCAT 0: Master thread exited.
[ 3318.613275] EtherCAT 0: Stopping EoE thread.
[ 3318.613343] EtherCAT 0: EoE thread exited.
[ 3318.613353] EtherCAT 0: Starting EoE thread.
[ 3318.613500] EtherCAT 0: Starting EtherCAT-IDLE thread.
[ 3318.613614] EtherCAT 0: 0 slave(s) responding on main device.
[ 3318.616245] EtherCAT ERROR 0-main-0: AL status message 0x001B: "Sync manager 
watchdog".
[ 3318.618242] EtherCAT 0: 1 slave(s) responding on main device.
[ 3318.618245] EtherCAT 0: Slave states on main device: SAFEOP + ERROR.
[ 3318.620256] EtherCAT 0-main-0: Acknowledged state SAFEOP.
[ 3318.622244] EtherCAT 0: Slave states on main device: SAFEOP.
[ 3318.627235] EtherCAT 0: Slave states on main device: INIT.
[ 3318.639238] EtherCAT 0: Slave states on main device: PREOP.
[ 3318.783218] EtherCAT WARNING: Datagram 0000000019626fd7 (master-fsm) was 
SKIPPED 1 time.
[ 3319.613733] EtherCAT 0: Releasing master...
[ 3319.613737] EtherCAT 0: Released.
[ 3319.615206] EtherCAT 0: Stopping EoE thread.
(Here Dmesg stream breaks, so maybe it's something related to the EoE thread?)


Dmesg of a run that hangs forever:
[ 3363.180585] EtherCAT: Requesting master 0...
[ 3365.930011] EtherCAT 0: Waiting for slave scan interrupted by signal.
[ 3365.930023] EtherCAT ERROR 0: Failed to enter OPERATION phase!
[ 3370.924180] ec_e1000e: enp3s0f1 NIC Link is Down
[ 3370.942351] EtherCAT 0: Stopping EoE thread.
(Again, the last command is "Stopping EoE thread", but the application hangs 
forever at that point.)


The way I start and end the Ethercat master operation is equivalent to what I 
found in the examples, but I distributed and encapsulated it in a few classes 
for reuse.
The code I use to initialize the master:

int EthercatBus::begin_operation()
{
    if(!network_operating)
    {
    //ECRT Related

    if(sc.size())
    {
    mass_configure_DOs(); // This function configures SDOs/PDOs
    }
    ecrt_master_activate(master);
    datagram_begin = ecrt_domain_data(domain1);

    //Initialize the EtherCAT thread

    
ethercat_thread=make_shared<boost::thread>(&EthercatBus::function_ethercat_thread,this);
 //Cyclic task starts here
    network_operating = 1;
    cout << "Ethercat Network Operational"<< endl;


    if(!slaves_configured && sc.size())
    {
    switch_on_slaves(); //This function pushes each slave through 
slave-specific initialization using PDOs + Controlword
    slaves_configured = 1;
    }
    else if(!sc.size())
    {
    slaves_configured = 1;
    }
    }

    cout << "Slaves On"<< endl;
}



The code which I use to deinitialize and release the master is here:

int EthercatBus::end_operation()
{
    if(network_operating)
    {

        cout << "Shutting down network" << endl;

        for (vector<shared_ptr<EthercatSlave>>::iterator iter=sc.begin(); iter 
!= sc.end(); iter++)
        {

            int locA = (*iter)->alias;
            int locPos = (*iter)->position;
            cout << "Shutting down slave : " << locA << "/" << locPos << endl;

            (*iter)->switch_off(); //Slave specific deinitialization procedure. 
Cyclic tasks still run at this point.

            cout << "Slave : " << locA << "/" << locPos << " Shut Down" << endl;
        }

        usleep (1e6);


        ethercat_thread->interrupt(); // Cyclic task stopped
        ethercat_thread->join();
        ethercat_thread.reset();

        usleep (1e6);


        cout << "Thread Returned" << endl;
        ecrt_master_deactivate(master);
        cout << "EthercatBus - Master Deactivated" << endl;
        usleep (1e6);
        ecrt_release_master(master);


        cout << "EthercatBus - Master Released" << endl;

        for (vector<shared_ptr<EthercatSlave>>::iterator iter=sc.begin(); iter 
!= sc.end();)
        {
        sc.erase(iter);
        }

        cout << "Ethercat Network shut down" << endl;
        network_operating=0;
    }
}

The most non-standard thing in my code, which may be the source of an error, is 
definitely the use of a PI controller for timing inside a cyclic task:

  while(!boost::this_thread::interruption_requested()) //Cycle until the thread 
is interrupted.
      {

          cycle_frames_receive();

          t2=t1; //Store the beginning time of the previous iteration.
          t1 = high_resolution_clock::now();
          tdiff=duration_cast<microseconds>(t1-t2).count();

          local_counter++;

          if(local_counter==frequency) //Once every second
          {

              local_counter=0;
          }


          cycle_frames_send();

          //PI control over the frame period. Robust up to 10 [kHz]

          t_error_P=(1e6/frequency - tdiff); // Proportional Error
          t_error_I= t_error_I + t_error_P; //Integral Error
          msec = msec + gain_P*(t_error_P) + gain_I*t_error_I //; Closing the 
control;

          if(first_iter)
          {
              msec=1e6/frequency;
              first_iter=false;
          }

          if (msec>0){usleep(msec);} //Saturation, as you cannot wait less than 
0 microseconds.

        }

instead of a regular timer, but this way has always provided me with a more 
exact and stable performance.

If full source code is required, I can supply it.

The entire stack so far worked on 5+ different computers. Similar issues were 
happening in the past sometimes, but like once every 1000 runs and always while 
exiting the application (especially in a forceful manner, never during the 
cyclic communication). On this particular machine the master usually gets 
messed up after 2/3 runs.

Besides iPOS drives, I work also with Copley Controls Xenus drives. I have 
never had any issues with those. Applications with them run smoothly (so maybe 
it's the problem with Technosof slaves?).

I thought it is a problem of the ec_e1000e driver, but the issue persists when 
I use ec_generic driver.

Has anyone ever encountered a similar problem? I have zero knowledge of how the 
ecrt stack is coded internally. If anyone could tell me what I am doing wrong, 
or at least point me towards any way, in which I could establish that, I would 
be extremely obliged.

Yours faithfully,
Dr. Jakub Sikorski

Surgical Robotics Laboratory
Horstring W-130
University of Twente
7500AE Enschede
The Netherlands
j.sikor...@utwente.nl







-- 
Etherlab-users mailing list
Etherlab-users@etherlab.org
https://lists.etherlab.org/mailman/listinfo/etherlab-users

Reply via email to