Re: [etherlab-dev] Missing Vendor ID / Product Code

2019-06-10 Thread Graeme Foot
Hi,

Unfortunately "0008-fsm_sii-loading-check.patch" (below) didn't fix my main
problem.  It turns out to be an inherent problem with how the master's external
datagram ring works.  I have attached a patch that plugs the hole causing the
problem I was having, but there may be other cases where issues could occur.

Patch: 
/features/parallel-slave/0009-ec_master_exec_slave_fsms-external-datagram-fix.patch


The guts of the problem:

ec_master_exec_slave_fsms() calls ec_master_get_external_datagram() to get a
datagram from the external datagram ring.  The datagram is then passed to
ec_fsm_slave_exec() for each slave that has some work to do.  This call
returns either 1 (FSM still in progress) or 0 (FSM complete).  The master
assumes that if the FSM is still in progress then the datagram has been
consumed and is in use, but there are various cases where this is not true.  If
any of these cases occur, then in the first loop of ec_master_exec_slave_fsms()
these slaves' FSMs may be executed multiple times while another slave's FSM is
waiting for its datagram to return.

If too many slaves are processed, or too many cycles occur, during this time
then the waiting slave's datagram either gets its state set to
EC_DATAGRAM_INVALID or gets reused by another slave.  This can lead to
"cancelled" datagram replies, or to both slaves receiving the results of the
second slave's datagram (as the first datagram's index will be replaced and its
reply lost).
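
The recycling described above can be illustrated with a small stand-alone
model.  This is only a sketch: RING_SIZE, dg_t, ring_t and ring_offer() are
hypothetical names invented for the illustration and are not the EtherLab
master's data structures.  It assumes a ring whose index is advanced on every
offer because the FSM reported "in progress".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the EtherLab master's types. */
#define RING_SIZE 4

typedef enum { DG_INIT, DG_SENT, DG_INVALID } dg_state_t;

typedef struct {
    dg_state_t state;
    int owner;          /* slave that last touched the slot, -1 if none */
} dg_t;

typedef struct {
    dg_t slots[RING_SIZE];
    size_t index;       /* next slot to hand out */
} ring_t;

void ring_init(ring_t *ring)
{
    size_t i;

    assert(ring != NULL);
    for (i = 0; i < RING_SIZE; i++) {
        ring->slots[i].state = DG_INIT;
        ring->slots[i].owner = -1;
    }
    ring->index = 0;
}

/* Offer the next slot to a slave FSM that reports "in progress".
 * 'uses' says whether the FSM really queued the datagram.
 * Returns 1 if the slot handed out was still waiting for a reply,
 * i.e. an in-flight datagram has just been clobbered. */
int ring_offer(ring_t *ring, int slave, int uses)
{
    dg_t *dg;
    int clobbered;

    assert(ring != NULL);
    dg = &ring->slots[ring->index];
    clobbered = (dg->state == DG_SENT);

    dg->owner = slave;
    dg->state = uses ? DG_SENT : DG_INVALID;

    /* Assumed unpatched rule: "in progress" is treated as "consumed",
     * so the index always advances. */
    ring->index = (ring->index + 1) % RING_SIZE;
    return clobbered;
}
```

In this model, one slave that really sends a datagram followed by RING_SIZE
offers to a slave that never uses its datagram is enough to wrap the ring and
overwrite the slot that was still waiting for its reply.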


In my case this was occurring due to using the "0001-load-sii-from-file.patch"
patch.  During the SII config stage of a slave this patch creates a kthread
to attempt to read the SII file from disk.  In the meantime
ec_fsm_slave_exec() keeps returning 1 (FSM in progress) but does not use the
datagrams presented to it (it sets the datagram state to
EC_DATAGRAM_INVALID).

During initial startup and configuration of the master the
ec_master_exec_slave_fsms() call is made from ec_master_idle_thread() in a loop
with (in my configuration) a call to schedule() before resuming the loop.  This
means that multiple loops may occur before a reply to a slave's datagram
returns, leaving plenty of time for the in-use datagrams to be recycled,
resulting in their state or data being overwritten.


The patch I have attached now also tests the datagram's state for
EC_DATAGRAM_INVALID before incrementing the external datagram ring index.  This
solves my problem where the datagram's state is being set to
EC_DATAGRAM_INVALID while waiting for the kthread to complete.
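
The idea of the check can be sketched in isolation as follows.  This is not
the patch itself: fsm_step(), ring_offer_patched() and the types are
hypothetical stand-ins, and the only behaviour taken from the description
above is "do not advance the ring index if the datagram was left invalid".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the EtherLab master's types. */
#define RING_SIZE 4

typedef enum { DG_INIT, DG_SENT, DG_INVALID } dg_state_t;

typedef struct {
    dg_state_t state;
    int owner;          /* slave that queued the datagram, -1 if none */
} dg_t;

typedef struct {
    dg_t slots[RING_SIZE];
    size_t index;       /* next slot to hand out */
} ring_t;

void ring_init(ring_t *ring)
{
    size_t i;

    assert(ring != NULL);
    for (i = 0; i < RING_SIZE; i++) {
        ring->slots[i].state = DG_INIT;
        ring->slots[i].owner = -1;
    }
    ring->index = 0;
}

/* Stand-in for one slave FSM step.  It always reports "in progress" (1),
 * but only queues the datagram when 'uses' is non-zero; otherwise it
 * marks the datagram invalid, as described for the waiting kthread. */
int fsm_step(dg_t *dg, int slave, int uses)
{
    if (uses) {
        dg->owner = slave;
        dg->state = DG_SENT;
    } else {
        dg->state = DG_INVALID;
    }
    return 1;
}

/* Patched rule: the slot only counts as consumed if the FSM is in
 * progress AND the datagram was not left invalid.
 * Returns 1 if the ring index advanced, 0 if the slot stays available. */
int ring_offer_patched(ring_t *ring, int slave, int uses)
{
    dg_t *dg;

    assert(ring != NULL);
    dg = &ring->slots[ring->index];

    if (fsm_step(dg, slave, uses) && dg->state != DG_INVALID) {
        ring->index = (ring->index + 1) % RING_SIZE;
        return 1;
    }
    return 0;
}
```

With this rule, a slave that keeps declining its datagram is offered the same
free slot each time, so the index never wraps around onto a slot that is still
waiting for a reply.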

I suspect there may be other instances where this problem could occur.  One
case I have thought of, but haven't been able to confirm, is when multiple
protocols try to access a slave's mailbox at the same time (e.g. CoE, EoE, FoE
etc).  Only one protocol is allowed to communicate at a time.  The other
protocols will be offered a datagram from the ring, but they aren't able to use
it until their turn comes up.  In these cases, if ec_read_mbox_locked() fails,
the datagram state is also set to EC_DATAGRAM_INVALID, so the patch should also
cover this case.


Regards,
Graeme.


From: etherlab-dev On Behalf Of Graeme Foot
Sent: Monday, 4 March 2019 2:36 PM
To: etherlab-dev@etherlab.org
Subject: Re: [etherlab-dev] Missing Vendor ID / Product Code

Hi,

I think I've finally solved the problem.  The slaves with the issue are
returning with the "EEPROM not loaded" bit set when reading the SII information
(bit 12 of the EEPROM status word).  If this bit is set then the slave has not
yet finished reading the SII information from the EEPROM and the data returned
may not be valid.  The master code was not checking for this bit.  I have
attached a patch to do so:
/features/parallel-slave/0008-fsm_sii-loading-check.patch

The patch checks if the bit is set and keeps re-reading the EEPROM data until
it is not.  At this point the data returned is still incorrect, so a complete
read is requested (where a write is first sent asking the slave to load the
data that needs to be read).  There is a 500 ms timeout waiting for the bit to
clear.  If the bit does not clear then the EEPROM load may have failed (e.g.
incorrect CRC value).
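
As a rough sketch of the polling decision described above (not the code from
the patch), the check could look like this.  The names sii_poll_t,
sii_poll_status() and the constants are hypothetical; only the bit position
(bit 12) and the 500 ms timeout come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 12 of the EEPROM status word: "EEPROM not loaded". */
#define SII_STATUS_NOT_LOADED  (1u << 12)
/* Timeout for the bit to clear, in milliseconds. */
#define SII_LOAD_TIMEOUT_MS    500u

typedef enum {
    SII_LOAD_WAIT,      /* bit still set: re-read the status */
    SII_LOAD_DONE,      /* bit clear: request a complete read */
    SII_LOAD_TIMEOUT    /* bit never cleared: load may have failed */
} sii_load_t;

typedef struct {
    uint32_t start_ms;  /* time at which polling began */
} sii_poll_t;

/* Decide what to do after one read of the EEPROM status word. */
sii_load_t sii_poll_status(const sii_poll_t *poll, uint16_t status,
        uint32_t now_ms)
{
    assert(poll != NULL);

    if (!(status & SII_STATUS_NOT_LOADED))
        return SII_LOAD_DONE;

    /* Unsigned subtraction stays correct if the ms counter wraps. */
    if ((uint32_t)(now_ms - poll->start_ms) >= SII_LOAD_TIMEOUT_MS)
        return SII_LOAD_TIMEOUT;

    return SII_LOAD_WAIT;
}
```

A caller would record start_ms when it first sees the bit set, then call
sii_poll_status() after each status read and either re-read, issue the
complete read, or report an error.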


My previous patch (features/sii-read-failure/0001-sii-read-retry.patch) should 
no longer be required, but it may help to make reading of the SII data more 
robust.  I've attached the latest version of this one also.  It is now:
features/sii-read-failure/0001-slave-scan-retry.patch


Regards,
Graeme Foot.


From: etherlab-dev <etherlab-dev-boun...@etherlab.org>
On Behalf Of Graeme Foot
Sent: Friday, 12 October 2018 11:38 AM
To: Gavin Lambert <gavin.lamb...@tomra.com>;
etherlab-dev@etherlab.org
Subject: Re: [etherlab-dev] Missing Vendor ID / Product Code

2018-10-11 Thread Graeme Foot
Hi,

I've had a chance to play with my testrig and have managed to consistently 
reproduce the problem when hot-plugging a module (I haven't had the problem 
again on a production machine from a normal startup that I can test on yet).

My system:
- CX2020
- EK1110 (alias 10001)
- EK1100 (alias 2)
- EL1008 (alias 1)

I start the system without the EL1008 plugged in and get it running.  I then 
plug in the EL1008 and the SII information fails to read, resulting in a zero 
alias, vendorID and product code etc.  I have attached a patch which resolves 
the issue on my testrig (but I don't know if it will resolve my production 
issue).

With --enable-sii-override set, the patch detects a zero vendorID or product
code (hopefully no one has a device with a zero product code) and then retries
scanning the slave after a 100 ms timeout.  If --enable-sii-override is not set
then it will do the retry if reading the SII size fails.
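
The retry decision described above can be summarised in a few lines.  This is
only an illustration of the rule, not the patch: scan_result_t,
scan_needs_retry() and SCAN_RETRY_DELAY_MS are hypothetical names, and the
sii_override_enabled argument stands in for the --enable-sii-override
configure option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Delay before the slave is scanned again, in milliseconds. */
#define SCAN_RETRY_DELAY_MS 100u

/* Hypothetical summary of what one slave scan produced. */
typedef struct {
    uint32_t vendor_id;
    uint32_t product_code;
    int sii_size_read_ok;   /* non-zero if reading the SII size succeeded */
} scan_result_t;

/* Returns non-zero if the slave should be scanned again. */
int scan_needs_retry(const scan_result_t *res, int sii_override_enabled)
{
    assert(res != NULL);

    if (sii_override_enabled) {
        /* The SII read does not report a failure in this configuration,
         * so a zero identity is used as the failure indicator. */
        return res->vendor_id == 0 || res->product_code == 0;
    }

    /* Without the override, a failed SII size read triggers the retry. */
    return !res->sii_size_read_ok;
}
```

A caller that gets a non-zero result would wait SCAN_RETRY_DELAY_MS and then
restart the scan of that slave.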

One strange thing I found during testing was that:
- When --enable-sii-override is not set ec_fsm_sii_success() would fail during 
the ec_fsm_slave_scan_state_sii_size() state; but
- When --enable-sii-override is set ec_fsm_sii_success() does not fail during 
the ec_fsm_slave_scan_state_sii_device() state, so I instead check for a zero 
vendorID or product code.

Gavin, the patch is against your previous patchset.  I put the patch under:
features/sii-read-failure/0001-sii-read-retry.patch

Let me know if you think there's anything dodgy with it.


Graeme.


From: Gavin Lambert
Sent: Wednesday, 8 August 2018 3:13 PM
To: Graeme Foot; etherlab-us...@etherlab.org
Subject: RE: Missing Vendor ID / Product Code

There are lots of things that can cause that.  Most often, I’ve seen this when
packets get lost or corrupted, so the initial discovery datagrams get lost or
fail.  Usually bad wiring or shielding is the culprit.

I think it might be possible to get something similar due to an unfortunate 
timing coincidence – if the devices are being connected “live” then a dodgy 
plug-in could make the device visible in the initial device count scan, but 
then disconnected before it finishes the identity discovery, but then 
reconnected again before it does the next device count scan (so it doesn’t try 
again).  Replugging the devices (with less unfortunate timing) or restarting 
the etherlab service should both recover from that case, however.

Or, of course, you might have found a bug. 

It's hard to say for sure what actually happened without seeing syslogs and/or 
reproducing it.

From: Graeme Foot
Sent: Wednesday, 8 August 2018 14:12
To: etherlab-us...@etherlab.org
Subject: [etherlab-users] Missing Vendor ID / Product Code

Hi,

I updated my EtherCAT system to use Gavin's patch set (revision 10, 20171108).  
It has been running fine on a few machines, but I have just had a machine being
commissioned where one of the slave modules had a zero Vendor ID and Product 
Code (and I suspect it failed to read any information from the slave).  
Unfortunately it occurred while I was not available so our engineers reverted 
to the previous version (which detected the module correctly) and shipped the 
machine, so I have very minimal information and no logs.

The module with the problem was the 17th module, the first EL2612 of 5.  It is 
directly after an EL9410 power module.  It has an explicit alias set.  The 
engineers had tried repowering the whole system and replacing the module.

Until I get a machine to test on with the same behaviour I was wondering if 
anyone else has had problems with slaves not initialising correctly.

Thanks,
Graeme Foot.


Attachment: 0001-sii-read-retry.patch