This looks like excellent analysis so far.

Some questions (pardon me, I haven't followed this closely recently, but
this sort of sensor-system workload analysis is part of what I do...):
what are the specific Ethernet devices and drivers, the Ethernet bus and
interface controllers, and what are the specific storage devices and
drivers, and storage bus controllers and drivers?

I'm wondering if there are polled modes, busy waits, or very small buffers
at work in the device and bus model; that's what those questions are about.

You may get a much more detailed analysis of the workload and its effect on
the software system if the flight computer can be set up to capture an LTTng
trace of itself while it's running. The trace can then be visualized in a
current version of Eclipse using the LTTng viewer plugins.
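
Roughly, a kernel-side capture session looks like the sketch below (the
session name and event list are just placeholders, and this assumes
lttng-tools and the LTTng kernel modules are installed on the FC):

    # assumes lttng-tools and the lttng kernel tracer are on the FC
    lttng create fc-capture
    lttng enable-event --kernel sched_switch,sched_wakeup,block_rq_issue,block_rq_complete
    lttng enable-event --kernel --syscall --all
    lttng start
    # ... run the JGPS/IMU workload ...
    lttng stop
    lttng destroy
    # the trace lands under ~/lttng-traces/, ready for the Eclipse viewer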

A lighter-weight tool that runs almost anywhere and can give you a less
detailed view of this is the venerable oprofile. Or instrument your
application with clock_gettime(CLOCK_MONOTONIC, &ts) and compare start and
end times to see where it may be blocking unexpectedly.
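
Something like this minimal sketch; write_log_record() and the 5 ms
threshold are made-up placeholders for whatever call you suspect is
blocking:

    /* clock_gettime() bracketing sketch; write_log_record() and the
     * 5 ms threshold are placeholders, not names from the PSAS code */
    #include <stdio.h>
    #include <time.h>

    static void write_log_record(void) { /* the call being timed */ }

    static double elapsed_ms(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) * 1e3 + (b->tv_nsec - a->tv_nsec) / 1e6;
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        write_log_record();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = elapsed_ms(&start, &end);
        if (ms > 5.0)   /* arbitrary "this blocked" cutoff */
            fprintf(stderr, "write_log_record blocked for %.3f ms\n", ms);
        return 0;
    }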

In recent experience I was using TCP for moving sensor data over Ethernet
over USB (cdc_ether) with the Nagle algorithm turned off (sockopt
TCP_NODELAY) so that each sensor measurement resulted in data being flushed
to the interface. It turned out the interrupt load was huge on the receiving
end, an Atom PC. Allowing the high-rate messages to be combined and sent by
the Nagle algorithm, while selectively flushing high-priority command and
acknowledge messages, reduced the interrupt load drastically on the
receiving Atom end.
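
A sketch of that scheme (the send_bulk/send_urgent names are illustrative,
not from my actual code): leave Nagle enabled on the bulk path, and briefly
set TCP_NODELAY around urgent sends, since setting the option also pushes
out any data Nagle was holding.

    /* sketch only: coalesce bulk sensor traffic, flush urgent messages */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    static int set_nodelay(int sock, int enable)
    {
        return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                          &enable, sizeof(enable));
    }

    /* high-rate sensor data: plain send(); Nagle coalesces the small writes */
    ssize_t send_bulk(int sock, const void *buf, size_t len)
    {
        return send(sock, buf, len, 0);
    }

    /* commands and acks: toggling TCP_NODELAY forces pending data out now */
    ssize_t send_urgent(int sock, const void *buf, size_t len)
    {
        set_nodelay(sock, 1);
        ssize_t n = send(sock, buf, len, 0);
        set_nodelay(sock, 0);
        return n;
    }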

Another thing that destroyed real-time response was using syslog(3) for
high-rate logging... syslogd couldn't react fast enough and the
/dev/log socket backed up, causing syslog() to block each process, and the
sensor processes then failed to meet deadlines... I replaced the syslog
calls with an in-memory circular buffer of messages and an asynchronous
thread that pushes them out through the potentially blocking syslog call as
soon as possible, dropping messages when the buffer overflows. My
application was real time again.
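
The shape of it was roughly this (sizes and names are illustrative, not the
actual flight code): real-time threads drop messages into a fixed-size ring
and never block in syslog; only the drain thread does.

    /* sketch: non-blocking ring in the RT path, syslog(3) in a
     * background thread; sizes and names are illustrative */
    #include <pthread.h>
    #include <string.h>
    #include <syslog.h>

    #define RING_SLOTS 256
    #define MSG_LEN    128

    static char            ring[RING_SLOTS][MSG_LEN];
    static unsigned        head, tail;   /* head: next write, tail: next read */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    /* called from real-time code: never blocks in syslog, drops when full */
    void rt_log(const char *msg)
    {
        pthread_mutex_lock(&lock);
        if ((head + 1) % RING_SLOTS != tail) {
            strncpy(ring[head], msg, MSG_LEN - 1);
            ring[head][MSG_LEN - 1] = '\0';
            head = (head + 1) % RING_SLOTS;
            pthread_cond_signal(&nonempty);
        }                                /* else: buffer full, message dropped */
        pthread_mutex_unlock(&lock);
    }

    /* background thread: the only place allowed to block in syslog(3) */
    void *log_drain_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            char msg[MSG_LEN];
            pthread_mutex_lock(&lock);
            while (head == tail)
                pthread_cond_wait(&nonempty, &lock);
            memcpy(msg, ring[tail], MSG_LEN);
            tail = (tail + 1) % RING_SLOTS;
            pthread_mutex_unlock(&lock);
            syslog(LOG_INFO, "%s", msg);
        }
        return NULL;
    }

In a hard real-time path you'd probably want a lock-free single-producer
ring instead of a mutex, but the mutex keeps the sketch short.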

Anyway, just more thoughts to think about.

On Wed, Jan 6, 2016, 01:14 Nathan Bergey <nat...@psas.pdx.edu> wrote:

> I've been looking into the partial data loss issue we experienced from
> Launch 12.
>
>
>
> Background
> ----------------
>
> All the sensor data on the rocket are sent via UDP packets to the flight
> computer. Most of the sensors are physically on an Ethernet link from a
> microcontroller somewhere on the rocket.
>
> So that we can know whether or not we missed anything, every packet that
> is sent has a sequence number. If the flight computer receives a packet out
> of order, (e.g. 1,2,3,7) then it records a "Sequence Error" (SEQE) event to
> the log.
>
> While parsing through the flight computer log from Launch 12 we noticed
> that every once in a while (roughly every 20 seconds or so) the FC
> reported a sequence error for almost every single sensor all at once.
>
> This means we didn't receive several milliseconds of data every 20 seconds.
>
> It appears to happen to every sensor at the same time which suggests a
> common point of failure, either the switch, the ethernet cable to the
> flight computer, or some problem with the flight computer itself.
>
>
> Correlation
> ---------------
>
> In an attempt to figure out where the issue is I charted several variables,
> like acceleration, system voltage, etc. with an overlay of the times we
> missed data. The hope is to find something that correlates with the dropped
> packets and points in the direction of a cause.
>
>
> IO Wait
> ----------
>
> The smoking gun turned out to be the amount of time the kernel spent in
> IO Wait. Attached is a chart of the CPU mode share between several types of
> usage: System, User, IO Wait, etc. This data is from the Python system
> monitor code we wrote using the `psutil` Python package.
>
> Docs: http://pythonhosted.org/psutil/#cpu
>
>
> Attached is a chart of the CPU and missing data.
>
>
> This chart spans from approximately L-20 minutes all the way to the time
> the flight computer shut down after landing. Just before the L-8 minute
> mark we turn on JGPS and start slamming the flight computer with data. This
> is when the problems start. The correlation with lost data is very evident.
>
>
> Flush
> -------
>
> Looking at the 'disk write' data, the same pattern is there, so it's clear
> the IO Wait is Linux flushing the data to disk on a regular interval. I
> believe this is expected behavior.
>
>
> Reproduction
> ------------------
>
> Tonight in the lab I was able to recreate the dropped data using the
> following steps:
>
>  1. Using the launch director table bring up the ground systems (LTC,
> TM4K, etc.)
>  2. Turn on the flight computer
>  3. Make sure everything is running normally
>  4. Turn on the IMU
>  5. Turn on JGPS
>  6. Wait at least 40 seconds
>  7. Turn off JGPS and IMU
>  8. Stop the fc process
>  9. Examine the logs
>
>
> There appears to be the same pattern of missed packets.
>
>
> Conclusion
> ---------------
>
> So what happened? The good news is we have a really good idea where to
> look for the answer and a way to reproduce the problem while probing the
> flight computer in the lab!
>
> But, this is as far as I go. I don't know enough kernel stuff to know
> what's next.
>
> My hope is someone can poke at this and make a nice writeup in:
>
> https://github.com/psas/Launch-12/tree/gh-pages/debrief
>
> with the answer to what happened and why.
>
>
> -Nathan
> PSAS
_______________________________________________
psas-avionics mailing list
psas-avionics@lists.psas.pdx.edu
http://lists.psas.pdx.edu/mailman/listinfo/psas-avionics
