In addition to the storage device, bus, and interface controller type, and
whether the device has internal buffering, lazy initialization, or anything
of that sort, Bart reminds me that I forgot to ask what sort of filesystem
is being used for logging. ext4? flashfs? etc.

Is the logging done inline in the application, or is there some kind of
memory buffering being used? (I would suggest that a separate process, or
at least a thread, handle logging, so the more real-time parts of the
application don't block on logging writes.)
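A minimal sketch of that suggestion in Python (the class and sink names here are mine, not from any PSAS code): the real-time caller only enqueues and never blocks; a background thread absorbs the potentially slow writes.

```python
import queue
import threading

class AsyncLogger:
    """Sketch of a non-blocking logger: the real-time thread only
    enqueues; a background thread does the (possibly slow) write."""

    def __init__(self, sink, maxsize=1024):
        self._q = queue.Queue(maxsize=maxsize)
        self._sink = sink  # any callable that persists one message
        t = threading.Thread(target=self._drain, daemon=True)
        t.start()

    def log(self, msg):
        try:
            self._q.put_nowait(msg)   # never block the caller
            return True
        except queue.Full:
            return False              # drop rather than stall

    def _drain(self):
        while True:
            self._sink(self._q.get())  # blocking writes happen here
```

Dropping on overflow rather than blocking is the key design choice: a full queue means storage has fallen behind, and stalling the sensor path would only make that worse.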

Also, what mount options are being given? I would pay particular attention
to atime/relatime, the write-gathering (commit) timeout, and whether
discard/TRIM is being used. Is a swap partition being used at all? (I would
avoid activating a swap device on an embedded real-time system.)
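One way to audit this is to read the options actually in effect from /proc/mounts rather than trusting fstab. A rough sketch (the partition layout and function names are illustrative):

```python
def mount_options(mtab="/proc/mounts"):
    """Map each mountpoint to (fstype, option list). The default
    path is Linux-specific; pass another mtab-format file to test."""
    opts = {}
    with open(mtab) as f:
        for line in f:
            dev, mountpoint, fstype, options = line.split()[:4]
            opts[mountpoint] = (fstype, options.split(","))
    return opts

def atime_warnings(mtab="/proc/mounts"):
    """Flag mounts that update atime on every read; noatime (or at
    least relatime) reduces write traffic and is kinder to flash."""
    return [mp for mp, (_, options) in mount_options(mtab).items()
            if "noatime" not in options and "relatime" not in options]
```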

Is the logging/writing done to the rootfs? I'd probably ensure logging is
on a separate partition from the rootfs, ideally even a separate device,
but that's probably overkill.

The goal of all this is to avoid global kernel scheduling stalls during
device writes, and to ensure that if some part of the application does have
to block for a long time on writes to storage, the blocking is limited to
just that process, so that the rest of the system (incoming sensor input
over Ethernet, the kernel, and the application) can still run and be
scheduled.


On Wed, Jan 6, 2016 at 9:26 AM Bart Massey <> wrote:

> One possibility that occurs to me: flash writes are actually quite slow
> unless there are erased blocks to write to. You may need to ensure that the
> unused portion of the flash drive has been erased before launch. I'm not
> exactly sure how best to do that, but I'm pretty sure there are ways.
> On Wed, Jan 6, 2016 at 7:43 AM James Perkins <> wrote:
>> This looks like excellent analysis so far.
>> Some questions (pardon me, I haven't followed this closely recently, but
>> this sort of sensor-system workload analysis is part of what I do):
>> What are the specific Ethernet devices and drivers, the Ethernet bus and
>> interface controllers, and what are the specific storage system devices and
>> drivers, and storage system bus controllers and drivers?
>> I'm wondering if there are polled modes, busy waits, or very small
>> buffers at work in the device and bus model; that's what those questions
>> are about.
>> You may get a much more detailed analysis of the workload and its effect
>> on the software system if it can be set up to self-capture an LTTng trace
>> while it's running. The trace can be visualized in a current version of
>> Eclipse using the LTTng viewer plugins.
>> A lighter-weight tool that runs almost anywhere and can give you a less
>> detailed view of this is the venerable oprofile. Or instrument your
>> application with clock_gettime(CLOCK_MONOTONIC, &ts) and compare start and
>> end times to see where it may be blocking unexpectedly.
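James's clock_gettime suggestion, sketched in Python (the helper name and the 5 ms budget are made up for illustration; Python's `time.clock_gettime` wraps the same syscall):

```python
import time

def timed(fn, *args, budget_s=0.005):
    """Time a call against a deadline using the monotonic clock,
    as suggested above (clock_gettime(CLOCK_MONOTONIC, ...))."""
    t0 = time.clock_gettime(time.CLOCK_MONOTONIC)
    result = fn(*args)
    dt = time.clock_gettime(time.CLOCK_MONOTONIC) - t0
    if dt > budget_s:
        # Flag calls that blocked longer than the assumed budget.
        print(f"{fn.__name__} blocked for {dt * 1000:.1f} ms")
    return result, dt
```

CLOCK_MONOTONIC matters here: the wall clock (CLOCK_REALTIME) can step under NTP or GPS corrections and produce bogus intervals.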
>> In recent experience I was using TCP for moving sensor data over Ethernet
>> over USB (cdc_ether) with the Nagle algorithm turned off (sockopt
>> TCP_NODELAY) so that each sensor measurement resulted in data being
>> flushed to the interface. It turned out the interrupt load was huge on the
>> receiving end, an Atom PC. Allowing the high-rate messages to combine and
>> be sent using the Nagle algorithm, while selectively flushing high-priority
>> command and acknowledge messages, reduced the interrupt load drastically
>> on the receiving Atom end.
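The selective-flush idea could look roughly like this (the function names are mine; TCP_NODELAY is the real socket option, per tcp(7)):

```python
import socket

def send_bulk(sock, payload):
    """High-rate sensor data: let Nagle coalesce small writes."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
    sock.sendall(payload)

def send_urgent(sock, payload):
    """Command/acknowledge traffic: disable Nagle so this segment
    goes out immediately instead of waiting to be coalesced."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.sendall(payload)
```

Toggling the option per message is a blunt sketch of the technique; the point is that coalescing is a per-connection policy you can change at runtime, not a one-time choice.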
>> Another thing that destroyed real-time response was using syslog(3) for
>> high rate logging... The syslogd couldn't react fast enough and the
>> /dev/log pipe backed up, causing syslog to block each process, and sensor
>> processes then failed to meet deadlines... I replaced syslog calls with an
>> in memory circular buffer of messages and an asynchronous thread to push
>> them out to a potentially blocking syslog call as soon as possible,
>> dropping messages when the buffer overflowed. My application was real time
>> again.
>> Anyway, just more thoughts to think about.
>> On Wed, Jan 6, 2016, 01:14 Nathan Bergey <> wrote:
>>> I've been looking into the partial data loss issue we experienced from
>>> Launch 12.
>>> Background
>>> ----------------
>>> All the sensor data on the rocket is sent via UDP packets to the flight
>>> computer. Most of the sensors are physically on an Ethernet link from a
>>> microcontroller somewhere on the rocket.
>>> So that we can know whether or not we missed anything, every packet that
>>> is sent has a sequence number. If the flight computer receives a packet out
>>> of order, (e.g. 1,2,3,7) then it records a "Sequence Error" (SEQE) event to
>>> the log.
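A sketch of that gap check (the function and framing here are illustrative, not the actual FC packet format or SEQE logging code):

```python
def sequence_errors(seqs):
    """Count gaps in a stream of per-sensor sequence numbers,
    mirroring how the FC records a SEQE event when a packet
    arrives out of order."""
    errors = 0
    last = None
    for s in seqs:
        if last is not None and s != last + 1:
            errors += 1   # e.g. 1,2,3,7 -> one SEQE covering 4-6
        last = s
    return errors
```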
>>> While parsing the flight computer log from Launch 12, we noticed that
>>> every once in a while (roughly every 20 seconds or so) the FC reported a
>>> sequence error for almost every single sensor all at once. This means we
>>> lost several milliseconds of data every 20 seconds.
>>> It appears to happen to every sensor at the same time, which suggests a
>>> common point of failure: either the switch, the Ethernet cable to the
>>> flight computer, or some problem with the flight computer itself.
>>> Correlation
>>> ---------------
>>> In an attempt to figure out where the issue is, I charted several
>>> variables, like acceleration, system voltage, etc., with an overlay of
>>> the times we missed data. The hope is to find something that correlates
>>> with the dropped packets and points in the direction of a cause.
>>> IO Wait
>>> ----------
>>> The smoking gun turned out to be the amount of time the kernel spent in
>>> IO Wait. Attached is a chart of CPU mode share between several types of
>>> usage: System, User, IO Wait, etc. This data is from the Python system
>>> monitor code we wrote using the `psutil` Python package.
>>> Docs:
>>> Attached is a chart of the CPU and missing data.
>>> This chart spans from approximately L-20 minutes all the way to the time
>>> the flight computer shut down after landing. Just before the L-8 minute
>>> mark we turn on JGPS and start slamming the flight computer with data. This
>>> is when the problems start. The correlation with lost data is very evident.
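For reference, psutil derives the system monitor's numbers from /proc/stat; a stdlib-only sketch of the same iowait measurement (field order per proc(5), first eight counters only):

```python
def cpu_fields(stat_line):
    """Parse the aggregate 'cpu' line from /proc/stat into named
    jiffy counters (this is the raw data psutil reads)."""
    parts = stat_line.split()
    names = ("user", "nice", "system", "idle", "iowait",
             "irq", "softirq", "steal")
    return dict(zip(names, map(int, parts[1:9])))

def iowait_share(stat_line):
    """Fraction of CPU time spent in IO Wait since boot; for a
    live chart you would poll /proc/stat and difference samples."""
    f = cpu_fields(stat_line)
    return f["iowait"] / sum(f.values())
```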
>>> Flush
>>> -------
>>> Looking at the 'disk write' data, the same pattern is there, so it's
>>> clear the IO Wait is Linux flushing the data to disk on a regular
>>> interval. I believe this is expected behavior.
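If it is periodic writeback, the cadence is governed by the vm.dirty_* sysctls. A sketch for dumping them (common defaults like dirty_writeback_centisecs=500, i.e. 5 s, vary by kernel, so the ~20 s burst period would need to be checked against the FC's actual values):

```python
import pathlib

def writeback_tunables(base="/proc/sys/vm"):
    """Read the kernel writeback knobs (dirty_writeback_centisecs,
    dirty_expire_centisecs, dirty_ratio, ...) that control how
    often accumulated dirty pages get flushed to storage."""
    out = {}
    for p in pathlib.Path(base).glob("dirty_*"):
        text = p.read_text().strip()
        if text.isdigit():        # skip multi-valued entries
            out[p.name] = int(text)
    return out
```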
>>> Reproduction
>>> ------------------
>>> Tonight in the lab I was able to recreate the dropped data using the
>>> following steps:
>>>  1. Using the launch director table bring up the ground systems (LTC,
>>> TM4K, etc.)
>>>  2. Turn on the flight computer
>>>  3. Make sure everything is running normally
>>>  4. Turn on the IMU
>>>  5. Turn on JGPS
>>>  6. Wait at least 40 seconds
>>>  7. Turn off JGPS and IMU
>>>  8. Stop the fc process
>>>  9. Examine the logs
>>> There appears to be the same pattern of missed packets.
>>> Conclusion
>>> ---------------
>>> So what happened? The good news is we have a really good idea where to
>>> look for the answer and a way to reproduce the problem while probing the
>>> flight computer in the lab!
>>> But, this is as far as I go. I don't know enough kernel stuff to know
>>> what's next.
>>> My hope is someone can poke at this and make a nice writeup in:
>>> with the answer to what happened and why.
>>> -Nathan
>>> PSAS
>>> _______________________________________________
>>> psas-avionics mailing list
