On Mon, Feb 04, 2019 at 11:03:18AM +0000, Georg Ottinger wrote:
> I don't know how the race condition is triggered in detail.  All I know is 
> that if Touch Acquisition is enabled the adc_demo_error2 will provoke a 
> systemhang. If Touch is disabled the issue goes away. The architecture of 
> at91_adc.c uses (at least) two trigger mechanisms of the ADC peripheral: 
> Touch Trigger and Channel Trigger (started by at91_adc_read_raw() ). In my 
> theory the following happens:

I think you trigger this race condition because when a touchscreen
conversion is done, it performs a conversion on *all active channels*.

Regards

Ludovic

> 
> 1.) at91_adc_read_raw is called, a ADC conversation gets started
> 2.) Touch trigger initiates a new conversation (killing the first 
> conversation)
> 
> .... the conversation started from  at91_adc_read_raw() never finishes -> 
> TIMEOUT!
> 
> The following Email is a summary of our findings (I mailed to the microchip 
> people):
> 
> -----
> 
> I want to let you know our findings of our ADC bughunting marathon. We put 
> around 100 hours of work to identify this issues and I think we found 4  of 
> them (one of them is more hardware related). Although we could develop 
> workarounds for this issues - not all of them are fully understood - so if 
> possible including someone experienced with the ADC peripheral (maybe from 
> the hardware team) should be considered.
> 
> Attached you will find our version of at91_adc.c together with a simple error 
> demonstrator program called adc_demo_error.c
> 
> * 1st issue: Unwanted PEN/NOPEN Interrupts at low temperatures
> 
> When in Pendetect mode we received a lot of unwanted interrupts. This issue 
> only appeared in our test settings when our board approached -20 degrees 
> Celsius. We suspect that the internal Pull-down Resistor connected before the 
> Pendetect-Schmitt trigger (selected by PENDETSENS), has a negative 
> temperature coefficient and at low temperature the impedance might be too 
> high and so false detects are triggered by capturing surrounding noise -> we 
> could solve this issue by changing the value  ts_pen_detect_sensitivity = 2 
> to ts_pen_detect_sensitivity = 3.
> 
> This issue can be considered to be related to our specific configuration, 
> nevertheless we want to mention that the current Datasheet States on Page 
> 1711 states the following:
> 
> "PENDETSENS: Pen Detection Sensitivity-Modifies the pen detection input 
> pull-up resistor value. See the section 'Electrical Characteristics' for 
> further details."
> 
> There are two issues with this description. First: This resistor is a 
> pull-down resistor (measured and logical). Second the 'Electrical 
> Characteristics' sections is lacking the detailed description of this 
> resistor value.
> 
> * 2nd issue: at91_adc_read_raw() keeps Channel and corresponding interrupt 
> enabled in case of timeout.
> 
> Having a brief look at at91_adc_read_raw() it is obvious that in the case of 
> a timeout the setting of AT91_ADC_CHDR and AT91_ADC_IDR registers is omitted.
> If 2 different channels are queried  we can end up with a situation where two 
> Interrupts are enabled - but only one Interrupt is cleared in the Interrupt 
> handler. Resulting in a interrupt loop and a system hang.
> 
> This Issue can be reproduced by compiling and running the error demonstration 
> with two channel support. (-D2ND_CHANNEL)
> 
> $ arm-buildroot-linux-gnueabihf-gcc adc_demo_error.c -D2ND_CHANNEL -o 
> adc_demo_error2
> 
> * 3rd issue: Race condition between PEN/NOPEN Interrupts using periodic 
> trigger and ADC_START in at91_adc_read_raw()
> 
> This issue represents one way how timeout situations are provoked. Actually 
> we consider the architecture of at91_adc.c a bit risky because it is using 
> two trigger sources (periodic trigger for Touch, and ADC_START for reading 
> raw values). We believe that changing the trigger source during an ongoing 
> conversion can have unwanted side effects, like destroying an ongoing 
> conversion. We currently don't fully understand what is going on, but we 
> developed a workaround that is enhancing the timeout handling in 
> at91_adc_read_raw() with a simple retry mechanism. (restarting with ADC_START)
> 
> * 4th issue: when touch operation is enabled (TSMODE) a delay time between a 
> channel disable (AT91_ADC_CHDR) and channel enable (AT91_ADC_CHER) is needed 
> otherwise the following conversion will fail.
> 
> This issue represents the second way how timeout situations can arise. We 
> measured around 40 to 42 ADC Cycles (at different ADC Clocks 500khz, 1mhz, 
> 2mhz, 3mhz, 5mhz) of delay needed between the channel disable and the channel 
> enable. At higher adc clock frequency (10mhz, 20 mhz) the issue doesn't show 
> up (most likely because the function calls are delaying already enough). We 
> also observed that the need delay time increases further according to the 
> transfer time settings (TRANSFER) by 0, 2, 4 or 8  ADC cycles.
> 
> This Issue can be reproduced by compiling and running the error demonstrator 
> (single channel) and including a simple error message in the kernel module in 
> the timeout case.
> 
> $ arm-buildroot-linux-gnueabihf-gcc adc_demo_error.c -o adc_demo_error
> 
> This issue is least understood by us and I think that talking to the hardware 
> people regarding this issue can be worthwhile. 
> As a workaround we put at91_adc_read_raw() to sleep after the writing 
> AT91_ADC_CHDR for 50 to 75 ADC Cycles. 
> 
> ------------------------------------
> 
> We will run another round of tests next week - and after that I would like to 
> prepare a pull request concerning the 2nd Issue. For the 3rd and the 4th 
> Issue I would like to hear your opinion if our modifications are sound before 
> I will write an pull request.
> 
> Thanks for your time,
> Best wishes,
> Georg Ottinger
> 
> -----
> 
> -----Ursprüngliche Nachricht-----
> Von: Jonathan Cameron [mailto:jonathan.came...@huawei.com] 
> Gesendet: Montag, 04. Februar 2019 10:46
> An: Georg Ottinger <g.ottin...@abatec.at>
> Cc: Jonathan Cameron <ji...@kernel.org>; eugen.hris...@microchip.com; Stefan 
> Etzlstorfer <s.etzlstor...@abatec.at>; Hartmut Knaack <knaac...@gmx.de>; 
> Lars-Peter Clausen <l...@metafoo.de>; Peter Meerwald-Stadler 
> <pme...@pmeerw.net>; Nicolas Ferre <nicolas.fe...@microchip.com>; Alexandre 
> Belloni <alexandre.bell...@bootlin.com>; Ludovic Desroches 
> <ludovic.desroc...@microchip.com>; David S. Miller <da...@davemloft.net>; Ard 
> Biesheuvel <ard.biesheu...@linaro.org>; Kees Cook <keesc...@chromium.org>; 
> linux-...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; 
> linux-kernel@vger.kernel.org; Maxime Ripard <maxime.rip...@bootlin.com>
> Betreff: Re: [PATCH] iio: adc: at91: disable adc channel interrupt in timeout 
> case
> 
> On Mon, 4 Feb 2019 07:17:07 +0000
> Georg Ottinger <g.ottin...@abatec.at> wrote:
> 
> > Actually this issue occurred to us with an concrete product, where we 
> > experienced a system hang at -20 °C.
> > It was triggered by a race condition between the Touch Trigger and the 
> > Channel Trigger of the ADC. Once triggered we got in to the situation where 
> > an ongoing Channel Conversion was lost (Timeout case).When we queried a 
> > second channel than we got a system hang. Investigating this issue we 
> > developed an error demonstrator - reading alternating two channels as fast 
> > as possible (when Touch is enabled). This also provokes this issue at room 
> > temperature.
> > 
> > For the error demonstrator use following commandline to compile:
> > 
> > $ arm-buildroot-linux-gnueabihf-gcc adc_demo_error.c -D2ND_CHANNEL -o 
> > adc_demo_error2
> > 
> > -------------
> > // adc_demo_error.c
> > #include <unistd.h>
> > #include <fcntl.h>
> > 
> > #define VLEN 10
> > 
> > int main()
> > {
> >   int fd_adc,fd_adc2;
> >   int ret;
> >   char dummy[VLEN];
> >   
> >   fd_adc = open 
> > ("/sys/devices/platform/ahb/ahb:apb/f8018000.adc/iio:device0/in_voltag
> > e4_raw",O_RDONLY);
> > #ifdef 2ND_CHANNEL
> >   fd_adc2 = open 
> > ("/sys/devices/platform/ahb/ahb:apb/f8018000.adc/iio:device0/in_voltag
> > e5_raw",O_RDONLY);
> > #endif
> > 
> >   while(1) {
> > 
> >     lseek(fd_adc, 0, SEEK_SET);
> >     ret = read(fd_adc, dummy, VLEN);
> > #ifdef 2ND_CHANNEL
> >     lseek(fd_adc2, 0, SEEK_SET);
> >     ret = read(fd_adc2, dummy, VLEN);
> > #endif
> > 
> >   }
> > }
> > 
> > ------------
> > 
> > 
> > Greeting, Georg
> Hi Georg,
> 
> Thanks for the detailed error report and reproducer.
> 
> What has me wondering now is where the race is that is triggering this?
> Could you talk us through it?  I don't know this driver that well so probably 
> much quicker for you to fill in the gaps rather than me try to figure out the 
> race path!
> 
> I have no problem with defense in depth (with appropriate comments) but I'd 
> like to close that down as well.  If it really is unsolvable I'd like at 
> least to have some clear comments in the code explaining why.
> 
> Thanks,
> 
> Jonathan
> 
> > 
> > -----Ursprüngliche Nachricht-----
> > Von: Jonathan Cameron [mailto:ji...@kernel.org]
> > Gesendet: Samstag, 02. Februar 2019 11:21
> > An: Georg Ottinger <g.ottin...@abatec.at>
> > Cc: eugen.hris...@microchip.com; Stefan Etzlstorfer 
> > <s.etzlstor...@abatec.at>; Hartmut Knaack <knaac...@gmx.de>; 
> > Lars-Peter Clausen <l...@metafoo.de>; Peter Meerwald-Stadler 
> > <pme...@pmeerw.net>; Nicolas Ferre <nicolas.fe...@microchip.com>; 
> > Alexandre Belloni <alexandre.bell...@bootlin.com>; Ludovic Desroches 
> > <ludovic.desroc...@microchip.com>; David S. Miller 
> > <da...@davemloft.net>; Ard Biesheuvel <ard.biesheu...@linaro.org>; 
> > Kees Cook <keesc...@chromium.org>; linux-...@vger.kernel.org; 
> > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; 
> > Maxime Ripard <maxime.rip...@bootlin.com>
> > Betreff: Re: [PATCH] iio: adc: at91: disable adc channel interrupt in 
> > timeout case
> > 
> > On Wed, 30 Jan 2019 14:42:02 +0100
> > <g.ottin...@abatec.at> wrote:
> > 
> > > From: Georg Ottinger <g.ottin...@abatec.at>
> > > 
> > > Having a brief look at at91_adc_read_raw() it is obvious that in the 
> > > case of a timeout the setting of AT91_ADC_CHDR and AT91_ADC_IDR 
> > > registers is omitted. If 2 different channels are queried we can end 
> > > up with a situation where two interrupts are enabled, but only one 
> > > interrupt is cleared in the interrupt handler. Resulting in a 
> > > interrupt loop and a system hang.
> > > 
> > > Signed-off-by: Georg Ottinger <g.ottin...@abatec.at>
> > 
> > Whilst I agree this looks like a correct change, I would like Maxime to 
> > take a look as he is obviously much more familiar with the driver than I am.
> > 
> > I suspect you can only actually get there in the event of a hardware 
> > failure as that isn't actually a timeout you should ever see.
> > Could be wrong though!
> > 
> > Thanks,
> > 
> > Jonathan
> > 
> > > ---
> > >  drivers/iio/adc/at91_adc.c | 28 +++++++++++++++++-----------
> > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/iio/adc/at91_adc.c b/drivers/iio/adc/at91_adc.c 
> > > index 75d2f7358..596841a3c 100644
> > > --- a/drivers/iio/adc/at91_adc.c
> > > +++ b/drivers/iio/adc/at91_adc.c
> > > @@ -704,23 +704,29 @@ static int at91_adc_read_raw(struct iio_dev *idev,
> > >           ret = wait_event_interruptible_timeout(st->wq_data_avail,
> > >                                                  st->done,
> > >                                                  msecs_to_jiffies(1000));
> > > -         if (ret == 0)
> > > -                 ret = -ETIMEDOUT;
> > > -         if (ret < 0) {
> > > -                 mutex_unlock(&st->lock);
> > > -                 return ret;
> > > -         }
> > > -
> > > -         *val = st->last_value;
> > >  
> > > +         /* Disable interrupts, regardless if adc conversion was
> > > +          * successful or not
> > > +          */
> > >           at91_adc_writel(st, AT91_ADC_CHDR,
> > >                           AT91_ADC_CH(chan->channel));
> > >           at91_adc_writel(st, AT91_ADC_IDR, BIT(chan->channel));
> > >  
> > > -         st->last_value = 0;
> > > -         st->done = false;
> > > +         if (ret > 0) {
> > > +                 /* a valid conversion took place */
> > > +                 *val = st->last_value;
> > > +                 st->last_value = 0;
> > > +                 st->done = false;
> > > +                 ret = IIO_VAL_INT;
> > > +         } else if (ret == 0) {
> > > +                 /* conversion timeout */
> > > +                 dev_err(&idev->dev, "ADC Channel %d timeout.\n",
> > > +                         chan->channel);
> > > +                 ret = -ETIMEDOUT;
> > > +         }
> > > +
> > >           mutex_unlock(&st->lock);
> > > -         return IIO_VAL_INT;
> > > +         return ret;
> > >  
> > >   case IIO_CHAN_INFO_SCALE:
> > >           *val = st->vref_mv;
> > 
> 
> 

Reply via email to