> >
> > > >
> > > > Hi Yongxin,
> > > >
> > > > >
> > > > > Hi experts,
> > > > >
> > > > > Could anyone tell me whether Linux nvdimm driver is RT compatible?
> > > > >
> > > > > When I was testing PMEM performance with fio using the following
> > > > > command,
> > > > > I
> > > > > got calltrace below.
> > > > >
> > > > > # fio -filename=/dev/pmem0s -direct=1 -iodepth 1 -thread -rw=randrw
> > > > > -rwmixread=70 -ioengine=psync -bs=16k -size=1G -numjobs=30
> > > > > -runtime=100
> > > > > -group_reporting -name=mytest
> > > > >
> > > > >     BUG: scheduling while atomic: fio/2514/0x00000002
> > > > >     Modules linked in: intel_rapl nd_pmem nd_btt skx_edac iTCO_wdt
> > > > >     iTCO_vendor_support intel_powerclamp coretemp
> > > > >     crct10dif_pclmul crct10dif_common aesni_intel aes_x86_64
> > > > >     crypto_simd
> > > > >     cryptd i40e nvme glue_helper nvme_core lpc_ich i2c_i801 nfit
> > > > >     pcc_cpufreq wmi libnvdimm acpi_pad
> > > > >     acpi_power_meter
> > > > >     Preemption disabled at:
> > > > >     [<ffffffffc03608d9>] nd_region_acquire_lane+0x19/0x80 [libnvdimm]
> > > > >     CPU: 44 PID: 2514 Comm: fio Tainted: G        W
> > > > >     4.18.20-rt8-preempt-rt #1
> > > > >     Call Trace:
> > > > >      dump_stack+0x4f/0x6a
> > > > >      ? nd_region_acquire_lane+0x19/0x80 [libnvdimm]
> > > > >      __schedule_bug.cold.17+0x38/0x55
> > > > >      __schedule+0x484/0x6c0
> > > > >      ? _raw_spin_lock+0x17/0x40
> > > > >      schedule+0x3d/0xe0
> > > > >      rt_spin_lock_slowlock_locked+0x118/0x2a0
> > > > >      rt_spin_lock_slowlock+0x57/0x90
> > > > >      rt_spin_lock+0x52/0x60
> > > > >      btt_write_pg.isra.16+0x280/0x4b0 [nd_btt]
> > > > >      btt_make_request+0x1b1/0x320 [nd_btt]
> > > > >      generic_make_request+0x1dc/0x3f0
> > > > >      submit_bio+0x49/0x140
> > > > >
> > > > > nd_region_acquire_lane() disables preemption with get_cpu() which
> > > > > causes
> > > > > "scheduling while atomic" spews on RT.
> > > > >
> > > > > Is it safe to replace get_cpu()/put_cpu() with
> > > > > get_cpu_light()/put_cpu_light() in
> > > > > nd_region_acquire_lane()/nd_region_release_lane()?
> > > >
> > > > get_cpu disables preemption on a core, while get_cpu_light just prevents
> > > > task migration across cores. If we change this to get/put_cpu_light,
> > > > I think there is a possibility of a race here.
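> > > >
> > > > Roughly (the RT definition is from memory, so only a sketch):
> > > >
> > > >     /* mainline: pin the task by disabling preemption entirely */
> > > >     #define get_cpu()        ({ preempt_disable(); smp_processor_id(); })
> > > >     #define put_cpu()        preempt_enable()
> > > >
> > > >     /* RT patchset: only forbid migration, the task stays preemptible */
> > > >     #define get_cpu_light()  ({ migrate_disable(); smp_processor_id(); })
> > > >     #define put_cpu_light()  migrate_enable()
> > > >
> > > > So with get_cpu_light() another task on the same cpu can still preempt
> > > > us and enter the same code, which is where the race question comes from.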
> > > >
> > > > > After this replacement, the code protected by
> > > > > nd_region_acquire_lane()/nd_region_release_lane() would become
> > > > > preemptible. So is this code reentrant?
> > > >
> > > > Another thing I see is that when get_cpu is called in
> > > > "nd_region_acquire_lane" there is no corresponding call to "put_cpu" in
> > > > the same function; it is called only in "nd_region_release_lane". This
> > > > means preemption stays disabled for the whole span between these calls.
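> > > >
> > > > For reference, the pairing as seen from the btt side looks roughly like
> > > > this (a simplified sketch, not the exact btt_write_pg() code):
> > > >
> > > >     lane = nd_region_acquire_lane(btt->nd_region);  /* get_cpu() */
> > > >     /* ... block write, map update ... */
> > > >     nd_region_release_lane(btt->nd_region, lane);   /* final put_cpu() */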
> > > >
> > > > Here, we cannot change preemption-disable to migration-disable for
> > > > mainline, as there is code which can race.
> > > >
> > > > IIUC an RT-only patch would have the same problem: the spin_lock is
> > > > taken only under a condition, and the path before/without the spinlock
> > > > may also race. We could use something like local_lock_* =>
> > > > local_lock_cpu, as sketched below.
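> > > >
> > > > Just to illustrate the idea (names from the RT locallock API as I
> > > > remember it, so take this as a sketch only):
> > > >
> > > >     static DEFINE_LOCAL_IRQ_LOCK(lane_lock);  /* per-cpu, RT-aware */
> > > >
> > > >     cpu = local_lock_cpu(lane_lock);  /* migrate-disable + per-cpu lock */
> > > >     /* ... lane selection / bookkeeping ... */
> > > >     local_unlock_cpu(lane_lock);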
> > > >
> > > > These are my two cents.
> > >
> > > The btt implementation assumes that the btt code will not be
> > > re-entered on the same cpu. So the pre-emption disable is required for
> > > performance, but for correctness the implementation can fall back to
> > > using a lock along with migration being disabled. In other words I
> > > think changing nd_region_acquire_lane() like the following would be
> > > sufficient for RT kernels:
> >
> > o.k
> > Not sure if it's a good idea to have different implementations for
> > mainline and the RT kernel, if at all it's possible to have a single
> > implementation. But I don't understand all of the optimizations in the
> > original code, so this might be okay.
> >
> 
> I agree, I was just clarifying the implementation; I would otherwise
> not recommend shipping different driver implementations.
> Unless / until get_cpu_light() is available in mainline I think this
> patch should be considered reference only.
> 
> > >
> > > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> > > index e2818f94f292..e720fc42bbc0 100644
> > > --- a/drivers/nvdimm/region_devs.c
> > > +++ b/drivers/nvdimm/region_devs.c
> > > @@ -947,18 +947,15 @@ int nd_blk_region_init(struct nd_region *nd_region)
> > >  unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
> > >  {
> > >         unsigned int cpu, lane;
> > > +       struct nd_percpu_lane *ndl_lock, *ndl_count;
> > >
> > > -       cpu = get_cpu();
> > > -       if (nd_region->num_lanes < nr_cpu_ids) {
> > > -               struct nd_percpu_lane *ndl_lock, *ndl_count;
> > > +       cpu = get_cpu_light();
> > >
> > > -               lane = cpu % nd_region->num_lanes;
> > > -               ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> > > -               ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> > > -               if (ndl_count->count++ == 0)
> > > -                       spin_lock(&ndl_lock->lock);
> > > -       } else
> > > -               lane = cpu;
> > > +       lane = cpu % nd_region->num_lanes;
> > > +       ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> > > +       ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> > > +       if (ndl_count->count++ == 0)
> > > +               spin_lock(&ndl_lock->lock);
> >
> > I could not understand why there is a spinlock here under a condition.
> > Are two tasks sharing the same lane on the same cpu, with the count
> > being incremented by the first consumer? The other task won't wait/spin
> > as the count is already elevated?
> 
> When the number of cpus exceeds the number of lanes, or if RT makes
> this routine re-entrant on the same cpu, then the lock is needed to
> ensure there is a single user of the lane. It is allowed for a cpu to
> acquire the same lane twice, hence the ndl_count.
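> 
> i.e. the count is just per-cpu nesting bookkeeping around the per-lane
> lock, as in the hunk above:
> 
>     /* acquire: only the first user on this cpu takes the lane lock */
>     if (ndl_count->count++ == 0)
>             spin_lock(&ndl_lock->lock);
>     ...
>     /* release: only the last user on this cpu drops it */
>     if (--ndl_count->count == 0)
>             spin_unlock(&ndl_lock->lock);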

o.k. Thanks for explaining. I just have a small comment:

For non-RT, as preemption is disabled on a CPU, the only concern is reentrant
code. This is true because there is no chance that any other task (on the same
CPU) would execute the critical section.

For RT, preemption is enabled, so unless something is explicitly done in the
design, some other task on the same CPU can simultaneously execute the
critical section. If there is any possibility of this, then we are not
protected here.
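
Concretely, the case I am worried about on RT, with tasks A and B landing on
the same cpu (and so the same ndl_count):

    /* A: ndl_count->count++ == 0 -> takes ndl_lock->lock, enters the lane */
    /* A is preempted by B on the same cpu ...                             */
    /* B: ndl_count->count++ == 1 -> skips the lock and enters the same    */
    /*    lane while A is still in the middle of it                        */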

Or I am missing something here.

Thanks,
Pankaj

> 
> > Or am I misunderstanding it?
> >
> > Thanks,
> > Pankaj
> >
> >
> > >
> > >         return lane;
> > >  }
> > > @@ -966,17 +963,14 @@ EXPORT_SYMBOL(nd_region_acquire_lane);
> > >
> > >  void nd_region_release_lane(struct nd_region *nd_region, unsigned int
> > >  lane)
> > >  {
> > > -       if (nd_region->num_lanes < nr_cpu_ids) {
> > > -               unsigned int cpu = get_cpu();
> > > -               struct nd_percpu_lane *ndl_lock, *ndl_count;
> > > -
> > > -               ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> > > -               ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> > > -               if (--ndl_count->count == 0)
> > > -                       spin_unlock(&ndl_lock->lock);
> > > -               put_cpu();
> > > -       }
> > > -       put_cpu();
> > > +       unsigned int cpu = get_cpu();
> >
> > Should this be get_cpu_light() here?
> 
> Yes, good catch.
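> 
> i.e. presumably (only a sketch, the rest of that hunk is cut off above):
> 
>        unsigned int cpu = get_cpu_light();
> 
> with a matching put_cpu_light() presumably at the end of the function.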
> 