> >
> > Hi Yongxin,
> >
> > >
> > > Hi experts,
> > >
> > > Could anyone tell me whether Linux nvdimm driver is RT compatible?
> > >
> > > When I was testing PMEM performance with fio using the following
> > > command, I got the call trace below.
> > >
> > > # fio -filename=/dev/pmem0s -direct=1 -iodepth 1 -thread -rw=randrw
> > > -rwmixread=70 -ioengine=psync -bs=16k -size=1G -numjobs=30 -runtime=100
> > > -group_reporting -name=mytest
> > >
> > >     BUG: scheduling while atomic: fio/2514/0x00000002
> > >     Modules linked in: intel_rapl nd_pmem nd_btt skx_edac iTCO_wdt
> > >     iTCO_vendor_support intel_powerclamp coretemp
> > >     crct10dif_pclmul crct10dif_common aesni_intel aes_x86_64 crypto_simd
> > >     cryptd i40e nvme glue_helper nvme_core lpc_ich i2c_i801 nfit
> > >     pcc_cpufreq wmi libnvdimm acpi_pad
> > >     acpi_power_meter
> > >     Preemption disabled at:
> > >     [<ffffffffc03608d9>] nd_region_acquire_lane+0x19/0x80 [libnvdimm]
> > >     CPU: 44 PID: 2514 Comm: fio Tainted: G        W
> > >     4.18.20-rt8-preempt-rt #1
> > >     Call Trace:
> > >      dump_stack+0x4f/0x6a
> > >      ? nd_region_acquire_lane+0x19/0x80 [libnvdimm]
> > >      __schedule_bug.cold.17+0x38/0x55
> > >      __schedule+0x484/0x6c0
> > >      ? _raw_spin_lock+0x17/0x40
> > >      schedule+0x3d/0xe0
> > >      rt_spin_lock_slowlock_locked+0x118/0x2a0
> > >      rt_spin_lock_slowlock+0x57/0x90
> > >      rt_spin_lock+0x52/0x60
> > >      btt_write_pg.isra.16+0x280/0x4b0 [nd_btt]
> > >      btt_make_request+0x1b1/0x320 [nd_btt]
> > >      generic_make_request+0x1dc/0x3f0
> > >      submit_bio+0x49/0x140
> > >
> > > nd_region_acquire_lane() disables preemption with get_cpu() which causes
> > > "scheduling while atomic" spews on RT.
> > >
> > > Is it safe to replace get_cpu()/put_cpu() with
> > > get_cpu_light()/put_cpu_light() in
> > > nd_region_acquire_lane()/nd_region_release_lane()?
> >
> > get_cpu() disables preemption on a core, while get_cpu_light() only
> > prevents task migration across cores. If we change this to
> > get/put_cpu_light() I think there is a possibility of a race here.
> >
> > > After this replacement, the code protected by
> > > nd_region_acquire_lane()/nd_region_release_lane() would become
> > > preemptible. So is this code reentrant?
> >
> > Another thing I see is that when get_cpu() is called in
> > nd_region_acquire_lane() there is no corresponding put_cpu() in the
> > same function; it is called only in nd_region_release_lane(). That
> > means preemption stays disabled for the whole span between the two
> > calls.
> >
> > Here we cannot change the preemption disable to a migration disable
> > for mainline, as there is code which can race.
> >
> > IIUC, an RT-only patchset would have the same problem, as the
> > spin_lock is taken only on a condition and the path before/without
> > the spinlock may also race. We could have something like
> > local_lock_* => local_lock_cpu.
> >
> > These are my two cents.
> 
> The btt implementation assumes that the btt code will not be
> re-entered on the same cpu. So the pre-emption disable is required for
> performance, but for correctness the implementation can fall back to
> using a lock along with migration being disabled. In other words I
> think changing nd_region_acquire_lane() like the following would be
> sufficient for RT kernels:

o.k
I am not sure it's a good idea to have different implementations for the
mainline and RT kernels, if at all it is possible to have a single
implementation. But I don't understand all of the optimizations in the
original code, so this might be okay.

> 
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index e2818f94f292..e720fc42bbc0 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -947,18 +947,15 @@ int nd_blk_region_init(struct nd_region *nd_region)
>  unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
>  {
>         unsigned int cpu, lane;
> +       struct nd_percpu_lane *ndl_lock, *ndl_count;
> 
> -       cpu = get_cpu();
> -       if (nd_region->num_lanes < nr_cpu_ids) {
> -               struct nd_percpu_lane *ndl_lock, *ndl_count;
> +       cpu = get_cpu_light();
> 
> -               lane = cpu % nd_region->num_lanes;
> -               ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> -               ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> -               if (ndl_count->count++ == 0)
> -                       spin_lock(&ndl_lock->lock);
> -       } else
> -               lane = cpu;
> +       lane = cpu % nd_region->num_lanes;
> +       ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> +       ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> +       if (ndl_count->count++ == 0)
> +               spin_lock(&ndl_lock->lock);

I could not understand why there is a spinlock here with a condition.
Are two tasks sharing the same lane on the same CPU, with the count
increased by the first consumer, so that the other task won't wait/spin
because the count is already elevated?

Or am I misunderstanding it?

Thanks,
Pankaj


> 
>         return lane;
>  }
> @@ -966,17 +963,14 @@ EXPORT_SYMBOL(nd_region_acquire_lane);
> 
>  void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
>  {
> -       if (nd_region->num_lanes < nr_cpu_ids) {
> -               unsigned int cpu = get_cpu();
> -               struct nd_percpu_lane *ndl_lock, *ndl_count;
> -
> -               ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> -               ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> -               if (--ndl_count->count == 0)
> -                       spin_unlock(&ndl_lock->lock);
> -               put_cpu();
> -       }
> -       put_cpu();
> +       unsigned int cpu = get_cpu();

Should this be get_cpu_light() here?

> +       struct nd_percpu_lane *ndl_lock, *ndl_count;
> +
> +       ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> +       ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> +       if (--ndl_count->count == 0)
> +               spin_unlock(&ndl_lock->lock);
> +       put_cpu_light();
>  }
>  EXPORT_SYMBOL(nd_region_release_lane);
> 

Thanks,
Pankaj
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm
