> >
> > Hi Yongxin,
> >
> > >
> > > Hi experts,
> > >
> > > Could anyone tell me whether Linux nvdimm driver is RT compatible?
> > >
> > > When I was testing PMEM performance with fio using the following command,
> > > I
> > > got calltrace below.
> > >
> > > # fio -filename=/dev/pmem0s -direct=1 -iodepth 1 -thread -rw=randrw
> > > -rwmixread=70 -ioengine=psync -bs=16k -size=1G -numjobs=30 -runtime=100
> > > -group_reporting -name=mytest
> > >
> > > BUG: scheduling while atomic: fio/2514/0x00000002
> > > Modules linked in: intel_rapl nd_pmem nd_btt skx_edac iTCO_wdt
> > > iTCO_vendor_support intel_powerclamp coretemp
> > > crct10dif_pclmul crct10dif_common aesni_intel aes_x86_64 crypto_simd
> > > cryptd i40e nvme glue_helper nvme_core lpc_ich i2c_i801 nfit
> > > pcc_cpufreq wmi libnvdimm acpi_pad
> > > acpi_power_meter
> > > Preemption disabled at:
> > > [<ffffffffc03608d9>] nd_region_acquire_lane+0x19/0x80 [libnvdimm]
> > > CPU: 44 PID: 2514 Comm: fio Tainted: G W
> > > 4.18.20-rt8-preempt-rt #1
> > > Call Trace:
> > > dump_stack+0x4f/0x6a
> > > ? nd_region_acquire_lane+0x19/0x80 [libnvdimm]
> > > __schedule_bug.cold.17+0x38/0x55
> > > __schedule+0x484/0x6c0
> > > ? _raw_spin_lock+0x17/0x40
> > > schedule+0x3d/0xe0
> > > rt_spin_lock_slowlock_locked+0x118/0x2a0
> > > rt_spin_lock_slowlock+0x57/0x90
> > > rt_spin_lock+0x52/0x60
> > > btt_write_pg.isra.16+0x280/0x4b0 [nd_btt]
> > > btt_make_request+0x1b1/0x320 [nd_btt]
> > > generic_make_request+0x1dc/0x3f0
> > > submit_bio+0x49/0x140
> > >
> > > nd_region_acquire_lane() disables preemption with get_cpu() which causes
> > > "scheduling while atomic" spews on RT.
> > >
> > > Is it safe to replace get_cpu()/put_cpu() with
> > > get_cpu_light()/put_cpu_light() in
> > > nd_region_acquire_lane()/nd_region_release_lane()?
> >
> > get_cpu() disables preemption on a core, whereas get_cpu_light() only
> > prevents task migration across cores. If we change this to
> > get/put_cpu_light(), I think there is a possibility of a race here.
> >
> > > After this replacement, the code protected by
> > > nd_region_acquire_lane()/nd_region_release_lane() would become
> > > preemptible. So is this code reentrant?
> >
> > Another thing I see: when get_cpu() is called in nd_region_acquire_lane(),
> > there is no corresponding put_cpu() in the same function; it is called
> > only in nd_region_release_lane(). This means preemption stays disabled
> > for the whole span between these calls.
> >
> > Here, for mainline, we cannot change the preemption disable to a
> > migration disable, as there is code that could then race.
> >
> > IIUC, an RT-only patch would have the same problem: the spin_lock is
> > taken only on a condition, and the path before/without the spinlock
> > could also race. We could use something like local_lock_* =>
> > local_lock_cpu.
> >
> > Those are my two cents.
>
> The btt implementation assumes that the btt code will not be
> re-entered on the same cpu. So the pre-emption disable is required for
> performance, but for correctness the implementation can fall back to
> using a lock along with migration being disabled. In other words I
> think changing nd_region_acquire_lane() like the following would be
> sufficient for RT kernels:
o.k.
I am not sure it is a good idea to have different implementations for
mainline and the RT kernel, if it is at all possible to have a single
implementation. But I don't understand all of the optimizations in the
original code, so this might be okay.
>
> diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
> index e2818f94f292..e720fc42bbc0 100644
> --- a/drivers/nvdimm/region_devs.c
> +++ b/drivers/nvdimm/region_devs.c
> @@ -947,18 +947,15 @@ int nd_blk_region_init(struct nd_region *nd_region)
> unsigned int nd_region_acquire_lane(struct nd_region *nd_region)
> {
> unsigned int cpu, lane;
> + struct nd_percpu_lane *ndl_lock, *ndl_count;
>
> - cpu = get_cpu();
> - if (nd_region->num_lanes < nr_cpu_ids) {
> - struct nd_percpu_lane *ndl_lock, *ndl_count;
> + cpu = get_cpu_light();
>
> - lane = cpu % nd_region->num_lanes;
> - ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> - ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> - if (ndl_count->count++ == 0)
> - spin_lock(&ndl_lock->lock);
> - } else
> - lane = cpu;
> + lane = cpu % nd_region->num_lanes;
> + ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> + ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> + if (ndl_count->count++ == 0)
> + spin_lock(&ndl_lock->lock);
I could not understand why the spinlock is taken conditionally here.
Is it that two tasks on the same CPU share the same lane, the count is
raised by the first consumer, and the other task then won't wait/spin
because the count is already elevated?
Or am I misunderstanding it?
Thanks,
Pankaj
>
> return lane;
> }
> @@ -966,17 +963,14 @@ EXPORT_SYMBOL(nd_region_acquire_lane);
>
> void nd_region_release_lane(struct nd_region *nd_region, unsigned int lane)
> {
> - if (nd_region->num_lanes < nr_cpu_ids) {
> - unsigned int cpu = get_cpu();
> - struct nd_percpu_lane *ndl_lock, *ndl_count;
> -
> - ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> - ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> - if (--ndl_count->count == 0)
> - spin_unlock(&ndl_lock->lock);
> - put_cpu();
> - }
> - put_cpu();
> + unsigned int cpu = get_cpu();
Should this be get_cpu_light() here?
> + struct nd_percpu_lane *ndl_lock, *ndl_count;
> +
> + ndl_count = per_cpu_ptr(nd_region->lane, cpu);
> + ndl_lock = per_cpu_ptr(nd_region->lane, lane);
> + if (--ndl_count->count == 0)
> + spin_unlock(&ndl_lock->lock);
> + put_cpu_light();
> }
> EXPORT_SYMBOL(nd_region_release_lane);
>
Thanks,
Pankaj
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm