On 1/21/2022 5:27 PM, Luck, Tony wrote:
> On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On baremetal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user thread,
>>> and then daxfs takes action and clears the poison, confirmed by "ndctl
>>> -NM".
>>>
>>> Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
>>> the ghost of the previous poison will surface, and it takes
>>> unload/reload the libnvdimm drivers in order to drive the phantom poison
>>> away, confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injected 8 poisons in one
>>> page via
>>> # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
>
> You say "in one page" here. What is the page size?

The page size is 4K, the base page size on x86.
I said "one page" because 8 (poisons) * 256B = 2KiB, which is only half a page.

>>
>> -jane
>>
>>> then, cleared them all, and wait for 5+ hours, notice the time stamp.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
>>> 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
>>> fsdax_poison_v1:8457 due to hardware memory corruption
>>> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
>>> Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back
>>> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
>>> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
>>> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
>>> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
>>> [mem 0x1850602000-0x1850602fff]
>
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)

Yes.

>
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>>   "dev":"namespace0.0",
>>>   "mode":"fsdax",
>>>   "map":"dev",
>>>   "size":"15.75 GiB (16.91 GB)",
>>>   "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>>   "sector_size":4096,
>>>   "align":2097152,
>>>   "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
>>> Hardware Error Source: 1
>>> [21352.001528] {2}[Hardware Error]: event severity: recoverable
>>> [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
>>> [21352.014156] {2}[Hardware Error]: section_type: memory error
>>> [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
>
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
>

I see, this is the next page... the issue is reproducible with a single
poison injection.

> Is that what you mean by "phantom error" ... from a different
> address from those that were injected?

All 8 poisons were cleared by the driver via DSM, and that was verified by
"ndctl list -NMu -r 0", which covers every page in the 16GiB /dev/pmem.
It's phantom because unloading and reloading libnvdimm, followed by a full
ARS scan, confirms the poison isn't there.

thanks,
-jane

>
> -Tony
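
For anyone following the address arithmetic in this thread, here is a minimal sketch of how the two reported physical addresses split into a page frame number and an in-page offset. It assumes the 4K base page size and the 256B-per-poison figure stated above; the helper name `split_phys_addr` and the constants are made up for illustration and are not kernel or ndctl APIs.

```python
PAGE_SHIFT = 12               # 4 KiB base page on x86, as stated in the thread
PAGE_SIZE = 1 << PAGE_SHIFT
POISON_UNIT = 256             # bytes per poison, from the "8 * 256B" figure above


def split_phys_addr(addr):
    """Return (pfn, offset_in_page) for a physical address."""
    return addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)


# Addresses copied from the console log quoted above.
mce_addr = 0x1850602400       # mce "user-access at 1850602400"
ghes_addr = 0x1850603200      # APEI "physical_address: 0x0000001850603200"

for name, addr in (("mce", mce_addr), ("ghes", ghes_addr)):
    pfn, off = split_phys_addr(addr)
    print(f"{name}: pfn={pfn:#x} offset={off:#x} ({off} bytes)")

# Size of the region covered by the 8 injected poisons.
injected_bytes = 8 * POISON_UNIT
print(f"injected span: {injected_bytes} bytes "
      f"({injected_bytes / PAGE_SIZE:.2f} of a page)")
```

The first address falls in PFN 0x1850602 at offset 0x400 and the second in PFN 0x1850603 at offset 0x200, matching the PFNs Tony points out, so the late report is against the page after the one that raised the original machine check.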