- IIRC, the on-disk file size may differ from the apparent size if you use raidz with non-optimal block sizes (see the property check sketched below)
- Do you run zdb on an exported pool? zdb may show "problems" on imported pools because they are online and changing every second.
- Feel free to discuss it on the zfsonlinux mailing list
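For the raidz point, the pool geometry and the dataset's block-size settings can be checked with something like the following (a minimal sketch; the pool/dataset names are taken from the zfs list output quoted below, substitute your own):

# raidz level and vdev width
zpool status lsi022-OST21
# sector shift; ashift=12 means 4 KiB sectors
zpool get ashift lsi022-OST21
# block-size and compression settings of the OST dataset
zfs get recordsize,compression,copies lsi022-OST21/lsi022-OST000f

On raidz2 with ashift=12, records that are small relative to the vdev width pay a significant parity-and-padding overhead, which shows up as allocated size well above the apparent size.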

24.01.2020, 23:12, "Nick Skingle" <[email protected]>:

Hi All,

 

We are seeing an anomaly across all of our RaidInc Lustre filesystems.

 

Problem description:

File size < size on disk - currently unexplained; the size on disk is 2-3x the file size. A quick per-file check is sketched below.
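One way to quantify the discrepancy per file is to compare stat's apparent size with its allocated-block count (a sketch using a file from the investigation below; du -h reports the allocated blocks, du --apparent-size the byte length):

# apparent size vs. blocks actually allocated for one of the affected files
stat -c '%n: apparent=%s bytes, allocated=%b blocks of %B bytes' \
    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin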

 

Observations:

  1. A potential ZFS filesystem corruption across RaidInc Storage in London?
  2. zdb checks for leaks by walking the entire block tree, constructing the space maps in memory, and then comparing them to the ones stored on disk. If they differ, it reports a leak.
    1. Presuming from the investigation below that the "space leaks" mean the pool is corrupted somehow; zdb (the ZFS debugger) has detected a large number of them.
  3. zdb did not report space leaks on the ZFS Houston SIs.
  4. Does zdb-reported leaked space mean trouble with the pool, and could it explain the file size < disk size discrepancy?
  5. Is it possible that errors were injected due to failover or hardware errors? (A quick error check is sketched after this list.)
  6. The pool seems to be at least inconsistent, which is supposed to never happen with ZFS. Is this indicative of a larger problem? Numerous lockups, etc.?
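Regarding item 5, persistent per-device error counters and the host's recent ZFS event log would be the first things to check on the OSS (a sketch; the pool name is an example taken from the zfs list output below):

# per-device READ/WRITE/CKSUM error counters and any known data errors
zpool status -v lsi022-OST21
# recent ZFS events recorded on this host (ZFS on Linux)
zpool events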

 

Investigation:

For the troubleshooting, the following file, located in WEY, was selected. There are no snapshots, reservations, or quotas involved here.

 

[lconnect03]</users/jerome.cousin>$ du -h --apparent-size /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/aux_data
19K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/descriptor.yaml
104G   /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
14G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_header.bin

[lconnect03]</users/jerome.cousin>$ du -h /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/aux_data
56K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/descriptor.yaml
237G   /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
31G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_header.bin

 

  1. Copy the dataset onto the same storage:
    • Disk usage differs.
    • Checksums match.

 

[lconnect03]</users/jerome.cousin>$ cp -rp /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC

[lconnect03]</users/jerome.cousin>$ md5sum /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/*
md5sum: /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
0826bc74e525697d769248aabcb195cd  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_header.bin

[lconnect03]</users/jerome.cousin>$ md5sum /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/*
md5sum: /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/aux_data: Is a directory
f861b60d2b1b844e5ae252345aa20497  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
e8ac57c241e52b38b60907e4e767b451  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_data.bin
0826bc74e525697d769248aabcb195cd  /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_header.bin

 

[lconnect03]</users/jerome.cousin>$ du -h /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/aux_data
56K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
99G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_data.bin
13G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_header.bin

[lconnect03]</users/jerome.cousin>$ du -h --apparent-size /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/*
33K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/aux_data
19K    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/descriptor.yaml
104G   /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_data.bin
14G    /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_JC/trace_header.bin
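Note that the fresh copy allocates far less (99G/13G) than the original (237G/31G) even though the md5sums match, so the extra allocation on the original is per-block overhead rather than different contents. One way to narrow this down is to inspect the block-level properties of the OST dataset (a sketch; the dataset name is taken from the zfs list output below):

zfs get recordsize,compression,compressratio,copies lsi022-OST21/lsi022-OST000f

If any of these were changed after the original file was written, the old blocks keep the old settings while the copy is written with the current ones, which could explain differing allocation for identical data.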

 

 

  2. Print the OST name hosting the given file.

[lconnect01]</users/jerome.cousin>$ ./lustre-find-ost-for-file /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin
15
/lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin: ['lsi022-OST000f'] (lsi022-oss6.lon.compute.pgs.com)
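For reference, the stock Lustre client tool reports the same mapping, so the result can be cross-checked without the local script:

# stripe layout of the file, including the obdidx of each OST object
lfs getstripe /lus/lsi022/4388cog/p005j02_2010_SRME_1238A018_2copy/trace_data.bin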

 

  3. Run zdb to check for leaks.

[root@lsi022-oss6 ~]# zfs list
NAME                          USED  AVAIL  REFER  MOUNTPOINT
lsi022-OST17                 48.3T  18.5T   219K  none
lsi022-OST17/lsi022-OST0005  48.3T  18.5T  48.3T  none
lsi022-OST19                 49.5T  17.3T   219K  none
lsi022-OST19/lsi022-OST0009  49.5T  17.3T  49.5T  none
lsi022-OST21                 47.3T  19.5T   219K  none
lsi022-OST21/lsi022-OST000f  47.3T  19.5T  47.3T  none
lsi022-OST23                 51.1T  15.7T   219K  none
lsi022-OST23/lsi022-OST0013  51.1T  15.7T  51.1T  none

 

[root@lsi022-oss6 ~]# zdb -b lsi022-OST21
Traversing all blocks to verify nothing leaked ...

loading space map for vdev 0 of 1, metaslab 180 of 181 ...
62.0T completed (12801MB/s) estimated time remaining: 0hr 00min 07sec
leaked space: vdev 0, offset 0x1d80003de000, size 1081344

[…]

See attachment.
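Since zdb on an imported, live pool can report spurious leaks (the pool changes underneath it), the check is more conclusive against quiesced on-disk state. A sketch, assuming the OST can be taken out of service (Lustre stopped or failed over) for the duration:

# on lsi022-oss6, with lsi022-OST000f unmounted/stopped:
zpool export lsi022-OST21
# -e reads the exported pool's on-disk state; add -p <dir> if the
# vdevs do not live under the default device search path
zdb -e -b lsi022-OST21
zpool import lsi022-OST21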

 

Would someone be able to advise, please?

 

Thanks

Nick




____________________________________
Sincerely,
George Melikov

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
