On Mon, Feb 08, 2021 at 11:11:47PM +0100, Goffredo Baroncelli wrote: > On 2/7/21 11:06 PM, Chris Murphy wrote: > > systemd-journald journals on Btrfs default to nodatacow, upon log > > rotation it's submitted for defragmenting with BTRFS_IOC_DEFRAG. The > > result looks curious. I can't tell what the logic is from the results. > > > > The journal file starts out being fallocated with a size of 8MB, and > > as it grows there is an append of 8MB increments, also fallocated. > > This leads to a filefrag -v that looks like this (ext4 and btrfs > > nodatacow follow the same behavior, both are provided for reference): > > > > ext4 > > https://pastebin.com/6vuufwXt > > > > btrfs > > https://pastebin.com/Y18B2m4h > > > > Following defragment with BTRFS_IOC_DEFRAG it looks like this: > > https://pastebin.com/1ufErVMs > > > > It appears at first glance to be significantly more fragmented. Closer > > inspection shows that most of the extents weren't relocated. But > > what's up with the peculiar interleaving? Is this an improvement over > > the original allocation? > > I am not sure how read the filefrag output: I see several lines like > [...] > 5: 1691.. 1693: 125477.. 125479: 3: > 6: 1694.. 1694: 125480.. 125480: 1: > unwritten > [...] > > What means "unwritten" ? The kernel documentation [*] says: > [...] > * FIEMAP_EXTENT_UNWRITTEN > Unwritten extent - the extent is allocated but its data has not been > initialized. This indicates the extent's data will be all zero if read > through the filesystem but the contents are undefined if read directly from > the device. > [..] > So it seems that the data didn't touch the platters (!) > > My educate guess is that there is something strange in the sequence: > - write > - sync > - close log > - move log > - defrag log > > May be the defrag starts before all the data reach the platters ?
defrag will put the file's contents back into delalloc, and it won't be allocated until a flush (fsync, sync, or commit interval). Defrag is roughly equivalent to simply copying the data to a new file in btrfs, except the logical extents are atomically updated to point to the new location. FIEMAP has an option flag to sync the data before returning a map. DEFRAG has an option to start IO immediately so it will presumably be done by the time you look at the extents with FIEMAP. > For what matters, I create a file with the same fragmentation like your one > > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 0: 1597171.. 1597171: 1: > 1: 1.. 1599: 163433285.. 163434883: 1599: 1597172: > 2: 1600.. 1607: 1601255.. 1601262: 8: 163434884: > 3: 1608.. 1689: 1604137.. 1604218: 82: 1601263: > 4: 1690.. 1690: 1597484.. 1597484: 1: 1604219: > 5: 1691.. 1693: 1597465.. 1597467: 3: 1597485: > 6: 1694.. 1694: 1597966.. 1597966: 1: 1597468: > 7: 1695.. 1722: 1599557.. 1599584: 28: 1597967: > 8: 1723.. 1723: 1599211.. 1599211: 1: 1599585: > 9: 1724.. 1955: 1648394.. 1648625: 232: 1599212: > 10: 1956.. 1956: 1599695.. 1599695: 1: 1648626: > 11: 1957.. 2047: 1625881.. 1625971: 91: 1599696: > 12: 2048.. 2417: 1648804.. 1649173: 370: 1625972: > 13: 2418.. 2420: 1597468.. 1597470: 3: 1649174: > 14: 2421.. 2478: 1624667.. 1624724: 58: 1597471: > 15: 2479.. 2479: 1596416.. 1596416: 1: 1624725: > 16: 2480.. 2482: 1601045.. 1601047: 3: 1596417: > 17: 2483.. 2483: 1596854.. 1596854: 1: 1601048: > 18: 2484.. 2523: 1602715.. 1602754: 40: 1596855: > 19: 2524.. 2527: 1597471.. 1597474: 4: 1602755: > 20: 2528.. 2598: 1624725.. 1624795: 71: 1597475: > 21: 2599.. 2599: 1596858.. 1596858: 1: 1624796: > 22: 2600.. 2607: 1601263.. 1601270: 8: 1596859: > 23: 2608.. 2608: 1596863.. 1596863: 1: 1601271: > 24: 2609.. 2611: 1601271.. 1601273: 3: 1596864: > 25: 2612.. 2612: 1596864.. 1596864: 1: 1601274: > 26: 2613.. 2615: 1601274.. 1601276: 3: 1596865: > 27: 2616.. 2616: 1596981.. 1596981: 1: 1601277: > 28: 2617.. 2691: 1649174.. 1649248: 75: 1596982: > 29: 2692.. 2696: 1597475.. 1597479: 5: 1649249: > 30: 2697.. 2756: 1634995.. 1635054: 60: 1597480: > 31: 2757.. 2758: 1597480.. 1597481: 2: 1635055: > 32: 2759.. 2762: 1601351.. 1601354: 4: 1597482: > 33: 2763.. 2764: 1597482.. 1597483: 2: 1601355: > 34: 2765.. 2837: 1649249.. 1649321: 73: 1597484: > 35: 2838.. 2838: 1597038.. 1597038: 1: 1649322: > 36: 2839.. 2855: 1601538.. 1601554: 17: 1597039: > 37: 2856.. 2856: 1597045.. 1597045: 1: 1601555: > 38: 2857.. 2904: 1624547.. 1624594: 48: 1597046: > 39: 2905.. 2926: 1600795.. 1600816: 22: 1624595: > 40: 2927.. 2942: 1602034.. 1602049: 16: 1600817: > 41: 2943.. 2963: 1600817.. 1600837: 21: 1602050: > 42: 2964.. 2979: 1602183.. 1602198: 16: 1600838: > 43: 2980.. 3001: 1600927.. 1600948: 22: 1602199: > 44: 3002.. 3043: 1621164.. 1621205: 42: 1600949: > 45: 3044.. 3053: 1599231.. 1599240: 10: 1621206: > 46: 3054.. 3066: 1601952.. 1601964: 13: 1599241: > 47: 3067.. 3067: 1597056.. 1597056: 1: 1601965: > 48: 3068.. 3084: 1602375.. 1602391: 17: 1597057: > 49: 3085.. 3094: 1599290.. 1599299: 10: 1602392: > 50: 3095.. 3096: 1601355.. 1601356: 2: 1599300: > 51: 3097.. 3107: 1600717.. 1600727: 11: 1601357: > 52: 3108.. 3156: 1642892.. 1642940: 49: 1600728: > 53: 3157.. 3157: 1597059.. 1597059: 1: 1642941: > 54: 3158.. 3251: 1649322.. 1649415: 94: 1597060: > 55: 3252.. 3254: 1599241.. 1599243: 3: 1649416: > 56: 3255.. 3304: 1645466.. 1645515: 50: 1599244: > 57: 3305.. 3305: 1597100.. 1597100: 1: 1645516: > 58: 3306.. 3312: 1601357.. 1601363: 7: 1597101: > 59: 3313.. 3319: 1599300.. 1599306: 7: 1601364: > 60: 3320.. 3331: 1601611.. 1601622: 12: 1599307: > 61: 3332.. 3339: 1600838.. 1600845: 8: 1601623: > 62: 3340.. 3343: 1601419.. 1601422: 4: 1600846: > 63: 3344.. 3351: 1600846.. 1600853: 8: 1601423: > 64: 3352.. 3432: 1649416.. 1649496: 81: 1600854: > 65: 3433.. 3433: 1597109.. 1597109: 1: 1649497: > 66: 3434.. 3489: 1649497.. 1649552: 56: 1597110: > 67: 3490.. 3491: 1599227.. 1599228: 2: 1649553: > 68: 3492.. 3521: 1619348.. 1619377: 30: 1599229: > 69: 3522.. 3523: 1599307.. 1599308: 2: 1619378: > 70: 3524.. 3530: 1601688.. 1601694: 7: 1599309: > 71: 3531.. 3539: 1600949.. 1600957: 9: 1601695: > 72: 3540.. 3579: 1629356.. 1629395: 40: 1600958: > 73: 3580.. 3580: 1597124.. 1597124: 1: 1629396: > 74: 3581.. 3601: 1604219.. 1604239: 21: 1597125: > 75: 3602.. 3603: 1599585.. 1599586: 2: 1604240: > 76: 3604.. 3614: 1602636.. 1602646: 11: 1599587: > 77: 3615.. 3616: 1599587.. 1599588: 2: 1602647: > 78: 3617.. 3677: 1649553.. 1649613: 61: 1599589: > 79: 3678.. 3680: 1599692.. 1599694: 3: 1649614: > 80: 3681.. 3723: 1647818.. 1647860: 43: 1599695: > 81: 3724.. 3726: 1599821.. 1599823: 3: 1647861: > 82: 3727.. 3756: 1622218.. 1622247: 30: 1599824: > 83: 3757.. 3759: 1600630.. 1600632: 3: 1622248: > 84: 3760.. 3766: 1603288.. 1603294: 7: 1600633: > 85: 3767.. 3768: 1600633.. 1600634: 2: 1603295: > 86: 3769.. 3950: 76053306.. 76053487: 182: 1600635: > 87: 3951.. 3958: 1600958.. 1600965: 8: 76053488: > 88: 3959.. 3986: 1619921.. 1619948: 28: 1600966: > 89: 3987.. 3995: 1600966.. 1600974: 9: 1619949: > 90: 3996.. 4036: 1649614.. 1649654: 41: 1600975: > 91: 4037.. 4045: 1600975.. 1600983: 9: 1649655: > 92: 4046.. 4050: 1601423.. 1601427: 5: 1600984: > 93: 4051.. 4052: 1600854.. 1600855: 2: 1601428: > 94: 4053.. 4055: 1601555.. 1601557: 3: 1600856: > 95: 4056.. 4056: 1597129.. 1597129: 1: 1601558: > 96: 4057.. 4059: 1601745.. 1601747: 3: 1597130: > 97: 4060.. 4060: 1597134.. 1597134: 1: 1601748: > 98: 4061.. 4063: 1602050.. 1602052: 3: 1597135: > 99: 4064.. 4064: 1597137.. 1597137: 1: 1602053: > 100: 4065.. 4079: 1604297.. 1604311: 15: 1597138: > 101: 4080.. 4088: 1600987.. 1600995: 9: 1604312: > 102: 4089.. 4095: 1603295.. 1603301: 7: 1600996: > 103: 4096.. 4106: 1600996.. 1601006: 11: 1603302: > 104: 4107.. 4117: 1622600.. 1622610: 11: 1601007: > 105: 4118.. 4119: 1601007.. 1601008: 2: 1622611: > 106: 4120.. 4129: 1622611.. 1622620: 10: 1601009: > 107: 4130.. 4131: 1601009.. 1601010: 2: 1622621: > 108: 4132.. 4141: 1622621.. 1622630: 10: 1601011: > 109: 4142.. 4145: 1601011.. 1601014: 4: 1622631: > 110: 4146.. 4155: 1622986.. 1622995: 10: 1601015: > 111: 4156.. 4157: 1601015.. 1601016: 2: 1622996: > 112: 4158.. 4168: 1622996.. 1623006: 11: 1601017: > 113: 4169.. 4170: 1601017.. 1601018: 2: 1623007: > 114: 4171.. 4180: 1623007.. 1623016: 10: 1601019: > 115: 4181.. 4182: 1601019.. 1601020: 2: 1623017: > 116: 4183.. 4192: 1624473.. 1624482: 10: 1601021: > 117: 4193.. 4195: 1601021.. 1601023: 3: 1624483: > 118: 4196.. 4205: 1624796.. 1624805: 10: 1601024: > 119: 4206.. 4207: 1601024.. 1601025: 2: 1624806: > 120: 4208.. 4217: 1624806.. 1624815: 10: 1601026: > 121: 4218.. 4220: 1601026.. 1601028: 3: 1624816: > 122: 4221.. 4230: 1625972.. 1625981: 10: 1601029: > 123: 4231.. 4408: 1648626.. 1648803: 178: 1625982: > 124: 4409.. 4411: 1602199.. 1602201: 3: 1648804: > 125: 4412.. 4434: 1601328.. 1601350: 23: 1602202: > 126: 4435.. 4437: 1602647.. 1602649: 3: 1601351: > 127: 4438.. 4439: 1601029.. 1601030: 2: 1602650: > 128: 4440.. 4442: 1602755.. 1602757: 3: 1601031: > 129: 4443.. 4480: 1601650.. 1601687: 38: 1602758: > 130: 4481.. 4491: 1629530.. 1629540: 11: 1601688: > 131: 4492.. 4560: 1624404.. 1624472: 69: 1629541: > 132: 4561.. 4571: 1629541.. 1629551: 11: 1624473: > 133: 4572.. 4582: 1601031.. 1601041: 11: 1629552: > 134: 4583.. 4586: 1603302.. 1603305: 4: 1601042: > 135: 4587.. 4620: 1602537.. 1602570: 34: 1603306: > 136: 4621.. 4631: 1629716.. 1629726: 11: 1602571: > 137: 4632.. 4634: 1601042.. 1601044: 3: 1629727: > 138: 4635.. 6143: 156004864.. 156006372: 1509: 1601045: last,eof > data.txt: 139 extents found > > the I tried to defrag it > > $ btrfs fi defra data.txt > $ sudo filefrag -v data.txt > Filesystem type is: 9123683e > File size of data.txt is 25165824 (6144 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 6143: 164002967.. 164009110: 6144: last,eof > data.txt: 1 extent found > > So it seems that the defrag works Be very careful how you set up this test case. If you use fallocate on a file, it has a _permanent_ effect on the inode, and alters a lot of normal btrfs behavior downstream. You won't see these effects if you just write some data to a file without using prealloc. > [*] https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt > > > > https://pastebin.com/1ufErVMs > > > > If I unwind the interleaving, it looks like all the extents fall into > > two localities and within each locality the extents aren't that far > > apart - so my guess is that this file is also not meaningfully > > fragmented, in practice. Surely the drive firmware will reorder the > > reads to arrive at the least amount of seeks? > > > > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5