2017-01-09 2:09 GMT+01:00 Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> On Wed, Jan 04, 2017 at 07:58:55AM -0500, Austin S. Hemmelgarn wrote:
>> On 2017-01-03 16:35, Peter Becker wrote:
>> > As I understand the duperemove source code (I have been working on / trying to improve this code for 5 or 6 weeks now, on multiple parts), duperemove does the hashing and calculation before it calls extent_same.
>> > Duperemove stores everything in a hashfile and reads from that. After all files are hashed and the duplicates are detected, it processes everything in order without reading new data from disk / the hashfile, so the byte-by-byte comparison of the extent_same ioctl should consume the full available bandwidth of the disks.
>> Not necessarily. You've actually got a significant amount of processing between each disk operation. General ordering inside the ioctl is:
>> 1. Do generic ioctl setup.
>> 2. Lock the extents.
>> 3. Read the ranges into memory.
>> 4. Compare the ranges.
>> 5. If the ranges are identical, write out the changes needed to reflink them.
>> 6. Unlock all the extents.
>> 7. Do generic ioctl cleanup.
>> 1 and 7 in particular are pretty heavy. Ioctls were not intended to be called with this kind of frequency, and that fact really shows in the setup and teardown (overhead is way higher than a syscall).
>
> Steps 1 and 7 are not heavy at all. ioctl setup is an order of magnitude higher than other system calls, but still up to 11 orders of magnitude faster than the other steps. The other steps are *slow*, and step 5 is orders of magnitude slower than all the others combined.
>
> Most of the time in step 5 is spent deleting the dst extent refs (or waiting for transaction commits, but everything waits for those). It gets worse when you have big files (1G and larger), more extents, and more extent references in the same inode. On a 100G file the overhead of manipulating shared extent refs is so large that the rest of the extent-same ioctl is just noise by comparison (microseconds vs minutes).
>
> The commit 1d57ee9 "btrfs: improve delayed refs iterations" (merged in v4.10-rc1) helps a bit with this, but deleting shared refs is still one of the most expensive things you can do in btrfs.
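For context before my question below: a single call to the current interface (BTRFS_IOC_FILE_EXTENT_SAME with struct btrfs_ioctl_same_args from linux/btrfs.h) looks roughly like the minimal sketch below. The file names, offsets and the 1MB length are placeholders only, and error handling is trimmed; this is the one-range-at-a-time pattern that duperemove drives once per duplicate extent, i.e. the setup/lock/compare/reflink cycle discussed above.

/* Minimal sketch of one extent-same call; paths, offsets and the 1MB
 * length are placeholders, error handling is mostly omitted. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(void)
{
	int src = open("src.bin", O_RDONLY);
	int dst = open("dst.bin", O_RDWR);
	if (src < 0 || dst < 0)
		return 1;

	/* one args header plus one info entry per destination range */
	struct btrfs_ioctl_same_args *args =
		calloc(1, sizeof(*args) +
			  sizeof(struct btrfs_ioctl_same_extent_info));
	args->logical_offset = 0;            /* offset in the source file */
	args->length = 1024 * 1024;          /* one 1MB range per call */
	args->dest_count = 1;
	args->info[0].fd = dst;
	args->info[0].logical_offset = 0;    /* offset in the destination file */

	/* the kernel locks both ranges, reads and compares them, and
	 * reflinks dst to src only if the bytes are identical */
	if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0)
		perror("BTRFS_IOC_FILE_EXTENT_SAME");
	else if (args->info[0].status == BTRFS_SAME_DATA_DIFFERS)
		fprintf(stderr, "data differs, nothing deduped\n");
	else if (args->info[0].status < 0)
		fprintf(stderr, "dedupe failed: %s\n",
			strerror(-args->info[0].status));
	else
		printf("deduped %llu bytes\n",
		       (unsigned long long)args->info[0].bytes_deduped);

	free(args);
	return 0;
}

As far as I know, newer kernels also expose the same operation generically as FIDEDUPERANGE in linux/fs.h, but I use the btrfs-specific name here because it is the one discussed in this thread.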
Would it not be better to bulk-insert all extent refs and bulk-delete the dst extent refs?

I am currently looking intensively into the btrfs implementation and the extent_same ioctl, but I am still at a very early stage of planning (understand and plan). I'm not sure whether this task is feasible for a newcomer to btrfs source/development.

I plan to introduce a new extent_same ioctl, called btrfs_extent_same_v2, which accepts an array of the parameters of the current btrfs_extent_same function (struct inode *src, u64 loff, u64 olen, struct inode *dst, u64 dst_loff). This would reduce the number of ioctl calls and allow bulk insert, bulk remove, and so on. Additionally, the new version should introduce an options parameter, for example to bypass the byte-by-byte comparison if the caller really knows what he is doing. A rough sketch of what the argument structure could look like follows after the quoted context below.

All suggestions and hints for this are very welcome.

>> The operation ended up being an ioctl instead of a syscall (or extension to another syscall) because:
>> 1. Manipulating low-level filesystem state is part of what they're intended to be used for.
>> 2. Introducing a new FS specific ioctl is a whole lot less controversial than introducing a new FS specific syscall.
>>
>> > 1. dbfile_load_hashes
>> > 2. find_all_dupes
>> > 3. dedupe_results
>> > -> call the following in N threads:
>> >> dedupe_extent_list
>> >>> list_for_each_entry
>> >>>> add_extent_to_dedupe  # produce a simple list/queue
>> >>>> dedupe_extents
>> >>>>> btrfs_extent_same
>> >>>>>> BTRFS_IOC_FILE_EXTENT_SAME
>> >
>> > So if this is right, one of these things is really slow:
>> >
>> > 1. byte-per-byte comparison
>> There's no way that this part can't be slow. You need to load the data into the registers to do the comparison, you can't just point something at RAM and get an answer. On x86, this in turn means that the comparison amounts to a loop of 2 loads followed by a compare and a branch, repeated once for each range beyond the first, and that's assuming that the compiler optimizes it to the greatest degree possible. On some other systems the compare and branch are one instruction, on others the second load might be eliminated, but overall it's not something that can be sped up all that much.
>
> On cheap amd64 machines this can be done at gigabytes per second. Not much gain from optimizing this.
>
>> > 2. sets up the reflinks
>> This actually is not as efficient as it sounds like it should be; adding reflinks means updating metadata, which means that there is some unavoidable overhead here. I doubt that it's where the issue is, but I may be wrong.
>
> Most of the time spent here is spent waiting for IO. extent-same seems to imply fsync() with all the performance cost thereof.
>
>> > 3. unlocks the new extent
>> There's one other aspect not listed here, locking the original extents, which can actually add quite a lot of overhead if the files are actually being used.
>> >
>> > If I'm not wrong with my understanding of the duperemove source code, this behavior should also affect the online dedupe feature that Qu Wenruo works on.
>> AFAIK, that uses a different code path from the batch deduplication ioctl. It also doesn't have the context switches and other overhead from an ioctl involved, because it's done in kernel code.
>
> No difference there--the extent-same ioctl is all kernel code too.
>
>> > 2017-01-03 21:40 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
>> >> On 2017-01-03 15:20, Peter Becker wrote:
>> >>> I think I understand.
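To make the proposal above a little more concrete, here is a rough sketch of what a batched argument structure for a hypothetical btrfs_extent_same_v2 could look like. This is purely hypothetical: none of these names, fields or flags exist in the kernel headers, and the skip-compare flag is only meant to illustrate the proposed options parameter, not a finished design.

/* Hypothetical layout only -- nothing below exists in linux/btrfs.h.
 * The idea is to pass many (src, src_off, len, dst, dst_off) tuples in a
 * single ioctl so the kernel can batch the extent ref insert/delete work
 * and amortize transaction commits across the whole batch. */
#include <linux/types.h>

/* caller guarantees the ranges are equal; skip the byte-by-byte compare */
#define BTRFS_EXTENT_SAME_V2_SKIP_CMP	(1ULL << 0)

struct btrfs_ioctl_same_range_v2 {
	__s64 src_fd;		/* in  - source file */
	__u64 src_offset;	/* in  - start of the range in the source */
	__u64 length;		/* in  - length of the range */
	__s64 dst_fd;		/* in  - destination file */
	__u64 dst_offset;	/* in  - start of the range in the destination */
	__u64 bytes_deduped;	/* out - bytes actually deduped for this range */
	__s32 status;		/* out - 0, <0 on error, >0 if the data differs */
	__u32 reserved;
};

struct btrfs_ioctl_same_args_v2 {
	__u64 flags;		/* in  - e.g. BTRFS_EXTENT_SAME_V2_SKIP_CMP */
	__u32 range_count;	/* in  - number of entries in ranges[] */
	__u32 reserved;
	struct btrfs_ioctl_same_range_v2 ranges[];	/* in/out */
};

Keeping per-range status and bytes_deduped fields, like the info[] array of the current ioctl, would let the kernel report partial success when some ranges differ or fail while the rest of the batch still gets deduped.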
>> >>> The resulting key question is how I can improve the performance of the extent_same ioctl.
>> >>> I tested it with the following results:
>> >>>
>> >>> Environment:
>> >>> 2 files, called "file", each 100GB in size, duperemove with the nofiemap option set, 1MB extent size.
>> >>>
>> >>> duperemove output:
>> >>> [0x1908590] (13889/72654) Try to dedupe extents with id d1c672db
>> >>> [0x1908590] Add extent for file "/mnt/new/file" at offset 66.7G (6)
>> >>> [0x1908590] Add extent for file "/mnt/old/file" at offset 66.9G (7)
>> >>> [0x1908590] Dedupe 1 extents (id: d1c672db) with target: (66.7G, 1.0M), "/mnt/new/file"
>> >>>
>> >>> iostat output for a 30 sec. sample:
>> >>>
>> >>> avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
>> >>>           22,31   0,00    13,83    33,81    0,00  30,05
>> >>>
>> >>> Device:  rrqm/s  wrqm/s      r/s   w/s    rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> >>> sdd        0,00    1,70  1149,93  0,73  4600,53  139,60      8,24      0,23   0,20     0,19    13,64   0,19  21,84
>> >>> sde        0,00    0,00  1149,33  0,53  4597,33   23,87      8,04      0,20   0,18     0,18     1,75   0,18  20,47
>> >>> sdf        0,00    1,70  1149,60  0,63  4598,40  139,60      8,24      0,21   0,18     0,18     4,63   0,18  20,63
>> >>> sdh        0,00    0,00  1149,33  0,53  4597,33   23,87      8,04      0,21   0,18     0,18     4,25   0,18  20,85
>> >>>
>> >>> resulting in less than 18MB/s read. Really slow.
>> >>>
>> >>> Question 1: why so slow?
>> >>
>> >> For a couple of reasons. First, you have to understand that duperemove itself actually does a pretty large amount of processing outside of the call to the ioctl. It first hashes the blocks for quicker comparison and matching, then figures out which blocks match, and finally calls the ioctl on the resulting matches. The reason for this behavior is that the ioctl is insanely slow. It first locks the ranges passed in (so they don't get changed by anything else during the deduplication process), then does a byte-by-byte comparison to make sure they all actually do match (data safety; I've said at least once before that I think there should be a flag for the ioctl (or a separate ioctl) to skip this and assume that userspace really knows what it's doing), then finally sets up the reflinks, and unlocks the new extent.
>
> I've never found the ioctl's lock/read/compare operation taking any more than 1.0% of the time. A dedup agent may spend 20% of its time running the extent-same ioctl (with the other 80% being metadata traversal and block reads/hashes). Within the extent-same ioctl the "set up the reflinks" step is 99% of the time, often more.
>
> The ioctl can read from the cache, so if userspace reads all the data immediately before calling the extent-same ioctl then the read/lock/compare component of the ioctl is negligible.
>
>> >> All of this ties into why I keep telling people that efficient deduplication requires a tool that understands how the data being deduplicated is structured. By avoiding the need to hash and compare every block of data, you can significantly improve the time that part takes, and quite often this will mitigate the impact of getting a few false positives passed into the ioctl.
>> >>>
>> >>> Question 2a: would a higher extent size perform better?
>
> Most of the overhead of deleting shared extents depends on the number of extents, so using larger extents is helpful up to the maximum extent size; however, the extent-same ioctl cannot make extents larger.
> If the source file was fragmented to begin with, then using larger sizes in the extent-same ioctl will not have any effect.
>
> Given a choice of extents to deduplicate, it helps to choose the larger extents to keep if possible (i.e. delete/replace the smaller extents with references to the larger ones); however, the tradeoff is that this only allows dedup along existing extent boundaries (i.e. no files over a few MB in size will ever be deduped).
>
>> >>> Question 2b: or did I understand something wrong?
>> >>
>> >> No, a larger extent would probably not help much, and that's actually a really good performance sample you've created.
>> >>
>> >> The block size does end up being somewhat of a trade-off. Ideally, you want it matched to the smallest possible chunk of duplicate data greater than or equal to the filesystem block size for maximal space efficiency. Doing this however makes the extra processing done by duperemove take exponentially longer, because it has to calculate hashes for more blocks (this has very low impact until you get to very small block sizes), and has to make exponentially more comparisons (this has a very big impact as you shrink the block size; just halving the block size will roughly quadruple the time it takes to make the comparisons).
>> >>>
>> >>> 2017-01-03 20:37 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
>> >>>> On 2017-01-03 14:21, Peter Becker wrote:
>> >>>>> All invocations are justified, but not relevant in (offline) backup and archive scenarios.
>> >>>>>
>> >>>>> For example, you have multiple versions of append-only log files or append-only db files (each more than 100GB in size), like this:
>> >>>>>
>> >>>>> > Snapshot_01_01_2017
>> >>>>>   -> file1.log .. 201 GB
>> >>>>> > Snapshot_02_01_2017
>> >>>>>   -> file1.log .. 205 GB
>> >>>>> > Snapshot_03_01_2017
>> >>>>>   -> file1.log .. 221 GB
>> >>>>>
>> >>>>> The first 201 GB would be the same every time. The files are copied at night from Windows, Linux or BSD systems and snapshotted after the copy.
>> >>>>>
>> >>>>> So a fast way to dedupe this is needed. Using 128KB blocks would result in 1646592 extents per snapshot; a 1MB blocksize results in 205824 extents (not bad, but still terrible speed). I will test it at night with a patched version of duperemove with a 100MB blocksize, but I have no hope that the throughput increases thereby.
>> >>>>
>> >>>> Deduplication is not a general purpose thing (usually you have very specifically structured data), but duperemove is supposed to be a general purpose tool. It works fine for two of the most common cases (deduplicating large numbers of small files or small numbers of large files), but it is sub-optimal for those cases, and will be for almost any other case. This is a canonical example though of a case where you can use a custom script or program to figure out what's duplicated and then have that just call the ioctl as appropriate itself. Most cases where you are storing some kind of well structured data fall into this category. In fact, both of the cases where I use deduplication myself fall into such a category.
>> >> One case involves multiple directories that are partial copies of a larger tree, which are kept in sync with the larger tree and each other. In that particular case, I care about whole file deduplication, so I have a script that just matches on path relative to the roots of each copy and the master copy, verifies that the mtime and size are the same, and if so calls the ioctl for deduplication (with some fancy processing to fit within the max size supported by the ioctl and prevent tiny tail extents). The other case is a set of archives with a pretty solid fixed structure to them. In that case, I have a different script that knows enough about the file structure to know where to look for duplicate blocks, thus avoiding having to hash the whole files.
>> >>
>> >> The append-only log/database case fits this type of thing perfectly: for each subsequent file, you already know that (most of) the file up to the length of the previous file is duplicated, so you can just split that however you want into chunks, pass those to the dedupe ioctl, and free up most of the space that would be used by the new file. You can then run duperemove with a hash-file to process any new blocks beyond the point you deduplicated up to, to reclaim any excess space (currently this will process the whole file, but it should see that most of it is deduplicated already).
>> >>>
>> >>> For backup and archive scenarios, the checksum feature and the dup data/metadata feature of btrfs are really nice, in particular if one considers the legally prescribed storage time of 7 years.
>> >>>
>> >>> 2017-01-03 13:40 GMT+01:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
>> >>>> On 2016-12-30 15:28, Peter Becker wrote:
>> >>>>> Hello, I have an 8 TB volume with multiple files of hundreds of GB each. I am trying to dedupe this because the first hundred GB of many files are identical.
>> >>>>> With a 128KB blocksize and the nofiemap and lookup-extents=no options, it will take more than a week (dedupe only, already hashed). So I tried -b 100M, but this returned the error "Blocksize is bounded ...".
>> >>>>>
>> >>>>> The reason is that the blocksize is limited to
>> >>>>>
>> >>>>> #define MAX_BLOCKSIZE (1024U*1024)
>> >>>>>
>> >>>>> but I can't find any description of why.
>> >>>>>>
>> >>>>>> Beyond what Xin mentioned (namely that 1MB is a much larger block than will be duplicated in most data-sets), there are a couple of other reasons:
>> >>>>>> 1. Smaller blocks will actually get you better deduplication on average because they're more likely to match. As an example, assume you have 2 files with the same 8 4k blocks in different orders:
>> >>>>>>    FileA: 1 2 3 4 5 6 7 8
>> >>>>>>    FileB: 7 8 5 6 3 4 1 2
>> >>>>>> In such a case, deduplicating at any block size above 8k would result in zero deduplication between these files, while 8k or less would completely deduplicate them. This is of course a highly specific and somewhat contrived example (in most cases it will be scattered duplicate blocks over dozens of files), but it does convey this specific point.
>> >>>>>> 2. The kernel will do a byte-wise comparison of all ranges you pass into the ioctl at the same time. Larger block sizes here mean that:
>> >>>>>>    a) The extents will be locked longer, which will prevent any I/O to the files being deduplicated for the duration of the comparison, which may in turn cause other issues on the system.
>> >>>>>>    b) The deduplication process will be stuck in uninterruptible sleep longer, which on many systems will trigger hung task detection, which will in turn either spam the system log or panic the system depending on how it's configured.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html