From: "Hatayama, Daisuke/畑山 大輔" <[email protected]> Subject: Re: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process Date: Tue, 25 Mar 2014 14:52:36 +0900
> (2014/03/25 10:14), Atsushi Kumagai wrote:
>>> From: HATAYAMA Daisuke <[email protected]>
>>>
>>> When the --split option is specified, fair I/O workloads should be
>>> assigned to each process to maximize the benefit of parallel
>>> processing.
>>>
>>> However, the current implementation of setup_splitting() in cyclic
>>> mode doesn't care about filtering at all; the I/O workloads of the
>>> processes can easily become biased.
>>>
>>> This patch deals with the issue by implementing the fair I/O workload
>>> assignment as setup_splitting_cyclic().
>>>
>>> Note: If --split is specified in cyclic mode, we do filtering three
>>> times: in get_dumpable_pages_cyclic(), in setup_splitting_cyclic() and
>>> in writeout_dumpfile(). Filtering takes about 10 minutes on a system
>>> with huge memory according to a past benchmark, so it might be
>>> necessary to optimize filtering or setup_filtering_cyclic().
>>
>> Sorry, I lost the result of that benchmark, could you give me the URL?
>> I'd like to confirm that the advantage of fair I/O will exceed the
>> 10 minutes disadvantage.
>>
>
> Here are two benchmarks, by Jingbai Ma and by myself:
>
> http://lists.infradead.org/pipermail/kexec/2013-March/008515.html
> http://lists.infradead.org/pipermail/kexec/2013-March/008517.html
>
> Note that Jingbai Ma's results are the sum of get_dumpable_cyclic() and
> writeout_dumpfile(), so at first glance they look twice as large as
> mine, but actually they show almost the same performance.
>
> In summary, each result shows about 40 seconds per 1 TiB, so most
> systems are not affected very much. On 12 TiB of memory, which is the
> current maximum memory size of a Fujitsu system, we need 480 seconds
> == 8 minutes more. But this cost is stable in the sense that the time
> never suddenly becomes long in some rare worst case, so it seems
> acceptable to me in this sense.
>
> The other ideas to deal with the issue are:
>
> - Parallelize the counting-up processing. But it might be difficult to
>   parallelize the 2nd pass, which seems to be inherently serial
>   processing.
>
> - Instead of doing the 2nd pass, make a terminating process join a
>   still-running process. But it might be cumbersome to implement this
>   without using pthread.

I noticed that it's possible to create a table of dumpable pages with a
relatively small amount of memory by managing memory in blocks. This is
just a kind of page table management.

For example, define a block as a 1 GiB boundary region and assume a
system with 64 TiB of physical memory (the current maximum on x86_64).
Then:

  64 TiB / 1 GiB = 64 Ki blocks

The table considered here holds, in each 8-byte entry, the number of
dumpable pages in the corresponding 1 GiB region. So the total size of
the table is:

  8 B * 64 Ki blocks = 512 KiB

Counting up the dumpable pages in each 1 GiB region can be done in a
single pass; get_dumpable_cyclic() does that too. Then it's possible to
assign the amount of I/O to each process fairly enough: the difference
is at most 1 GiB. If the disk speed is 100 MiB/sec, 1 GiB corresponds
to only about 10 seconds.

If you think 512 KiB is not small enough, it would be possible to
increase the block size a little more. If choosing an 8 GiB block, the
table size is only 64 KiB, and the 8 GiB of data corresponds to about
80 seconds on typical disks. (A rough sketch of this block table idea
is appended at the end of this mail.)

What do you think about this?

Thanks.
HATAYAMA, Daisuke
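
For illustration, here is a rough, untested sketch of the block table
and the splitting based on it. This is not the actual makedumpfile
code; max_pfn, is_dumpable_pfn(), setup_block_table() and
setup_splitting_by_blocks() are names made up for this sketch, and the
dummy main() only shows how the split points would be chosen.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT	12			/* 4 KiB pages */
#define BLOCK_SHIFT	(30 - PAGE_SHIFT)	/* 1 GiB block = 2^18 pfns */

/* Dummy inputs so the sketch is self-contained: a "machine" with 4 GiB
 * of memory where every third pfn is dumpable.  In makedumpfile these
 * would come from the real memory map and the cyclic filtering. */
static uint64_t max_pfn = 4ULL << BLOCK_SHIFT;

static int is_dumpable_pfn(uint64_t pfn)
{
	return pfn % 3 == 0;
}

static uint64_t *block_table;	/* number of dumpable pages per block */
static uint64_t nr_blocks;

/* One filtering pass: count the dumpable pages in each 1 GiB block.
 * For 64 TiB this table is 64 Ki entries * 8 B = 512 KiB. */
static int setup_block_table(void)
{
	uint64_t pfn;

	nr_blocks = (max_pfn + (1ULL << BLOCK_SHIFT) - 1) >> BLOCK_SHIFT;
	block_table = calloc(nr_blocks, sizeof(*block_table));
	if (!block_table)
		return -1;

	for (pfn = 0; pfn < max_pfn; pfn++)
		if (is_dumpable_pfn(pfn))
			block_table[pfn >> BLOCK_SHIFT]++;

	return 0;
}

/* Split [0, max_pfn) into nr_dumpfile pfn ranges so that each range
 * holds roughly the same number of dumpable pages.  Split points are
 * placed on block boundaries, so the imbalance is at most one block. */
static void setup_splitting_by_blocks(uint64_t *start_pfn, uint64_t *end_pfn,
				      int nr_dumpfile)
{
	uint64_t total = 0, per_file, accum = 0, boundary, block;
	int i = 0;

	for (block = 0; block < nr_blocks; block++)
		total += block_table[block];
	per_file = total / nr_dumpfile;		/* fair share per dumpfile */

	start_pfn[0] = 0;
	for (block = 0; block < nr_blocks && i < nr_dumpfile - 1; block++) {
		accum += block_table[block];
		/* Close range i once it has received its fair share. */
		while (i < nr_dumpfile - 1 &&
		       accum >= per_file * (uint64_t)(i + 1)) {
			boundary = (block + 1) << BLOCK_SHIFT;
			if (boundary > max_pfn)
				boundary = max_pfn;
			end_pfn[i] = boundary;
			start_pfn[i + 1] = boundary;
			i++;
		}
	}
	for (; i < nr_dumpfile; i++)
		end_pfn[i] = max_pfn;		/* last range(s) run to the end */
}

int main(void)
{
	enum { NR_DUMPFILE = 4 };
	uint64_t start_pfn[NR_DUMPFILE], end_pfn[NR_DUMPFILE];
	int i;

	if (setup_block_table() < 0)
		return 1;
	setup_splitting_by_blocks(start_pfn, end_pfn, NR_DUMPFILE);

	for (i = 0; i < NR_DUMPFILE; i++)
		printf("dumpfile %d: pfn [%" PRIu64 ", %" PRIu64 ")\n",
		       i, start_pfn[i], end_pfn[i]);

	free(block_table);
	return 0;
}

Because the split points always fall on block boundaries, the imbalance
between processes is bounded by one block (1 GiB here), which is where
the "about 10 seconds" figure above comes from. In makedumpfile itself,
the counting pass would of course reuse the existing cyclic filtering
instead of a per-pfn predicate, and the resulting boundaries would fill
the per-dumpfile start/end pfns that setup_splitting() fills in today.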
