Re: [v2 0/9] Early boot time stamps for x86
Hi Thomas, Thank you very much for a very insightful feedback. I will address your comments, and if I have any questions, I will ask them before sending out the next patchset. A few replies below: First of all, this "solution" is only valid for a very restricted set of systems and breaks others in very subtle ways. FYI, the architecture name is x86, which handles 32bit/64bit, various vendors and hypervisors. It's not 'arch/mymachine'. The reason why I only targeted newer processors and 64-bit platforms was because these are the machines that keep growing in size: more memory, cpu count etc. Which, means these are the machine where we can get scalability problem in the future. So this needs to be integrated into the existing TSC/CPU calibration mechanism which works across all supported systems. And that means, that early TSC calibration can't be done before x86_init.oem.arch_setup() and hypervisor_init_platform() have been invoked. By that time, which is still pretty early in the boot process: - boot_cpu_data has been preinitialized - calibration functions setup has been done This sounds good. I will sanity check that we do not spend excessive amount of time before x86_init.oem.arch_setup() is done on some larger machines, and will start early clock sometime after this call is finished. Putting the early TSC initialization after hypervisor_init_platform() makes this available for _ALL_ systems/platforms as infrastructure and allows the platforms to add the required bits of support via a new set of function pointers in struct x86_platform_ops, which btw. is a proper form of abstraction. Ok The time between x86_64_start_kernel() and that point is in the low single digit millisecond range or less than a millisecond and therefor completely irrelevant. The first timestamps before that point will be 0 as they used to be. You can argue in circles about that, it's simply not debatable. Sure, I tried perfection by having timestamps available with the first log message, but your proposed practicality makes sense. But that's only the sched clock part of the problem. If this early sched_clock() is available, then this needs to be fed into the setup of timekeeping as well, otherwise dmesg tells 50 seconds into boot and clock monotonic/boottime say 5, which would be confusing at best. That has not be solved in the first step, but definitely before something like this is going to be merged. Thank you for catching this, I will make sure boottime/dmesg times match. Thank you, Pasha
Re: [v2 0/9] Early boot time stamps for x86
Hi Thomas, Thank you very much for a very insightful feedback. I will address your comments, and if I have any questions, I will ask them before sending out the next patchset. A few replies below: First of all, this "solution" is only valid for a very restricted set of systems and breaks others in very subtle ways. FYI, the architecture name is x86, which handles 32bit/64bit, various vendors and hypervisors. It's not 'arch/mymachine'. The reason why I only targeted newer processors and 64-bit platforms was because these are the machines that keep growing in size: more memory, cpu count etc. Which, means these are the machine where we can get scalability problem in the future. So this needs to be integrated into the existing TSC/CPU calibration mechanism which works across all supported systems. And that means, that early TSC calibration can't be done before x86_init.oem.arch_setup() and hypervisor_init_platform() have been invoked. By that time, which is still pretty early in the boot process: - boot_cpu_data has been preinitialized - calibration functions setup has been done This sounds good. I will sanity check that we do not spend excessive amount of time before x86_init.oem.arch_setup() is done on some larger machines, and will start early clock sometime after this call is finished. Putting the early TSC initialization after hypervisor_init_platform() makes this available for _ALL_ systems/platforms as infrastructure and allows the platforms to add the required bits of support via a new set of function pointers in struct x86_platform_ops, which btw. is a proper form of abstraction. Ok The time between x86_64_start_kernel() and that point is in the low single digit millisecond range or less than a millisecond and therefor completely irrelevant. The first timestamps before that point will be 0 as they used to be. You can argue in circles about that, it's simply not debatable. Sure, I tried perfection by having timestamps available with the first log message, but your proposed practicality makes sense. But that's only the sched clock part of the problem. If this early sched_clock() is available, then this needs to be fed into the setup of timekeeping as well, otherwise dmesg tells 50 seconds into boot and clock monotonic/boottime say 5, which would be confusing at best. That has not be solved in the first step, but definitely before something like this is going to be merged. Thank you for catching this, I will make sure boottime/dmesg times match. Thank you, Pasha
Re: [v2 0/9] Early boot time stamps for x86
On Sat, 25 Mar 2017, Pasha Tatashin wrote: > The second versions was actually meant as a reply to your e-mail: the code > differences were minimal: the main differences were in the cover letter. You > mentioned that it is not necessary to have early boot time stamps, and I > wanted to show examples how this data is useful to track scalability bugs and > avoid future regressions. I did not say that early timestamps are not useful. I said they fall not into the category of things which are absolutely necessary in the earliest boot process. Can you spot the difference? If you want to show me something to prove a point, then why don't you reply to my mail as anybody else does who wants a to start a conversation or discussion? Ignoring my remarks and then sending another hastily cobbled together patch set is not in any way a form of conversation/discussion or follow up to my mail. > Anyway, you asked for some time to think about this problem, I certainly did not ask for some time. ... I told you that I need to find a free time slot to look into that in detail. I told you as well, that some of your decisions (earliest boot code, code duplication etc.) are not acceptable. The fact that I told you that I need to find a free time slot also means that I'm going to tell you in more detail why and what needs to be done. > ... I won't send any replies to this thread for the next two weeks. Not replying to mail for a fixed amount of time is just silly. What are you going to do after that? Resend another half baken patch series which ignores every review comment? The review/feedback mechanism does not work that way. If you get feedback, then you either accept it and act accordingly or you discuss it as a response to that feedback. When that discussion is settled, then it's time to rework stuff and send out a new version after making sure that everything is addressed. > So, please consider this solution. The feature is well abstracted, does > not harm the performance of the fast path, and if necessary it can also > be made optional with something like: CONFIG_HAVE_UNSTABLE_EARLY_CLOCK Well abstracted? I have no idea what your understanding of abstractions is, but it's definitely different from mine. This is not abstracted, it's glued into the code in the 'works for me' mode. It's a proof of concept if at all. And no, this won't have a CONFIG switch to clutter the code even more. First of all, this "solution" is only valid for a very restricted set of systems and breaks others in very subtle ways. FYI, the architecture name is x86, which handles 32bit/64bit, various vendors and hypervisors. It's not 'arch/mymachine'. So this needs to be integrated into the existing TSC/CPU calibration mechanism which works across all supported systems. And that means, that early TSC calibration can't be done before x86_init.oem.arch_setup() and hypervisor_init_platform() have been invoked. By that time, which is still pretty early in the boot process: - boot_cpu_data has been preinitialized - calibration functions setup has been done So this won't ignore decisions made by the various platform quirks and does not require pointless copies of existing code and absurd hacks to make it work. Also at this point the most fragile parts of the boot process have been handled. Putting the early TSC initialization after hypervisor_init_platform() makes this available for _ALL_ systems/platforms as infrastructure and allows the platforms to add the required bits of support via a new set of function pointers in struct x86_platform_ops, which btw. is a proper form of abstraction. For some of them it's just switching the calibration function to the early one, for others this won't be possible ever. The time between x86_64_start_kernel() and that point is in the low single digit millisecond range or less than a millisecond and therefor completely irrelevant. The first timestamps before that point will be 0 as they used to be. You can argue in circles about that, it's simply not debatable. Doing that allows to simply split the existing TSC init into an early and late part and that late part is going to switch over from early sched clock to the regular one w/o any _fini() calls or other absurdities. The same applies to the sched core clock code. This can be integrated w/o all that extra early/fini() hackery cleanly. All it takes is a weak sched_clock_early() function which returns 0 to start with. You probably can figure out how to overload it. But that's only the sched clock part of the problem. If this early sched_clock() is available, then this needs to be fed into the setup of timekeeping as well, otherwise dmesg tells 50 seconds into boot and clock monotonic/boottime say 5, which would be confusing at best. That has not be solved in the first step, but definitely before something like this is going to be merged. Thanks, tglx
Re: [v2 0/9] Early boot time stamps for x86
On Sat, 25 Mar 2017, Pasha Tatashin wrote: > The second versions was actually meant as a reply to your e-mail: the code > differences were minimal: the main differences were in the cover letter. You > mentioned that it is not necessary to have early boot time stamps, and I > wanted to show examples how this data is useful to track scalability bugs and > avoid future regressions. I did not say that early timestamps are not useful. I said they fall not into the category of things which are absolutely necessary in the earliest boot process. Can you spot the difference? If you want to show me something to prove a point, then why don't you reply to my mail as anybody else does who wants a to start a conversation or discussion? Ignoring my remarks and then sending another hastily cobbled together patch set is not in any way a form of conversation/discussion or follow up to my mail. > Anyway, you asked for some time to think about this problem, I certainly did not ask for some time. ... I told you that I need to find a free time slot to look into that in detail. I told you as well, that some of your decisions (earliest boot code, code duplication etc.) are not acceptable. The fact that I told you that I need to find a free time slot also means that I'm going to tell you in more detail why and what needs to be done. > ... I won't send any replies to this thread for the next two weeks. Not replying to mail for a fixed amount of time is just silly. What are you going to do after that? Resend another half baken patch series which ignores every review comment? The review/feedback mechanism does not work that way. If you get feedback, then you either accept it and act accordingly or you discuss it as a response to that feedback. When that discussion is settled, then it's time to rework stuff and send out a new version after making sure that everything is addressed. > So, please consider this solution. The feature is well abstracted, does > not harm the performance of the fast path, and if necessary it can also > be made optional with something like: CONFIG_HAVE_UNSTABLE_EARLY_CLOCK Well abstracted? I have no idea what your understanding of abstractions is, but it's definitely different from mine. This is not abstracted, it's glued into the code in the 'works for me' mode. It's a proof of concept if at all. And no, this won't have a CONFIG switch to clutter the code even more. First of all, this "solution" is only valid for a very restricted set of systems and breaks others in very subtle ways. FYI, the architecture name is x86, which handles 32bit/64bit, various vendors and hypervisors. It's not 'arch/mymachine'. So this needs to be integrated into the existing TSC/CPU calibration mechanism which works across all supported systems. And that means, that early TSC calibration can't be done before x86_init.oem.arch_setup() and hypervisor_init_platform() have been invoked. By that time, which is still pretty early in the boot process: - boot_cpu_data has been preinitialized - calibration functions setup has been done So this won't ignore decisions made by the various platform quirks and does not require pointless copies of existing code and absurd hacks to make it work. Also at this point the most fragile parts of the boot process have been handled. Putting the early TSC initialization after hypervisor_init_platform() makes this available for _ALL_ systems/platforms as infrastructure and allows the platforms to add the required bits of support via a new set of function pointers in struct x86_platform_ops, which btw. is a proper form of abstraction. For some of them it's just switching the calibration function to the early one, for others this won't be possible ever. The time between x86_64_start_kernel() and that point is in the low single digit millisecond range or less than a millisecond and therefor completely irrelevant. The first timestamps before that point will be 0 as they used to be. You can argue in circles about that, it's simply not debatable. Doing that allows to simply split the existing TSC init into an early and late part and that late part is going to switch over from early sched clock to the regular one w/o any _fini() calls or other absurdities. The same applies to the sched core clock code. This can be integrated w/o all that extra early/fini() hackery cleanly. All it takes is a weak sched_clock_early() function which returns 0 to start with. You probably can figure out how to overload it. But that's only the sched clock part of the problem. If this early sched_clock() is available, then this needs to be fed into the setup of timekeeping as well, otherwise dmesg tells 50 seconds into boot and clock monotonic/boottime say 5, which would be confusing at best. That has not be solved in the first step, but definitely before something like this is going to be merged. Thanks, tglx
Re: [v2 0/9] Early boot time stamps for x86
Hi Thomas, The second versions was actually meant as a reply to your e-mail: the code differences were minimal: the main differences were in the cover letter. You mentioned that it is not necessary to have early boot time stamps, and I wanted to show examples how this data is useful to track scalability bugs and avoid future regressions. Anyway, you asked for some time to think about this problem, I won't send any replies to this thread for the next two weeks. So, please consider this solution. The feature is well abstracted, does not harm the performance of the fast path, and if necessary it can also be made optional with something like: CONFIG_HAVE_UNSTABLE_EARLY_CLOCK Thank you, Pasha On 03/25/2017 06:25 AM, Thomas Gleixner wrote: On Fri, 24 Mar 2017, Pavel Tatashin wrote: changelog - v1 - v2 In patch "x86/tsc: tsc early": - added tsc_adjusted_early() - fixed 32-bit compile error use do_div() Did you actually read my last reply on V1 of this? I made it entirely clear that the way this is done, i.e. hacking it into the earliest boost stage is not going to happen. Further I asked you to hold off until I found some time to look into this in detail. So what's the point of ignoring what I said and resending the whole lot with some more hackery applied? I don't care about you wasting your time, but I very much care about my time. Thanks, tglx
Re: [v2 0/9] Early boot time stamps for x86
Hi Thomas, The second versions was actually meant as a reply to your e-mail: the code differences were minimal: the main differences were in the cover letter. You mentioned that it is not necessary to have early boot time stamps, and I wanted to show examples how this data is useful to track scalability bugs and avoid future regressions. Anyway, you asked for some time to think about this problem, I won't send any replies to this thread for the next two weeks. So, please consider this solution. The feature is well abstracted, does not harm the performance of the fast path, and if necessary it can also be made optional with something like: CONFIG_HAVE_UNSTABLE_EARLY_CLOCK Thank you, Pasha On 03/25/2017 06:25 AM, Thomas Gleixner wrote: On Fri, 24 Mar 2017, Pavel Tatashin wrote: changelog - v1 - v2 In patch "x86/tsc: tsc early": - added tsc_adjusted_early() - fixed 32-bit compile error use do_div() Did you actually read my last reply on V1 of this? I made it entirely clear that the way this is done, i.e. hacking it into the earliest boost stage is not going to happen. Further I asked you to hold off until I found some time to look into this in detail. So what's the point of ignoring what I said and resending the whole lot with some more hackery applied? I don't care about you wasting your time, but I very much care about my time. Thanks, tglx
Re: [v2 0/9] Early boot time stamps for x86
On Fri, 24 Mar 2017, Pavel Tatashin wrote: > changelog > - > v1 - v2 > In patch "x86/tsc: tsc early": > - added tsc_adjusted_early() > - fixed 32-bit compile error use do_div() Did you actually read my last reply on V1 of this? I made it entirely clear that the way this is done, i.e. hacking it into the earliest boost stage is not going to happen. Further I asked you to hold off until I found some time to look into this in detail. So what's the point of ignoring what I said and resending the whole lot with some more hackery applied? I don't care about you wasting your time, but I very much care about my time. Thanks, tglx
Re: [v2 0/9] Early boot time stamps for x86
On Fri, 24 Mar 2017, Pavel Tatashin wrote: > changelog > - > v1 - v2 > In patch "x86/tsc: tsc early": > - added tsc_adjusted_early() > - fixed 32-bit compile error use do_div() Did you actually read my last reply on V1 of this? I made it entirely clear that the way this is done, i.e. hacking it into the earliest boost stage is not going to happen. Further I asked you to hold off until I found some time to look into this in detail. So what's the point of ignoring what I said and resending the whole lot with some more hackery applied? I don't care about you wasting your time, but I very much care about my time. Thanks, tglx
[v2 0/9] Early boot time stamps for x86
changelog - v1 - v2 In patch "x86/tsc: tsc early": - added tsc_adjusted_early() - fixed 32-bit compile error use do_div() Adding early boot time stamps support for x86 machines. SPARC patch see here: http://www.spinics.net/lists/sparclinux/msg17372.html sample output - Before: https://hastebin.com/zofiqazuze.scala After: https://hastebin.com/otayoliruc.scala example why early time is crucial - Here is boot log on x86 machine with 8 E7-8895 CPUs, and 1T of memory: https://hastebin.com/ebidosafic.pas We want to improve boot time, by deferring vmemmap initialization to when the secondary CPUs are online: https://hastebin.com/migajadiju.pas We see in the log that struct pages are initialized in parallel: [ 19.446609] node 0 initialised, 31980145 pages in 530ms ... But, we have no idea if we saved overall boot time or not by enabling this feature, because the total boot time as shown currently has not changed much, and is actually increased from: 18.99s to 19.52s Now, with this feature we would see right away that the time is actually decreased: Before: https://hastebin.com/ivobehahud.pas After: https://hastebin.com/uvifasohon.pas Boot time reduced from 40.68s to 31.41s, we saved 30% by enabling deferred page initialization feature, great. However, now, we have early boot time stamps, and we can analyze even further, and see that we still spend a lot of time in this interval: [0.009071] kexec_core: crashkernel: memory value expected [ 11.255710] Zone ranges: And, this is where this fix comes handy: https://lkml.org/lkml/2017/3/23/864 The final boot time on this machine is this: https://hastebin.com/anodiqaguj.pas So, 20.7 seconds vs. 40.68s before, and we can show it with the early boot time stamps, and avoid future regressions by having this data always available to us. Pavel Tatashin (9): sched/clock: broken stable to unstable transfer sched/clock: interface to allow timestamps early in boot x86/cpu: determining x86 vendor early x86/tsc: early MSR-based CPU/TSC frequency discovery x86/tsc: disable early messages from quick_pit_calibrate x86/tsc: use cpuid to determine TSC frequency x86/tsc: use cpuid to determine CPU frequency x86/tsc: tsc early x86/tsc: use tsc early arch/x86/include/asm/processor.h | 1 + arch/x86/include/asm/tsc.h | 5 ++ arch/x86/kernel/cpu/common.c | 36 arch/x86/kernel/head64.c | 1 + arch/x86/kernel/time.c | 1 + arch/x86/kernel/tsc.c| 186 ++- arch/x86/kernel/tsc_msr.c| 38 +--- include/linux/sched/clock.h | 4 + kernel/sched/clock.c | 70 ++- 9 files changed, 304 insertions(+), 38 deletions(-) -- 1.8.3.1
[v2 0/9] Early boot time stamps for x86
changelog - v1 - v2 In patch "x86/tsc: tsc early": - added tsc_adjusted_early() - fixed 32-bit compile error use do_div() Adding early boot time stamps support for x86 machines. SPARC patch see here: http://www.spinics.net/lists/sparclinux/msg17372.html sample output - Before: https://hastebin.com/zofiqazuze.scala After: https://hastebin.com/otayoliruc.scala example why early time is crucial - Here is boot log on x86 machine with 8 E7-8895 CPUs, and 1T of memory: https://hastebin.com/ebidosafic.pas We want to improve boot time, by deferring vmemmap initialization to when the secondary CPUs are online: https://hastebin.com/migajadiju.pas We see in the log that struct pages are initialized in parallel: [ 19.446609] node 0 initialised, 31980145 pages in 530ms ... But, we have no idea if we saved overall boot time or not by enabling this feature, because the total boot time as shown currently has not changed much, and is actually increased from: 18.99s to 19.52s Now, with this feature we would see right away that the time is actually decreased: Before: https://hastebin.com/ivobehahud.pas After: https://hastebin.com/uvifasohon.pas Boot time reduced from 40.68s to 31.41s, we saved 30% by enabling deferred page initialization feature, great. However, now, we have early boot time stamps, and we can analyze even further, and see that we still spend a lot of time in this interval: [0.009071] kexec_core: crashkernel: memory value expected [ 11.255710] Zone ranges: And, this is where this fix comes handy: https://lkml.org/lkml/2017/3/23/864 The final boot time on this machine is this: https://hastebin.com/anodiqaguj.pas So, 20.7 seconds vs. 40.68s before, and we can show it with the early boot time stamps, and avoid future regressions by having this data always available to us. Pavel Tatashin (9): sched/clock: broken stable to unstable transfer sched/clock: interface to allow timestamps early in boot x86/cpu: determining x86 vendor early x86/tsc: early MSR-based CPU/TSC frequency discovery x86/tsc: disable early messages from quick_pit_calibrate x86/tsc: use cpuid to determine TSC frequency x86/tsc: use cpuid to determine CPU frequency x86/tsc: tsc early x86/tsc: use tsc early arch/x86/include/asm/processor.h | 1 + arch/x86/include/asm/tsc.h | 5 ++ arch/x86/kernel/cpu/common.c | 36 arch/x86/kernel/head64.c | 1 + arch/x86/kernel/time.c | 1 + arch/x86/kernel/tsc.c| 186 ++- arch/x86/kernel/tsc_msr.c| 38 +--- include/linux/sched/clock.h | 4 + kernel/sched/clock.c | 70 ++- 9 files changed, 304 insertions(+), 38 deletions(-) -- 1.8.3.1