Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/18/2016 06:16 AM, Thomas Gleixner wrote: > On Mon, 14 Nov 2016, Boris Ostrovsky wrote: >> On 11/13/2016 06:42 PM, M. Vefa Bicakci wrote: >> >>> I found out that my domU kernels invoke the 'apic_disable' function >>> because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, >>> which would cause the 'smp_found_config' bit to be unset at boot-up. >> smp_found_config is not the problem, it is usually zero for Xen PV guests. >> >> What is the problem is that because of your particular config selection >> acpi_mps_check() fails (with the error message that you mention below) and >> that leads to X86_FEATURE_APIC being cleared. And then we indeed switch to >> APIC noop and things go south after that. > Indeed. And what really puzzles me is that Xen manages to bring up a > secondary CPU despite APIC being disabled. PV guests bring secondary VCPUs up using hypercalls (see xen_cpu_up()). > There are quite some assumptions about no APIC == no SMP in all of x86. Can > we please make Xen behave like anything else? > I will try to see if we can improve APIC emulation for these guests. Unfortunately it will have to be done on kernel side since we still need to support older Xen versions. But as I said earlier, the right answer to this is PVH. -boris
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/18/2016 06:16 AM, Thomas Gleixner wrote: > On Mon, 14 Nov 2016, Boris Ostrovsky wrote: >> On 11/13/2016 06:42 PM, M. Vefa Bicakci wrote: >> >>> I found out that my domU kernels invoke the 'apic_disable' function >>> because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, >>> which would cause the 'smp_found_config' bit to be unset at boot-up. >> smp_found_config is not the problem, it is usually zero for Xen PV guests. >> >> What is the problem is that because of your particular config selection >> acpi_mps_check() fails (with the error message that you mention below) and >> that leads to X86_FEATURE_APIC being cleared. And then we indeed switch to >> APIC noop and things go south after that. > Indeed. And what really puzzles me is that Xen manages to bring up a > secondary CPU despite APIC being disabled. PV guests bring secondary VCPUs up using hypercalls (see xen_cpu_up()). > There are quite some assumptions about no APIC == no SMP in all of x86. Can > we please make Xen behave like anything else? > I will try to see if we can improve APIC emulation for these guests. Unfortunately it will have to be done on kernel side since we still need to support older Xen versions. But as I said earlier, the right answer to this is PVH. -boris
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Mon, 14 Nov 2016, Boris Ostrovsky wrote: > On 11/13/2016 06:42 PM, M. Vefa Bicakci wrote: > > > I found out that my domU kernels invoke the 'apic_disable' function > > because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, > > which would cause the 'smp_found_config' bit to be unset at boot-up. > > smp_found_config is not the problem, it is usually zero for Xen PV guests. > > What is the problem is that because of your particular config selection > acpi_mps_check() fails (with the error message that you mention below) and > that leads to X86_FEATURE_APIC being cleared. And then we indeed switch to > APIC noop and things go south after that. Indeed. And what really puzzles me is that Xen manages to bring up a secondary CPU despite APIC being disabled. There are quite some assumptions about no APIC == no SMP in all of x86. Can we please make Xen behave like anything else? Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Mon, 14 Nov 2016, Boris Ostrovsky wrote: > On 11/13/2016 06:42 PM, M. Vefa Bicakci wrote: > > > I found out that my domU kernels invoke the 'apic_disable' function > > because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, > > which would cause the 'smp_found_config' bit to be unset at boot-up. > > smp_found_config is not the problem, it is usually zero for Xen PV guests. > > What is the problem is that because of your particular config selection > acpi_mps_check() fails (with the error message that you mention below) and > that leads to X86_FEATURE_APIC being cleared. And then we indeed switch to > APIC noop and things go south after that. Indeed. And what really puzzles me is that Xen manages to bring up a secondary CPU despite APIC being disabled. There are quite some assumptions about no APIC == no SMP in all of x86. Can we please make Xen behave like anything else? Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/13/2016 06:42 PM, M. Vefa Bicakci wrote: I found out that my domU kernels invoke the 'apic_disable' function because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, which would cause the 'smp_found_config' bit to be unset at boot-up. smp_found_config is not the problem, it is usually zero for Xen PV guests. What is the problem is that because of your particular config selection acpi_mps_check() fails (with the error message that you mention below) and that leads to X86_FEATURE_APIC being cleared. And then we indeed switch to APIC noop and things go south after that. -boris This would cause 'init_apic_mappings' to call 'apic_disable', which would cause Xen's 'apic' ops structure pointer to be replaced with the no-op APIC ops structure's pointer. The use of the no-op APIC ops structure would in turn cause invalid virtual CPU package identifiers to be generated. Invalid CPU package identifiers would in turn cause the RAPL module to produce a kernel oops due to potentially missing error handling. It looks like I have been ignoring the following kernel warning which I should have noticed a long time ago: MPS support code is not built-in. Using acpi=off or acpi=noirq or pci=noacpi may have problem To all on this e-mail thread, I learned a bit through this exercise, but I have also taken a lot of everyone's time and created quite a bit of e-mail traffic because of a kernel configuration issue on my end. My apologies. Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/13/2016 06:42 PM, M. Vefa Bicakci wrote: I found out that my domU kernels invoke the 'apic_disable' function because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, which would cause the 'smp_found_config' bit to be unset at boot-up. smp_found_config is not the problem, it is usually zero for Xen PV guests. What is the problem is that because of your particular config selection acpi_mps_check() fails (with the error message that you mention below) and that leads to X86_FEATURE_APIC being cleared. And then we indeed switch to APIC noop and things go south after that. -boris This would cause 'init_apic_mappings' to call 'apic_disable', which would cause Xen's 'apic' ops structure pointer to be replaced with the no-op APIC ops structure's pointer. The use of the no-op APIC ops structure would in turn cause invalid virtual CPU package identifiers to be generated. Invalid CPU package identifiers would in turn cause the RAPL module to produce a kernel oops due to potentially missing error handling. It looks like I have been ignoring the following kernel warning which I should have noticed a long time ago: MPS support code is not built-in. Using acpi=off or acpi=noirq or pci=noacpi may have problem To all on this e-mail thread, I learned a bit through this exercise, but I have also taken a lot of everyone's time and created quite a bit of e-mail traffic because of a kernel configuration issue on my end. My apologies. Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/13/2016 09:04 PM, Boris Ostrovsky wrote: > On 11/12/2016 05:05 PM, M. Vefa Bicakci wrote: >> On 11/10/2016 06:31 PM, Boris Ostrovsky wrote: >>> On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: > On 11/10/2016 06:13 AM, Thomas Gleixner wrote: >> On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: >> >>> I have found that your patch unfortunately does not improve the >>> situation >>> for me. Here is an excerpt obtained from the dmesg of a kernel >>> compiled >>> with this patch *as well as* Sebastian's patch: >>> [0.002561] CPU: Physical Processor ID: 0 >>> [0.002566] CPU: Processor Core ID: 0 >>> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: >>> CPUID: 2 >> So apic->cpu_present_to_apicid() gives us a completely bogus APIC id >> which >> translates to a bogus package id. And looking at the XEN code: >> >>xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, >> >> and xen_cpu_present_to_apicid does: >> >> static int xen_cpu_present_to_apicid(int cpu) >> { >> if (cpu_present(cpu)) >> return xen_get_apic_id(xen_apic_read(APIC_ID)); >> else >> return BAD_APICID; >> } >> >> So independent of which present CPU we query we get just some random >> information, in the above case we get BAD_APICID from >> xen_apic_read() not >> from the else path as this CPU _IS_ present. >> >> What's so wrong with storing the fricking firmware supplied APICid as >> everybody else does and report it back when queried? > By firmware you mean ACPI? It is most likely not available to PV guests. > How about returning cpu_data(cpu).initial_apicid? > > And what was the original problem? The original issue I found was that VMware was returning a different set of APIC id's in the ACPI tables than what it advertised on the CPU's. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html >>> For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map >>> for PV VCPUs") to at least temporarily work around some topology map >>> problems that PV guests have with RAPL (which I think is what Vefa's >>> problem was). >> Hello Boris, >> >> (Sorry for the delay!) >> >> It appears that the problem is a bit different compared to the one >> corrected by a6a198bc60e6, because my kernel tree -- based on 4.8.6 -- >> already includes the -stable backport of that commit, i.e. >> 88540ad0820ddfb05626e0136c0e5a79cea85fd1 >> >> The patch I included in my previous e-mail (dated 2016-11-10) corrects >> root cause of the issue I am having with 4.8.6. Sebastian's original >> patch adding error checking to the RAPL module prevents the RAPL module >> from causing a kernel oops without my patch. > > I don't see any messages from you on that date. Can you provide a link > to it (and to Sebastian's patch)? > > (BTW, generally it's a good idea to copy xen-devel list on any > Xen-related issues). As I explain below, it turns out that my issue was 'only' a kernel configuration issue. For reference, I had unknowingly solved my kernel-configuration-induced issue via the patch at: https://marc.info/?l=linux-kernel=147875027314638=2 Sebastian's patch (which adds error handling to the RAPL module) is at: https://marc.info/?l=linux-kernel=147739814217598=2 >> The issue I am experiencing is caused by the boot-up code in the >> 'init_apic_mappings' function switching the APIC ops structure from >> Xen's structure to a no-op structure by calling the 'apic_disable' >> function. Please let me know if I can clarify or elaborate. > > apic_disable() is only invoked if there is no APIC present (i.e. > detect_init_APIC() returns a non-zero value) and I don't think this can > happen. Is your CPUID[1].edx[9] not set? I found out that my domU kernels invoke the 'apic_disable' function because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, which would cause the 'smp_found_config' bit to be unset at boot-up. This would cause 'init_apic_mappings' to call 'apic_disable', which would cause Xen's 'apic' ops structure pointer to be replaced with the no-op APIC ops structure's pointer. The use of the no-op APIC ops structure would in turn cause invalid virtual CPU package identifiers to be generated. Invalid CPU package identifiers would in turn cause the RAPL module to produce a kernel oops due to potentially missing error handling. It looks like I have been ignoring the following kernel warning which I should have noticed a long time ago: MPS support code is not built-in. Using acpi=off or acpi=noirq or pci=noacpi may have problem To all on this e-mail thread, I learned a bit through this exercise, but I have also taken a lot of everyone's time and created quite a bit of e-mail traffic because of a kernel
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/13/2016 09:04 PM, Boris Ostrovsky wrote: > On 11/12/2016 05:05 PM, M. Vefa Bicakci wrote: >> On 11/10/2016 06:31 PM, Boris Ostrovsky wrote: >>> On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: > On 11/10/2016 06:13 AM, Thomas Gleixner wrote: >> On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: >> >>> I have found that your patch unfortunately does not improve the >>> situation >>> for me. Here is an excerpt obtained from the dmesg of a kernel >>> compiled >>> with this patch *as well as* Sebastian's patch: >>> [0.002561] CPU: Physical Processor ID: 0 >>> [0.002566] CPU: Processor Core ID: 0 >>> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: >>> CPUID: 2 >> So apic->cpu_present_to_apicid() gives us a completely bogus APIC id >> which >> translates to a bogus package id. And looking at the XEN code: >> >>xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, >> >> and xen_cpu_present_to_apicid does: >> >> static int xen_cpu_present_to_apicid(int cpu) >> { >> if (cpu_present(cpu)) >> return xen_get_apic_id(xen_apic_read(APIC_ID)); >> else >> return BAD_APICID; >> } >> >> So independent of which present CPU we query we get just some random >> information, in the above case we get BAD_APICID from >> xen_apic_read() not >> from the else path as this CPU _IS_ present. >> >> What's so wrong with storing the fricking firmware supplied APICid as >> everybody else does and report it back when queried? > By firmware you mean ACPI? It is most likely not available to PV guests. > How about returning cpu_data(cpu).initial_apicid? > > And what was the original problem? The original issue I found was that VMware was returning a different set of APIC id's in the ACPI tables than what it advertised on the CPU's. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html >>> For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map >>> for PV VCPUs") to at least temporarily work around some topology map >>> problems that PV guests have with RAPL (which I think is what Vefa's >>> problem was). >> Hello Boris, >> >> (Sorry for the delay!) >> >> It appears that the problem is a bit different compared to the one >> corrected by a6a198bc60e6, because my kernel tree -- based on 4.8.6 -- >> already includes the -stable backport of that commit, i.e. >> 88540ad0820ddfb05626e0136c0e5a79cea85fd1 >> >> The patch I included in my previous e-mail (dated 2016-11-10) corrects >> root cause of the issue I am having with 4.8.6. Sebastian's original >> patch adding error checking to the RAPL module prevents the RAPL module >> from causing a kernel oops without my patch. > > I don't see any messages from you on that date. Can you provide a link > to it (and to Sebastian's patch)? > > (BTW, generally it's a good idea to copy xen-devel list on any > Xen-related issues). As I explain below, it turns out that my issue was 'only' a kernel configuration issue. For reference, I had unknowingly solved my kernel-configuration-induced issue via the patch at: https://marc.info/?l=linux-kernel=147875027314638=2 Sebastian's patch (which adds error handling to the RAPL module) is at: https://marc.info/?l=linux-kernel=147739814217598=2 >> The issue I am experiencing is caused by the boot-up code in the >> 'init_apic_mappings' function switching the APIC ops structure from >> Xen's structure to a no-op structure by calling the 'apic_disable' >> function. Please let me know if I can clarify or elaborate. > > apic_disable() is only invoked if there is no APIC present (i.e. > detect_init_APIC() returns a non-zero value) and I don't think this can > happen. Is your CPUID[1].edx[9] not set? I found out that my domU kernels invoke the 'apic_disable' function because CONFIG_X86_MPPARSE was not enabled in my kernel configuration, which would cause the 'smp_found_config' bit to be unset at boot-up. This would cause 'init_apic_mappings' to call 'apic_disable', which would cause Xen's 'apic' ops structure pointer to be replaced with the no-op APIC ops structure's pointer. The use of the no-op APIC ops structure would in turn cause invalid virtual CPU package identifiers to be generated. Invalid CPU package identifiers would in turn cause the RAPL module to produce a kernel oops due to potentially missing error handling. It looks like I have been ignoring the following kernel warning which I should have noticed a long time ago: MPS support code is not built-in. Using acpi=off or acpi=noirq or pci=noacpi may have problem To all on this e-mail thread, I learned a bit through this exercise, but I have also taken a lot of everyone's time and created quite a bit of e-mail traffic because of a kernel
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/12/2016 05:05 PM, M. Vefa Bicakci wrote: > On 11/10/2016 06:31 PM, Boris Ostrovsky wrote: >> On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: >>> >>> On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: On 11/10/2016 06:13 AM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > >> I have found that your patch unfortunately does not improve the >> situation >> for me. Here is an excerpt obtained from the dmesg of a kernel >> compiled >> with this patch *as well as* Sebastian's patch: >> [0.002561] CPU: Physical Processor ID: 0 >> [0.002566] CPU: Processor Core ID: 0 >> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: >> CPUID: 2 > So apic->cpu_present_to_apicid() gives us a completely bogus APIC id > which > translates to a bogus package id. And looking at the XEN code: > >xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, > > and xen_cpu_present_to_apicid does: > > static int xen_cpu_present_to_apicid(int cpu) > { > if (cpu_present(cpu)) > return xen_get_apic_id(xen_apic_read(APIC_ID)); > else > return BAD_APICID; > } > > So independent of which present CPU we query we get just some random > information, in the above case we get BAD_APICID from > xen_apic_read() not > from the else path as this CPU _IS_ present. > > What's so wrong with storing the fricking firmware supplied APICid as > everybody else does and report it back when queried? By firmware you mean ACPI? It is most likely not available to PV guests. How about returning cpu_data(cpu).initial_apicid? And what was the original problem? >>> The original issue I found was that VMware was returning a different set >>> of APIC id's in the ACPI tables than what it advertised on the CPU's. >>> >>> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html >> For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map >> for PV VCPUs") to at least temporarily work around some topology map >> problems that PV guests have with RAPL (which I think is what Vefa's >> problem was). > Hello Boris, > > (Sorry for the delay!) > > It appears that the problem is a bit different compared to the one > corrected by a6a198bc60e6, because my kernel tree -- based on 4.8.6 -- > already includes the -stable backport of that commit, i.e. > 88540ad0820ddfb05626e0136c0e5a79cea85fd1 > > The patch I included in my previous e-mail (dated 2016-11-10) corrects > root cause of the issue I am having with 4.8.6. Sebastian's original > patch adding error checking to the RAPL module prevents the RAPL module > from causing a kernel oops without my patch. I don't see any messages from you on that date. Can you provide a link to it (and to Sebastian's patch)? (BTW, generally it's a good idea to copy xen-devel list on any Xen-related issues). > > The issue I am experiencing is caused by the boot-up code in the > 'init_apic_mappings' function switching the APIC ops structure from > Xen's structure to a no-op structure by calling the 'apic_disable' > function. Please let me know if I can clarify or elaborate. apic_disable() is only invoked if there is no APIC present (i.e. detect_init_APIC() returns a non-zero value) and I don't think this can happen. Is your CPUID[1].edx[9] not set? -boris > > For the record, using 4.8.7 without my correction patch patch does not > rectify the issue at hand. 4.8.7 changes the call site of the > 'init_apic_mapping' function, so I had thought that it could be helpful. > > Thank you, > > Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/12/2016 05:05 PM, M. Vefa Bicakci wrote: > On 11/10/2016 06:31 PM, Boris Ostrovsky wrote: >> On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: >>> >>> On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: On 11/10/2016 06:13 AM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > >> I have found that your patch unfortunately does not improve the >> situation >> for me. Here is an excerpt obtained from the dmesg of a kernel >> compiled >> with this patch *as well as* Sebastian's patch: >> [0.002561] CPU: Physical Processor ID: 0 >> [0.002566] CPU: Processor Core ID: 0 >> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: >> CPUID: 2 > So apic->cpu_present_to_apicid() gives us a completely bogus APIC id > which > translates to a bogus package id. And looking at the XEN code: > >xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, > > and xen_cpu_present_to_apicid does: > > static int xen_cpu_present_to_apicid(int cpu) > { > if (cpu_present(cpu)) > return xen_get_apic_id(xen_apic_read(APIC_ID)); > else > return BAD_APICID; > } > > So independent of which present CPU we query we get just some random > information, in the above case we get BAD_APICID from > xen_apic_read() not > from the else path as this CPU _IS_ present. > > What's so wrong with storing the fricking firmware supplied APICid as > everybody else does and report it back when queried? By firmware you mean ACPI? It is most likely not available to PV guests. How about returning cpu_data(cpu).initial_apicid? And what was the original problem? >>> The original issue I found was that VMware was returning a different set >>> of APIC id's in the ACPI tables than what it advertised on the CPU's. >>> >>> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html >> For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map >> for PV VCPUs") to at least temporarily work around some topology map >> problems that PV guests have with RAPL (which I think is what Vefa's >> problem was). > Hello Boris, > > (Sorry for the delay!) > > It appears that the problem is a bit different compared to the one > corrected by a6a198bc60e6, because my kernel tree -- based on 4.8.6 -- > already includes the -stable backport of that commit, i.e. > 88540ad0820ddfb05626e0136c0e5a79cea85fd1 > > The patch I included in my previous e-mail (dated 2016-11-10) corrects > root cause of the issue I am having with 4.8.6. Sebastian's original > patch adding error checking to the RAPL module prevents the RAPL module > from causing a kernel oops without my patch. I don't see any messages from you on that date. Can you provide a link to it (and to Sebastian's patch)? (BTW, generally it's a good idea to copy xen-devel list on any Xen-related issues). > > The issue I am experiencing is caused by the boot-up code in the > 'init_apic_mappings' function switching the APIC ops structure from > Xen's structure to a no-op structure by calling the 'apic_disable' > function. Please let me know if I can clarify or elaborate. apic_disable() is only invoked if there is no APIC present (i.e. detect_init_APIC() returns a non-zero value) and I don't think this can happen. Is your CPUID[1].edx[9] not set? -boris > > For the record, using 4.8.7 without my correction patch patch does not > rectify the issue at hand. 4.8.7 changes the call site of the > 'init_apic_mapping' function, so I had thought that it could be helpful. > > Thank you, > > Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 01:50 PM, Charles (Chas) Williams wrote: > > > On 11/09/2016 10:57 PM, M. Vefa Bicakci wrote: >> [0.002000] mvb: CPU: Physical Processor ID: 0 >> [0.002000] mvb: CPU: Processor Core ID: 0 >> [0.002000] mvb: identify_cpu:1112: c: 880013b0a040, >> c->logical_proc_id: 65535 >> [0.002000] mvb: __default_cpu_present_to_apicid:612: Returning >> 65535! mps_cpu: 1, nr_cpu_ids: 2, cpu_present(mps_cpu): 1 >> [0.002000] smpboot: mvb: topology_update_package_map:270: cpu: 1, >> pkg: 4095 >> [0.002000] smpboot: APIC() Converting physical 4095 to logical >> package 0 >> [0.002000] smpboot: mvb: topology_update_package_map:305: cpu: 1, >> cpu_data(cpu).logical_proc_id: 0 > > This seems strange. 0x is BAD_APICID. Why didn't this fail here: > > for_each_present_cpu(cpu) { > unsigned int apicid = apic->cpu_present_to_apicid(cpu); > > if (apicid == BAD_APICID || > !apic->apic_id_valid(apicid)) << > continue; > if (!topology_update_package_map(apicid, cpu)) > continue; > > topology_update_package_map() should never have been called? (Sorry for the delay!) It appears that this is a different call path than the one in smp_init_package_map function. I *think* the following call path is the one shown in the dmesg, but I am not sure: cpu_bringup_and_idle cpu_bringup (arch/x86/xen/smp.c) smp_store_cpu_info (this call path branch is included for context) identify_secondary_cpu identify_cpu detect_ht topology_update_package_map Sorry about the potentially misleading dmesg excerpt I posted. Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 06:31 PM, Boris Ostrovsky wrote: > On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: >> >> >> On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: >>> On 11/10/2016 06:13 AM, Thomas Gleixner wrote: On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > I have found that your patch unfortunately does not improve the > situation > for me. Here is an excerpt obtained from the dmesg of a kernel > compiled > with this patch *as well as* Sebastian's patch: > [0.002561] CPU: Physical Processor ID: 0 > [0.002566] CPU: Processor Core ID: 0 > [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: > CPUID: 2 So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which translates to a bogus package id. And looking at the XEN code: xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, and xen_cpu_present_to_apicid does: static int xen_cpu_present_to_apicid(int cpu) { if (cpu_present(cpu)) return xen_get_apic_id(xen_apic_read(APIC_ID)); else return BAD_APICID; } So independent of which present CPU we query we get just some random information, in the above case we get BAD_APICID from xen_apic_read() not from the else path as this CPU _IS_ present. What's so wrong with storing the fricking firmware supplied APICid as everybody else does and report it back when queried? >>> >>> By firmware you mean ACPI? It is most likely not available to PV guests. >>> How about returning cpu_data(cpu).initial_apicid? >>> >>> And what was the original problem? >> >> The original issue I found was that VMware was returning a different set >> of APIC id's in the ACPI tables than what it advertised on the CPU's. >> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html > > For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map > for PV VCPUs") to at least temporarily work around some topology map > problems that PV guests have with RAPL (which I think is what Vefa's > problem was). Hello Boris, (Sorry for the delay!) It appears that the problem is a bit different compared to the one corrected by a6a198bc60e6, because my kernel tree -- based on 4.8.6 -- already includes the -stable backport of that commit, i.e. 88540ad0820ddfb05626e0136c0e5a79cea85fd1 The patch I included in my previous e-mail (dated 2016-11-10) corrects root cause of the issue I am having with 4.8.6. Sebastian's original patch adding error checking to the RAPL module prevents the RAPL module from causing a kernel oops without my patch. The issue I am experiencing is caused by the boot-up code in the 'init_apic_mappings' function switching the APIC ops structure from Xen's structure to a no-op structure by calling the 'apic_disable' function. Please let me know if I can clarify or elaborate. For the record, using 4.8.7 without my correction patch patch does not rectify the issue at hand. 4.8.7 changes the call site of the 'init_apic_mapping' function, so I had thought that it could be helpful. Thank you, Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 01:50 PM, Charles (Chas) Williams wrote: > > > On 11/09/2016 10:57 PM, M. Vefa Bicakci wrote: >> [0.002000] mvb: CPU: Physical Processor ID: 0 >> [0.002000] mvb: CPU: Processor Core ID: 0 >> [0.002000] mvb: identify_cpu:1112: c: 880013b0a040, >> c->logical_proc_id: 65535 >> [0.002000] mvb: __default_cpu_present_to_apicid:612: Returning >> 65535! mps_cpu: 1, nr_cpu_ids: 2, cpu_present(mps_cpu): 1 >> [0.002000] smpboot: mvb: topology_update_package_map:270: cpu: 1, >> pkg: 4095 >> [0.002000] smpboot: APIC() Converting physical 4095 to logical >> package 0 >> [0.002000] smpboot: mvb: topology_update_package_map:305: cpu: 1, >> cpu_data(cpu).logical_proc_id: 0 > > This seems strange. 0x is BAD_APICID. Why didn't this fail here: > > for_each_present_cpu(cpu) { > unsigned int apicid = apic->cpu_present_to_apicid(cpu); > > if (apicid == BAD_APICID || > !apic->apic_id_valid(apicid)) << > continue; > if (!topology_update_package_map(apicid, cpu)) > continue; > > topology_update_package_map() should never have been called? (Sorry for the delay!) It appears that this is a different call path than the one in smp_init_package_map function. I *think* the following call path is the one shown in the dmesg, but I am not sure: cpu_bringup_and_idle cpu_bringup (arch/x86/xen/smp.c) smp_store_cpu_info (this call path branch is included for context) identify_secondary_cpu identify_cpu detect_ht topology_update_package_map Sorry about the potentially misleading dmesg excerpt I posted. Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 06:31 PM, Boris Ostrovsky wrote: > On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: >> >> >> On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: >>> On 11/10/2016 06:13 AM, Thomas Gleixner wrote: On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > I have found that your patch unfortunately does not improve the > situation > for me. Here is an excerpt obtained from the dmesg of a kernel > compiled > with this patch *as well as* Sebastian's patch: > [0.002561] CPU: Physical Processor ID: 0 > [0.002566] CPU: Processor Core ID: 0 > [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: > CPUID: 2 So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which translates to a bogus package id. And looking at the XEN code: xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, and xen_cpu_present_to_apicid does: static int xen_cpu_present_to_apicid(int cpu) { if (cpu_present(cpu)) return xen_get_apic_id(xen_apic_read(APIC_ID)); else return BAD_APICID; } So independent of which present CPU we query we get just some random information, in the above case we get BAD_APICID from xen_apic_read() not from the else path as this CPU _IS_ present. What's so wrong with storing the fricking firmware supplied APICid as everybody else does and report it back when queried? >>> >>> By firmware you mean ACPI? It is most likely not available to PV guests. >>> How about returning cpu_data(cpu).initial_apicid? >>> >>> And what was the original problem? >> >> The original issue I found was that VMware was returning a different set >> of APIC id's in the ACPI tables than what it advertised on the CPU's. >> >> http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html > > For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map > for PV VCPUs") to at least temporarily work around some topology map > problems that PV guests have with RAPL (which I think is what Vefa's > problem was). Hello Boris, (Sorry for the delay!) It appears that the problem is a bit different compared to the one corrected by a6a198bc60e6, because my kernel tree -- based on 4.8.6 -- already includes the -stable backport of that commit, i.e. 88540ad0820ddfb05626e0136c0e5a79cea85fd1 The patch I included in my previous e-mail (dated 2016-11-10) corrects root cause of the issue I am having with 4.8.6. Sebastian's original patch adding error checking to the RAPL module prevents the RAPL module from causing a kernel oops without my patch. The issue I am experiencing is caused by the boot-up code in the 'init_apic_mappings' function switching the APIC ops structure from Xen's structure to a no-op structure by calling the 'apic_disable' function. Please let me know if I can clarify or elaborate. For the record, using 4.8.7 without my correction patch patch does not rectify the issue at hand. 4.8.7 changes the call site of the 'init_apic_mapping' function, so I had thought that it could be helpful. Thank you, Vefa
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 12:13 PM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, Boris Ostrovsky wrote: >> On 11/10/2016 10:12 AM, Thomas Gleixner wrote: >>> On Thu, 10 Nov 2016, Boris Ostrovsky wrote: By firmware you mean ACPI? It is most likely not available to PV guests. >>> You either have to provide ACPI or MP tables. And either of those has to >>> provide the intial APIC ids for the CPUs. They are supposed to match the >>> IDs which are in the CPUID leafs. >>> How about returning cpu_data(cpu).initial_apicid? >>> For a not yet brought up CPU. That's what the initial ACPI/MP table >>> enumeration is for. >> Unfortunately PV guests have neither. So we may need to emulate >> something in xen_cpu_present_to_apicid(). > SFI does the same thing and according to the dmesg which was posted, this > is using SFI. We also have devicetree based boot concept which provides the > APICids in the CPU enumeration at boot time in a way which the whole x86 > machinery is expecting. > > So what kind of APICid is XEN handing in via SFI? None, or just an invalid > one? None, SFI is not used at all. Or device tree for that matter. The (PV) guest currently assumes APIC ID (i.e. APIC[0x20]) to always be zero (we just fixed a bug where CPUID may return an inconsistent/non-zero initial APIC ID value, but that's a different issue I think). There is no topology, really, everything is flat. So this obviously is rather, ahem, fragile and we've been fixing bugs there, especially with introduction of package map (a6a198bc60e6 that I mentioned in the other message is one such example). There is work scheduled on the hypervisor side to properly report initial APIC IDs in CPUID and that may help with correctly dealing with this on Linux side. (Longer term we would like to move to PVH where we will have "true" APIC emulation, with ACPI, just like in HVM guests. You can do your part by helping: http://marc.info/?l=linux-kernel=147791731414152=3 ;-)) -boris
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 12:13 PM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, Boris Ostrovsky wrote: >> On 11/10/2016 10:12 AM, Thomas Gleixner wrote: >>> On Thu, 10 Nov 2016, Boris Ostrovsky wrote: By firmware you mean ACPI? It is most likely not available to PV guests. >>> You either have to provide ACPI or MP tables. And either of those has to >>> provide the intial APIC ids for the CPUs. They are supposed to match the >>> IDs which are in the CPUID leafs. >>> How about returning cpu_data(cpu).initial_apicid? >>> For a not yet brought up CPU. That's what the initial ACPI/MP table >>> enumeration is for. >> Unfortunately PV guests have neither. So we may need to emulate >> something in xen_cpu_present_to_apicid(). > SFI does the same thing and according to the dmesg which was posted, this > is using SFI. We also have devicetree based boot concept which provides the > APICids in the CPU enumeration at boot time in a way which the whole x86 > machinery is expecting. > > So what kind of APICid is XEN handing in via SFI? None, or just an invalid > one? None, SFI is not used at all. Or device tree for that matter. The (PV) guest currently assumes APIC ID (i.e. APIC[0x20]) to always be zero (we just fixed a bug where CPUID may return an inconsistent/non-zero initial APIC ID value, but that's a different issue I think). There is no topology, really, everything is flat. So this obviously is rather, ahem, fragile and we've been fixing bugs there, especially with introduction of package map (a6a198bc60e6 that I mentioned in the other message is one such example). There is work scheduled on the hypervisor side to properly report initial APIC IDs in CPUID and that may help with correctly dealing with this on Linux side. (Longer term we would like to move to PVH where we will have "true" APIC emulation, with ACPI, just like in HVM guests. You can do your part by helping: http://marc.info/?l=linux-kernel=147791731414152=3 ;-)) -boris
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: > For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map > for PV VCPUs") to at least temporarily work around some topology map > problems that PV guests have with RAPL (which I think is what Vefa's > problem was). RAPL is just the messenger and it could be any other facility relying on the package map information. The underlying root cause is something else. See the other mail. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: > For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map > for PV VCPUs") to at least temporarily work around some topology map > problems that PV guests have with RAPL (which I think is what Vefa's > problem was). RAPL is just the messenger and it could be any other facility relying on the package map information. The underlying root cause is something else. See the other mail. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > On 11/10/2016 10:12 AM, Thomas Gleixner wrote: > > On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > >> By firmware you mean ACPI? It is most likely not available to PV guests. > > You either have to provide ACPI or MP tables. And either of those has to > > provide the intial APIC ids for the CPUs. They are supposed to match the > > IDs which are in the CPUID leafs. > > > >> How about returning cpu_data(cpu).initial_apicid? > > For a not yet brought up CPU. That's what the initial ACPI/MP table > > enumeration is for. > > Unfortunately PV guests have neither. So we may need to emulate > something in xen_cpu_present_to_apicid(). SFI does the same thing and according to the dmesg which was posted, this is using SFI. We also have devicetree based boot concept which provides the APICids in the CPU enumeration at boot time in a way which the whole x86 machinery is expecting. So what kind of APICid is XEN handing in via SFI? None, or just an invalid one? Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > On 11/10/2016 10:12 AM, Thomas Gleixner wrote: > > On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > >> By firmware you mean ACPI? It is most likely not available to PV guests. > > You either have to provide ACPI or MP tables. And either of those has to > > provide the intial APIC ids for the CPUs. They are supposed to match the > > IDs which are in the CPUID leafs. > > > >> How about returning cpu_data(cpu).initial_apicid? > > For a not yet brought up CPU. That's what the initial ACPI/MP table > > enumeration is for. > > Unfortunately PV guests have neither. So we may need to emulate > something in xen_cpu_present_to_apicid(). SFI does the same thing and according to the dmesg which was posted, this is using SFI. We also have devicetree based boot concept which provides the APICids in the CPU enumeration at boot time in a way which the whole x86 machinery is expecting. So what kind of APICid is XEN handing in via SFI? None, or just an invalid one? Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 2016-11-10 10:31:20 [-0500], Boris Ostrovsky wrote: > For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map > for PV VCPUs") to at least temporarily work around some topology map > problems that PV guests have with RAPL (which I think is what Vefa's > problem was). I wouldn't blame it on RAPL. It is RAPL that exposes the wrong APICID <-> CPU <-> Package mapping. > -boris Sebastian
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 2016-11-10 10:31:20 [-0500], Boris Ostrovsky wrote: > For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map > for PV VCPUs") to at least temporarily work around some topology map > problems that PV guests have with RAPL (which I think is what Vefa's > problem was). I wouldn't blame it on RAPL. It is RAPL that exposes the wrong APICID <-> CPU <-> Package mapping. > -boris Sebastian
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 10:12 AM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, Boris Ostrovsky wrote: >> On 11/10/2016 06:13 AM, Thomas Gleixner wrote: >>> On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: >>> I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 >>> So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which >>> translates to a bogus package id. And looking at the XEN code: >>> >>>xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, >>> >>> and xen_cpu_present_to_apicid does: >>> >>> static int xen_cpu_present_to_apicid(int cpu) >>> { >>> if (cpu_present(cpu)) >>> return xen_get_apic_id(xen_apic_read(APIC_ID)); >>> else >>> return BAD_APICID; >>> } >>> >>> So independent of which present CPU we query we get just some random >>> information, in the above case we get BAD_APICID from xen_apic_read() not >>> from the else path as this CPU _IS_ present. >>> >>> What's so wrong with storing the fricking firmware supplied APICid as >>> everybody else does and report it back when queried? >> By firmware you mean ACPI? It is most likely not available to PV guests. > You either have to provide ACPI or MP tables. And either of those has to > provide the intial APIC ids for the CPUs. They are supposed to match the > IDs which are in the CPUID leafs. > >> How about returning cpu_data(cpu).initial_apicid? > For a not yet brought up CPU. That's what the initial ACPI/MP table > enumeration is for. Unfortunately PV guests have neither. So we may need to emulate something in xen_cpu_present_to_apicid(). -boris (+xen-devel)
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 10:12 AM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, Boris Ostrovsky wrote: >> On 11/10/2016 06:13 AM, Thomas Gleixner wrote: >>> On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: >>> I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 >>> So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which >>> translates to a bogus package id. And looking at the XEN code: >>> >>>xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, >>> >>> and xen_cpu_present_to_apicid does: >>> >>> static int xen_cpu_present_to_apicid(int cpu) >>> { >>> if (cpu_present(cpu)) >>> return xen_get_apic_id(xen_apic_read(APIC_ID)); >>> else >>> return BAD_APICID; >>> } >>> >>> So independent of which present CPU we query we get just some random >>> information, in the above case we get BAD_APICID from xen_apic_read() not >>> from the else path as this CPU _IS_ present. >>> >>> What's so wrong with storing the fricking firmware supplied APICid as >>> everybody else does and report it back when queried? >> By firmware you mean ACPI? It is most likely not available to PV guests. > You either have to provide ACPI or MP tables. And either of those has to > provide the intial APIC ids for the CPUs. They are supposed to match the > IDs which are in the CPUID leafs. > >> How about returning cpu_data(cpu).initial_apicid? > For a not yet brought up CPU. That's what the initial ACPI/MP table > enumeration is for. Unfortunately PV guests have neither. So we may need to emulate something in xen_cpu_present_to_apicid(). -boris (+xen-devel)
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: > > > On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: >> On 11/10/2016 06:13 AM, Thomas Gleixner wrote: >>> On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: >>> I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 >>> So apic->cpu_present_to_apicid() gives us a completely bogus APIC id >>> which >>> translates to a bogus package id. And looking at the XEN code: >>> >>>xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, >>> >>> and xen_cpu_present_to_apicid does: >>> >>> static int xen_cpu_present_to_apicid(int cpu) >>> { >>> if (cpu_present(cpu)) >>> return xen_get_apic_id(xen_apic_read(APIC_ID)); >>> else >>> return BAD_APICID; >>> } >>> >>> So independent of which present CPU we query we get just some random >>> information, in the above case we get BAD_APICID from >>> xen_apic_read() not >>> from the else path as this CPU _IS_ present. >>> >>> What's so wrong with storing the fricking firmware supplied APICid as >>> everybody else does and report it back when queried? >> >> By firmware you mean ACPI? It is most likely not available to PV guests. >> How about returning cpu_data(cpu).initial_apicid? >> >> And what was the original problem? > > The original issue I found was that VMware was returning a different set > of APIC id's in the ACPI tables than what it advertised on the CPU's. > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map for PV VCPUs") to at least temporarily work around some topology map problems that PV guests have with RAPL (which I think is what Vefa's problem was). -boris
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 10:05 AM, Charles (Chas) Williams wrote: > > > On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: >> On 11/10/2016 06:13 AM, Thomas Gleixner wrote: >>> On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: >>> I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 >>> So apic->cpu_present_to_apicid() gives us a completely bogus APIC id >>> which >>> translates to a bogus package id. And looking at the XEN code: >>> >>>xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, >>> >>> and xen_cpu_present_to_apicid does: >>> >>> static int xen_cpu_present_to_apicid(int cpu) >>> { >>> if (cpu_present(cpu)) >>> return xen_get_apic_id(xen_apic_read(APIC_ID)); >>> else >>> return BAD_APICID; >>> } >>> >>> So independent of which present CPU we query we get just some random >>> information, in the above case we get BAD_APICID from >>> xen_apic_read() not >>> from the else path as this CPU _IS_ present. >>> >>> What's so wrong with storing the fricking firmware supplied APICid as >>> everybody else does and report it back when queried? >> >> By firmware you mean ACPI? It is most likely not available to PV guests. >> How about returning cpu_data(cpu).initial_apicid? >> >> And what was the original problem? > > The original issue I found was that VMware was returning a different set > of APIC id's in the ACPI tables than what it advertised on the CPU's. > > http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html For Xen, we recently added a6a198bc60e6 ("xen/x86: Update topology map for PV VCPUs") to at least temporarily work around some topology map problems that PV guests have with RAPL (which I think is what Vefa's problem was). -boris
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > On 11/10/2016 06:13 AM, Thomas Gleixner wrote: > > On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > > > >> I have found that your patch unfortunately does not improve the situation > >> for me. Here is an excerpt obtained from the dmesg of a kernel compiled > >> with this patch *as well as* Sebastian's patch: > >> [0.002561] CPU: Physical Processor ID: 0 > >> [0.002566] CPU: Processor Core ID: 0 > >> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: > >> CPUID: 2 > > So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which > > translates to a bogus package id. And looking at the XEN code: > > > >xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, > > > > and xen_cpu_present_to_apicid does: > > > > static int xen_cpu_present_to_apicid(int cpu) > > { > > if (cpu_present(cpu)) > > return xen_get_apic_id(xen_apic_read(APIC_ID)); > > else > > return BAD_APICID; > > } > > > > So independent of which present CPU we query we get just some random > > information, in the above case we get BAD_APICID from xen_apic_read() not > > from the else path as this CPU _IS_ present. > > > > What's so wrong with storing the fricking firmware supplied APICid as > > everybody else does and report it back when queried? > > By firmware you mean ACPI? It is most likely not available to PV guests. You either have to provide ACPI or MP tables. And either of those has to provide the intial APIC ids for the CPUs. They are supposed to match the IDs which are in the CPUID leafs. > How about returning cpu_data(cpu).initial_apicid? For a not yet brought up CPU. That's what the initial ACPI/MP table enumeration is for. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Boris Ostrovsky wrote: > On 11/10/2016 06:13 AM, Thomas Gleixner wrote: > > On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > > > >> I have found that your patch unfortunately does not improve the situation > >> for me. Here is an excerpt obtained from the dmesg of a kernel compiled > >> with this patch *as well as* Sebastian's patch: > >> [0.002561] CPU: Physical Processor ID: 0 > >> [0.002566] CPU: Processor Core ID: 0 > >> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: > >> CPUID: 2 > > So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which > > translates to a bogus package id. And looking at the XEN code: > > > >xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, > > > > and xen_cpu_present_to_apicid does: > > > > static int xen_cpu_present_to_apicid(int cpu) > > { > > if (cpu_present(cpu)) > > return xen_get_apic_id(xen_apic_read(APIC_ID)); > > else > > return BAD_APICID; > > } > > > > So independent of which present CPU we query we get just some random > > information, in the above case we get BAD_APICID from xen_apic_read() not > > from the else path as this CPU _IS_ present. > > > > What's so wrong with storing the fricking firmware supplied APICid as > > everybody else does and report it back when queried? > > By firmware you mean ACPI? It is most likely not available to PV guests. You either have to provide ACPI or MP tables. And either of those has to provide the intial APIC ids for the CPUs. They are supposed to match the IDs which are in the CPUID leafs. > How about returning cpu_data(cpu).initial_apicid? For a not yet brought up CPU. That's what the initial ACPI/MP table enumeration is for. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: On 11/10/2016 06:13 AM, Thomas Gleixner wrote: On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which translates to a bogus package id. And looking at the XEN code: xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, and xen_cpu_present_to_apicid does: static int xen_cpu_present_to_apicid(int cpu) { if (cpu_present(cpu)) return xen_get_apic_id(xen_apic_read(APIC_ID)); else return BAD_APICID; } So independent of which present CPU we query we get just some random information, in the above case we get BAD_APICID from xen_apic_read() not from the else path as this CPU _IS_ present. What's so wrong with storing the fricking firmware supplied APICid as everybody else does and report it back when queried? By firmware you mean ACPI? It is most likely not available to PV guests. How about returning cpu_data(cpu).initial_apicid? And what was the original problem? The original issue I found was that VMware was returning a different set of APIC id's in the ACPI tables than what it advertised on the CPU's. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html This damned attitude of we just hack the code into submission and let everybody else deal with the outcoming is utterly annoying. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 09:02 AM, Boris Ostrovsky wrote: On 11/10/2016 06:13 AM, Thomas Gleixner wrote: On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which translates to a bogus package id. And looking at the XEN code: xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, and xen_cpu_present_to_apicid does: static int xen_cpu_present_to_apicid(int cpu) { if (cpu_present(cpu)) return xen_get_apic_id(xen_apic_read(APIC_ID)); else return BAD_APICID; } So independent of which present CPU we query we get just some random information, in the above case we get BAD_APICID from xen_apic_read() not from the else path as this CPU _IS_ present. What's so wrong with storing the fricking firmware supplied APICid as everybody else does and report it back when queried? By firmware you mean ACPI? It is most likely not available to PV guests. How about returning cpu_data(cpu).initial_apicid? And what was the original problem? The original issue I found was that VMware was returning a different set of APIC id's in the ACPI tables than what it advertised on the CPU's. http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1266716.html This damned attitude of we just hack the code into submission and let everybody else deal with the outcoming is utterly annoying. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 06:13 AM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > >> I have found that your patch unfortunately does not improve the situation >> for me. Here is an excerpt obtained from the dmesg of a kernel compiled >> with this patch *as well as* Sebastian's patch: >> [0.002561] CPU: Physical Processor ID: 0 >> [0.002566] CPU: Processor Core ID: 0 >> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: >> 2 > So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which > translates to a bogus package id. And looking at the XEN code: > >xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, > > and xen_cpu_present_to_apicid does: > > static int xen_cpu_present_to_apicid(int cpu) > { > if (cpu_present(cpu)) > return xen_get_apic_id(xen_apic_read(APIC_ID)); > else > return BAD_APICID; > } > > So independent of which present CPU we query we get just some random > information, in the above case we get BAD_APICID from xen_apic_read() not > from the else path as this CPU _IS_ present. > > What's so wrong with storing the fricking firmware supplied APICid as > everybody else does and report it back when queried? By firmware you mean ACPI? It is most likely not available to PV guests. How about returning cpu_data(cpu).initial_apicid? And what was the original problem? -boris > > This damned attitude of we just hack the code into submission and let > everybody else deal with the outcoming is utterly annoying. > > Thanks, > > tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/10/2016 06:13 AM, Thomas Gleixner wrote: > On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > >> I have found that your patch unfortunately does not improve the situation >> for me. Here is an excerpt obtained from the dmesg of a kernel compiled >> with this patch *as well as* Sebastian's patch: >> [0.002561] CPU: Physical Processor ID: 0 >> [0.002566] CPU: Processor Core ID: 0 >> [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: >> 2 > So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which > translates to a bogus package id. And looking at the XEN code: > >xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, > > and xen_cpu_present_to_apicid does: > > static int xen_cpu_present_to_apicid(int cpu) > { > if (cpu_present(cpu)) > return xen_get_apic_id(xen_apic_read(APIC_ID)); > else > return BAD_APICID; > } > > So independent of which present CPU we query we get just some random > information, in the above case we get BAD_APICID from xen_apic_read() not > from the else path as this CPU _IS_ present. > > What's so wrong with storing the fricking firmware supplied APICid as > everybody else does and report it back when queried? By firmware you mean ACPI? It is most likely not available to PV guests. How about returning cpu_data(cpu).initial_apicid? And what was the original problem? -boris > > This damned attitude of we just hack the code into submission and let > everybody else deal with the outcoming is utterly annoying. > > Thanks, > > tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, Nov 10, 2016 at 12:13:04PM +0100, Thomas Gleixner wrote: > What's so wrong with storing the fricking firmware supplied APICid as > everybody else does and report it back when queried? > > This damned attitude of we just hack the code into submission and let > everybody else deal with the outcoming is utterly annoying. At some point I'd recommend just marking things BROKEN and getting on with life. There's only so much crap you can deal with. What Xen does here is completely insane.
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, Nov 10, 2016 at 12:13:04PM +0100, Thomas Gleixner wrote: > What's so wrong with storing the fricking firmware supplied APICid as > everybody else does and report it back when queried? > > This damned attitude of we just hack the code into submission and let > everybody else deal with the outcoming is utterly annoying. At some point I'd recommend just marking things BROKEN and getting on with life. There's only so much crap you can deal with. What Xen does here is completely insane.
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Charles (Chas) Williams wrote: > On 11/09/2016 10:57 PM, M. Vefa Bicakci wrote: > > [0.002000] mvb: CPU: Physical Processor ID: 0 > > [0.002000] mvb: CPU: Processor Core ID: 0 > > [0.002000] mvb: identify_cpu:1112: c: 880013b0a040, > > c->logical_proc_id: 65535 > > [0.002000] mvb: __default_cpu_present_to_apicid:612: Returning 65535! > > mps_cpu: 1, nr_cpu_ids: 2, cpu_present(mps_cpu): 1 > > [0.002000] smpboot: mvb: topology_update_package_map:270: cpu: 1, pkg: > > 4095 > > [0.002000] smpboot: APIC() Converting physical 4095 to logical > > package 0 > > [0.002000] smpboot: mvb: topology_update_package_map:305: cpu: 1, > > cpu_data(cpu).logical_proc_id: 0 > > This seems strange. 0x is BAD_APICID. Why didn't this fail here: > > for_each_present_cpu(cpu) { > unsigned int apicid = apic->cpu_present_to_apicid(cpu); > > if (apicid == BAD_APICID || !apic->apic_id_valid(apicid)) > << > continue; > if (!topology_update_package_map(apicid, cpu)) > continue; > > topology_update_package_map() should never have been called? That's right. Looking into it
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, Charles (Chas) Williams wrote: > On 11/09/2016 10:57 PM, M. Vefa Bicakci wrote: > > [0.002000] mvb: CPU: Physical Processor ID: 0 > > [0.002000] mvb: CPU: Processor Core ID: 0 > > [0.002000] mvb: identify_cpu:1112: c: 880013b0a040, > > c->logical_proc_id: 65535 > > [0.002000] mvb: __default_cpu_present_to_apicid:612: Returning 65535! > > mps_cpu: 1, nr_cpu_ids: 2, cpu_present(mps_cpu): 1 > > [0.002000] smpboot: mvb: topology_update_package_map:270: cpu: 1, pkg: > > 4095 > > [0.002000] smpboot: APIC() Converting physical 4095 to logical > > package 0 > > [0.002000] smpboot: mvb: topology_update_package_map:305: cpu: 1, > > cpu_data(cpu).logical_proc_id: 0 > > This seems strange. 0x is BAD_APICID. Why didn't this fail here: > > for_each_present_cpu(cpu) { > unsigned int apicid = apic->cpu_present_to_apicid(cpu); > > if (apicid == BAD_APICID || !apic->apic_id_valid(apicid)) > << > continue; > if (!topology_update_package_map(apicid, cpu)) > continue; > > topology_update_package_map() should never have been called? That's right. Looking into it
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > I have found that your patch unfortunately does not improve the situation > for me. Here is an excerpt obtained from the dmesg of a kernel compiled > with this patch *as well as* Sebastian's patch: > [0.002561] CPU: Physical Processor ID: 0 > [0.002566] CPU: Processor Core ID: 0 > [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which translates to a bogus package id. And looking at the XEN code: xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, and xen_cpu_present_to_apicid does: static int xen_cpu_present_to_apicid(int cpu) { if (cpu_present(cpu)) return xen_get_apic_id(xen_apic_read(APIC_ID)); else return BAD_APICID; } So independent of which present CPU we query we get just some random information, in the above case we get BAD_APICID from xen_apic_read() not from the else path as this CPU _IS_ present. What's so wrong with storing the fricking firmware supplied APICid as everybody else does and report it back when queried? This damned attitude of we just hack the code into submission and let everybody else deal with the outcoming is utterly annoying. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Thu, 10 Nov 2016, M. Vefa Bicakci wrote: > I have found that your patch unfortunately does not improve the situation > for me. Here is an excerpt obtained from the dmesg of a kernel compiled > with this patch *as well as* Sebastian's patch: > [0.002561] CPU: Physical Processor ID: 0 > [0.002566] CPU: Processor Core ID: 0 > [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 So apic->cpu_present_to_apicid() gives us a completely bogus APIC id which translates to a bogus package id. And looking at the XEN code: xen_pv_apic.cpu_present_to_apicid = xen_cpu_present_to_apicid, and xen_cpu_present_to_apicid does: static int xen_cpu_present_to_apicid(int cpu) { if (cpu_present(cpu)) return xen_get_apic_id(xen_apic_read(APIC_ID)); else return BAD_APICID; } So independent of which present CPU we query we get just some random information, in the above case we get BAD_APICID from xen_apic_read() not from the else path as this CPU _IS_ present. What's so wrong with storing the fricking firmware supplied APICid as everybody else does and report it back when queried? This damned attitude of we just hack the code into submission and let everybody else deal with the outcoming is utterly annoying. Thanks, tglx
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 10:57 PM, M. Vefa Bicakci wrote: [0.002000] mvb: CPU: Physical Processor ID: 0 [0.002000] mvb: CPU: Processor Core ID: 0 [0.002000] mvb: identify_cpu:1112: c: 880013b0a040, c->logical_proc_id: 65535 [0.002000] mvb: __default_cpu_present_to_apicid:612: Returning 65535! mps_cpu: 1, nr_cpu_ids: 2, cpu_present(mps_cpu): 1 [0.002000] smpboot: mvb: topology_update_package_map:270: cpu: 1, pkg: 4095 [0.002000] smpboot: APIC() Converting physical 4095 to logical package 0 [0.002000] smpboot: mvb: topology_update_package_map:305: cpu: 1, cpu_data(cpu).logical_proc_id: 0 This seems strange. 0x is BAD_APICID. Why didn't this fail here: for_each_present_cpu(cpu) { unsigned int apicid = apic->cpu_present_to_apicid(cpu); if (apicid == BAD_APICID || !apic->apic_id_valid(apicid)) << continue; if (!topology_update_package_map(apicid, cpu)) continue; topology_update_package_map() should never have been called?
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 10:57 PM, M. Vefa Bicakci wrote: [0.002000] mvb: CPU: Physical Processor ID: 0 [0.002000] mvb: CPU: Processor Core ID: 0 [0.002000] mvb: identify_cpu:1112: c: 880013b0a040, c->logical_proc_id: 65535 [0.002000] mvb: __default_cpu_present_to_apicid:612: Returning 65535! mps_cpu: 1, nr_cpu_ids: 2, cpu_present(mps_cpu): 1 [0.002000] smpboot: mvb: topology_update_package_map:270: cpu: 1, pkg: 4095 [0.002000] smpboot: APIC() Converting physical 4095 to logical package 0 [0.002000] smpboot: mvb: topology_update_package_map:305: cpu: 1, cpu_data(cpu).logical_proc_id: 0 This seems strange. 0x is BAD_APICID. Why didn't this fail here: for_each_present_cpu(cpu) { unsigned int apicid = apic->cpu_present_to_apicid(cpu); if (apicid == BAD_APICID || !apic->apic_id_valid(apicid)) << continue; if (!topology_update_package_map(apicid, cpu)) continue; topology_update_package_map() should never have been called?
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 06:35 PM, Thomas Gleixner wrote: > Both ACPI and MP specifications require that the APIC id in the respective > tables must be the same as the APIC id in CPUID. > > The kernel retrieves the physical package id from the APIC id during the > ACPI/MP table scan and builds the physical to logical package map. > > There exist Virtualbox and Xen implementations which violate the spec. As a > result the physical to logical package map, which relies on the ACPI/MP > tables does not work on those systems, because the CPUID initialized > physical package id does not match the firmware id. This causes system > crashes and malfunction due to invalid package mappings. > > The only way to cure this is to sanitize the physical package id after the > CPUID enumeration and yell when the APIC ids are different. If the physical > package IDs differ use the package information from the ACPI/MP tables so > the existing logical package map just works. > > Reported-by: "Charles (Chas) Williams", > Reported-by: M. Vefa Bicakci > Signed-off-by: Thomas Gleixner Hello Thomas and Sebastian, Sorry for the delay in reporting what I have found out and for the delay in testing this patch, which has been due to my health. I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: === 8< === [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 [0.002577] [Firmware Bug]: CPU0: Using firmware package id 4095 instead of 0 [0.002586] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8 [0.002591] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0 [0.033319] ftrace: allocating 28753 entries in 113 pages [0.040121] cpu 0 spinlock event irq 1 [0.040145] smpboot: Max logical packages: 1 [0.040155] VPMU disabled by hypervisor. [0.040181] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only. [0.047050] NMI watchdog: disabled (cpu0): hardware events not enabled [0.047065] NMI watchdog: Shutting down hard lockup detector on all cpus [0.052015] installing Xen timer for CPU 1 [0.052074] SMP alternatives: switching to SMP code [0.002000] Disabled fast string operations [0.002000] [Firmware Bug]: CPU1: APIC id mismatch. Firmware: CPUID: 2 [0.002000] [Firmware Bug]: CPU1: Using firmware package id 4095 instead of 0 [0.002000] smpboot: APIC() Converting physical 4095 to logical package 0 [0.078061] cpu 1 spinlock event irq 13 ... [0.216404] Freeing initrd memory: 4340K (880001fa7000 - 8800023e4000) [0.216487] RAPL PMU: rapl pmu error: max package: 1 but CPU0 belongs to 65535 [0.217572] futex hash table entries: 512 (order: 3, 32768 bytes) [0.218293] Initialise system trusted keyrings ... [0.216404] Freeing initrd memory: 4340K (880001fa7000 - 8800023e4000) [0.216487] RAPL PMU: rapl pmu error: max package: 1 but CPU0 belongs to 65535 [0.217572] futex hash table entries: 512 (order: 3, 32768 bytes) ... [2.974474] intel_rapl: Found RAPL domain package [2.974489] intel_rapl: Found RAPL domain core [2.974498] intel_rapl: Found RAPL domain uncore [2.974518] intel_rapl: RAPL package 4095 domain package locked by BIOS === >8 === As you can see, your patch unfortunately does not correct the issue with the virtual CPU package identifiers in Xen-based virtual machines using para-virtualization. In summary, the root cause of the issue for me appears to be that the boot-up code in the init_apic_mappings function switches the APIC 'ops' structure from Xen's 'ops' structure to the no-op ops structure. Due to this, the smp_init_package_map uses the no-op APIC ops structure's cpu_present_to_to_apicid function, even though it should use the corresponding method from Xen's APIC ops structure (i.e., xen_cpu_present_to_apicid). Here is a dmesg excerpt with a kernel patched with Sebastian's patch (and some debugging code), exhibiting the issue I have just explained: (Note the 'switched to apic NOOP' line.) === 8< === (early) [0.00] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org (early) [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs (early) [0.00] No local APIC present (early) [0.00] APIC: disable apic facility (early) [0.00] APIC: switched to apic NOOP (early) [0.00] e820: [mem 0xfa00-0x] available for PCI devices (early) [0.00] Booting paravirtualized kernel on Xen ... [0.034082] mvb: kernel_init_freeable:1007: About to call smp_prepare_cpus... [0.034123] cpu 0 spinlock event irq 1 [0.034138] smpboot: mvb: smp_init_package_map:372:
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 06:35 PM, Thomas Gleixner wrote: > Both ACPI and MP specifications require that the APIC id in the respective > tables must be the same as the APIC id in CPUID. > > The kernel retrieves the physical package id from the APIC id during the > ACPI/MP table scan and builds the physical to logical package map. > > There exist Virtualbox and Xen implementations which violate the spec. As a > result the physical to logical package map, which relies on the ACPI/MP > tables does not work on those systems, because the CPUID initialized > physical package id does not match the firmware id. This causes system > crashes and malfunction due to invalid package mappings. > > The only way to cure this is to sanitize the physical package id after the > CPUID enumeration and yell when the APIC ids are different. If the physical > package IDs differ use the package information from the ACPI/MP tables so > the existing logical package map just works. > > Reported-by: "Charles (Chas) Williams" , > Reported-by: M. Vefa Bicakci > Signed-off-by: Thomas Gleixner Hello Thomas and Sebastian, Sorry for the delay in reporting what I have found out and for the delay in testing this patch, which has been due to my health. I have found that your patch unfortunately does not improve the situation for me. Here is an excerpt obtained from the dmesg of a kernel compiled with this patch *as well as* Sebastian's patch: === 8< === [0.002561] CPU: Physical Processor ID: 0 [0.002566] CPU: Processor Core ID: 0 [0.002572] [Firmware Bug]: CPU0: APIC id mismatch. Firmware: CPUID: 2 [0.002577] [Firmware Bug]: CPU0: Using firmware package id 4095 instead of 0 [0.002586] Last level iTLB entries: 4KB 512, 2MB 8, 4MB 8 [0.002591] Last level dTLB entries: 4KB 512, 2MB 32, 4MB 32, 1GB 0 [0.033319] ftrace: allocating 28753 entries in 113 pages [0.040121] cpu 0 spinlock event irq 1 [0.040145] smpboot: Max logical packages: 1 [0.040155] VPMU disabled by hypervisor. [0.040181] Performance Events: unsupported p6 CPU model 42 no PMU driver, software events only. [0.047050] NMI watchdog: disabled (cpu0): hardware events not enabled [0.047065] NMI watchdog: Shutting down hard lockup detector on all cpus [0.052015] installing Xen timer for CPU 1 [0.052074] SMP alternatives: switching to SMP code [0.002000] Disabled fast string operations [0.002000] [Firmware Bug]: CPU1: APIC id mismatch. Firmware: CPUID: 2 [0.002000] [Firmware Bug]: CPU1: Using firmware package id 4095 instead of 0 [0.002000] smpboot: APIC() Converting physical 4095 to logical package 0 [0.078061] cpu 1 spinlock event irq 13 ... [0.216404] Freeing initrd memory: 4340K (880001fa7000 - 8800023e4000) [0.216487] RAPL PMU: rapl pmu error: max package: 1 but CPU0 belongs to 65535 [0.217572] futex hash table entries: 512 (order: 3, 32768 bytes) [0.218293] Initialise system trusted keyrings ... [0.216404] Freeing initrd memory: 4340K (880001fa7000 - 8800023e4000) [0.216487] RAPL PMU: rapl pmu error: max package: 1 but CPU0 belongs to 65535 [0.217572] futex hash table entries: 512 (order: 3, 32768 bytes) ... [2.974474] intel_rapl: Found RAPL domain package [2.974489] intel_rapl: Found RAPL domain core [2.974498] intel_rapl: Found RAPL domain uncore [2.974518] intel_rapl: RAPL package 4095 domain package locked by BIOS === >8 === As you can see, your patch unfortunately does not correct the issue with the virtual CPU package identifiers in Xen-based virtual machines using para-virtualization. In summary, the root cause of the issue for me appears to be that the boot-up code in the init_apic_mappings function switches the APIC 'ops' structure from Xen's 'ops' structure to the no-op ops structure. Due to this, the smp_init_package_map uses the no-op APIC ops structure's cpu_present_to_to_apicid function, even though it should use the corresponding method from Xen's APIC ops structure (i.e., xen_cpu_present_to_apicid). Here is a dmesg excerpt with a kernel patched with Sebastian's patch (and some debugging code), exhibiting the issue I have just explained: (Note the 'switched to apic NOOP' line.) === 8< === (early) [0.00] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org (early) [0.00] smpboot: Allowing 2 CPUs, 0 hotplug CPUs (early) [0.00] No local APIC present (early) [0.00] APIC: disable apic facility (early) [0.00] APIC: switched to apic NOOP (early) [0.00] e820: [mem 0xfa00-0x] available for PCI devices (early) [0.00] Booting paravirtualized kernel on Xen ... [0.034082] mvb: kernel_init_freeable:1007: About to call smp_prepare_cpus... [0.034123] cpu 0 spinlock event irq 1 [0.034138] smpboot: mvb: smp_init_package_map:372: max_physical_pkg_id, after set: 4096 [0.034146] mvb:
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Wed, 9 Nov 2016, Charles (Chas) Williams wrote: > On 11/09/2016 11:03 AM, Peter Zijlstra wrote: > > On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote: > > > Both ACPI and MP specifications require that the APIC id in the respective > > > tables must be the same as the APIC id in CPUID. > > > > > > The kernel retrieves the physical package id from the APIC id during the > > > ACPI/MP table scan and builds the physical to logical package map. > > > > > > There exist Virtualbox and Xen implementations which violate the spec. As > > > a > > > > ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy > > virt stuff. > > Yes, this was VMware in particular. It would be good to get this comment > right so as not to mislead anyone. Sure, will fix. > > > > /* > > > + * The physical to logical package id mapping is initialized from the > > > + * acpi/mptables information. Make sure that CPUID actually agrees with > > > + * that. > > > + */ > > > +static void sanitize_package_id(struct cpuinfo_x86 *c) > > > +{ > > > +#ifdef CONFIG_SMP > > > + unsigned int pkg, apicid, cpu = smp_processor_id(); > > > + > > > + apicid = apic->cpu_present_to_apicid(cpu); > > > + pkg = apicid >> boot_cpu_data.x86_coreid_bits; > > > + > > > + if (apicid != c->initial_apicid) { > > > + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: > > > %x\n", > > > +cpu, apicid, c->initial_apicid); > > > > Should we not also 'fix' c->initial_apicid ? > > Since we have c->apicid and c->initial_apicid it seems reasonable to keep one > set to the "correct" value. I don't think c->initial_apicid is used past > this. It is, but just for a printk in the MCE code, so it should not matter at all. > I should have some tests on this patch later today. >
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Wed, 9 Nov 2016, Charles (Chas) Williams wrote: > On 11/09/2016 11:03 AM, Peter Zijlstra wrote: > > On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote: > > > Both ACPI and MP specifications require that the APIC id in the respective > > > tables must be the same as the APIC id in CPUID. > > > > > > The kernel retrieves the physical package id from the APIC id during the > > > ACPI/MP table scan and builds the physical to logical package map. > > > > > > There exist Virtualbox and Xen implementations which violate the spec. As > > > a > > > > ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy > > virt stuff. > > Yes, this was VMware in particular. It would be good to get this comment > right so as not to mislead anyone. Sure, will fix. > > > > /* > > > + * The physical to logical package id mapping is initialized from the > > > + * acpi/mptables information. Make sure that CPUID actually agrees with > > > + * that. > > > + */ > > > +static void sanitize_package_id(struct cpuinfo_x86 *c) > > > +{ > > > +#ifdef CONFIG_SMP > > > + unsigned int pkg, apicid, cpu = smp_processor_id(); > > > + > > > + apicid = apic->cpu_present_to_apicid(cpu); > > > + pkg = apicid >> boot_cpu_data.x86_coreid_bits; > > > + > > > + if (apicid != c->initial_apicid) { > > > + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: > > > %x\n", > > > +cpu, apicid, c->initial_apicid); > > > > Should we not also 'fix' c->initial_apicid ? > > Since we have c->apicid and c->initial_apicid it seems reasonable to keep one > set to the "correct" value. I don't think c->initial_apicid is used past > this. It is, but just for a printk in the MCE code, so it should not matter at all. > I should have some tests on this patch later today. >
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 10:35 AM, Thomas Gleixner wrote: Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings. The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. If the physical package IDs differ use the package information from the ACPI/MP tables so the existing logical package map just works. Reported-by: "Charles (Chas) Williams", Reported-by: M. Vefa Bicakci Signed-off-by: Thomas Gleixner For 4 virtual sockets, 1 core per socket VM: [0.235459] node #0, CPUs: #1 [0.238579] Disabled fast string operations [0.238620] mce: CPU supports 0 MCE banks [0.238864] [Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 CPUID: 2 [0.238878] [Firmware Bug]: CPU1: Using firmware package id 1 instead of 2 [0.239502] #2 [0.241298] Disabled fast string operations [0.241356] mce: CPU supports 0 MCE banks [0.241429] [Firmware Bug]: CPU2: APIC id mismatch. Firmware: 2 CPUID: 4 [0.241431] [Firmware Bug]: CPU2: Using firmware package id 2 instead of 4 [0.241631] #3 [0.244075] Disabled fast string operations [0.244112] mce: CPU supports 0 MCE banks [0.244284] [Firmware Bug]: CPU3: APIC id mismatch. Firmware: 3 CPUID: 6 [0.244293] [Firmware Bug]: CPU3: Using firmware package id 3 instead of 6 For a 2 virtual sockets, 2 cores per socket, VMware seems to get its APIC table correct as this fixup code isn't triggered. The mapping looks like: [0.028911] topology_update_package_map: apicid 0 pkg 0 cpu 0 [0.029068] smpboot: APIC(0) Converting physical 0 to logical package 0, cpu 0 [0.029072] topology_update_package_map: apicid 1 pkg 0 cpu 1 [0.029220] topology_update_package_map: apicid 2 pkg 1 cpu 2 [0.029376] smpboot: APIC(2) Converting physical 1 to logical package 1, cpu 2 [0.029381] topology_update_package_map: apicid 3 pkg 1 cpu 3 [0.029525] smpboot: Max logical packages: 2 For a VM with 1 virtual socket and 4 cores, the behavior is again correct. [0.016198] topology_update_package_map: apicid 0 pkg 0 cpu 0 [0.016271] smpboot: APIC(0) Converting physical 0 to logical package 0, cpu 0 (88023fc0a040) [0.016273] topology_update_package_map: apicid 1 pkg 0 cpu 1 [0.016336] topology_update_package_map: apicid 2 pkg 0 cpu 2 [0.016397] topology_update_package_map: apicid 3 pkg 0 cpu 3 It looks like VMware might have some assumption about the minimum number of cores on a virtual socket. Regardless, the fix solves my problem! Thanks!
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 10:35 AM, Thomas Gleixner wrote: Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings. The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. If the physical package IDs differ use the package information from the ACPI/MP tables so the existing logical package map just works. Reported-by: "Charles (Chas) Williams" , Reported-by: M. Vefa Bicakci Signed-off-by: Thomas Gleixner For 4 virtual sockets, 1 core per socket VM: [0.235459] node #0, CPUs: #1 [0.238579] Disabled fast string operations [0.238620] mce: CPU supports 0 MCE banks [0.238864] [Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 CPUID: 2 [0.238878] [Firmware Bug]: CPU1: Using firmware package id 1 instead of 2 [0.239502] #2 [0.241298] Disabled fast string operations [0.241356] mce: CPU supports 0 MCE banks [0.241429] [Firmware Bug]: CPU2: APIC id mismatch. Firmware: 2 CPUID: 4 [0.241431] [Firmware Bug]: CPU2: Using firmware package id 2 instead of 4 [0.241631] #3 [0.244075] Disabled fast string operations [0.244112] mce: CPU supports 0 MCE banks [0.244284] [Firmware Bug]: CPU3: APIC id mismatch. Firmware: 3 CPUID: 6 [0.244293] [Firmware Bug]: CPU3: Using firmware package id 3 instead of 6 For a 2 virtual sockets, 2 cores per socket, VMware seems to get its APIC table correct as this fixup code isn't triggered. The mapping looks like: [0.028911] topology_update_package_map: apicid 0 pkg 0 cpu 0 [0.029068] smpboot: APIC(0) Converting physical 0 to logical package 0, cpu 0 [0.029072] topology_update_package_map: apicid 1 pkg 0 cpu 1 [0.029220] topology_update_package_map: apicid 2 pkg 1 cpu 2 [0.029376] smpboot: APIC(2) Converting physical 1 to logical package 1, cpu 2 [0.029381] topology_update_package_map: apicid 3 pkg 1 cpu 3 [0.029525] smpboot: Max logical packages: 2 For a VM with 1 virtual socket and 4 cores, the behavior is again correct. [0.016198] topology_update_package_map: apicid 0 pkg 0 cpu 0 [0.016271] smpboot: APIC(0) Converting physical 0 to logical package 0, cpu 0 (88023fc0a040) [0.016273] topology_update_package_map: apicid 1 pkg 0 cpu 1 [0.016336] topology_update_package_map: apicid 2 pkg 0 cpu 2 [0.016397] topology_update_package_map: apicid 3 pkg 0 cpu 3 It looks like VMware might have some assumption about the minimum number of cores on a virtual socket. Regardless, the fix solves my problem! Thanks!
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 11:03 AM, Peter Zijlstra wrote: On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote: Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. There exist Virtualbox and Xen implementations which violate the spec. As a ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy virt stuff. Yes, this was VMware in particular. It would be good to get this comment right so as not to mislead anyone. /* + * The physical to logical package id mapping is initialized from the + * acpi/mptables information. Make sure that CPUID actually agrees with + * that. + */ +static void sanitize_package_id(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int pkg, apicid, cpu = smp_processor_id(); + + apicid = apic->cpu_present_to_apicid(cpu); + pkg = apicid >> boot_cpu_data.x86_coreid_bits; + + if (apicid != c->initial_apicid) { + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: %x\n", + cpu, apicid, c->initial_apicid); Should we not also 'fix' c->initial_apicid ? Since we have c->apicid and c->initial_apicid it seems reasonable to keep one set to the "correct" value. I don't think c->initial_apicid is used past this. I should have some tests on this patch later today.
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On 11/09/2016 11:03 AM, Peter Zijlstra wrote: On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote: Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. There exist Virtualbox and Xen implementations which violate the spec. As a ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy virt stuff. Yes, this was VMware in particular. It would be good to get this comment right so as not to mislead anyone. /* + * The physical to logical package id mapping is initialized from the + * acpi/mptables information. Make sure that CPUID actually agrees with + * that. + */ +static void sanitize_package_id(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int pkg, apicid, cpu = smp_processor_id(); + + apicid = apic->cpu_present_to_apicid(cpu); + pkg = apicid >> boot_cpu_data.x86_coreid_bits; + + if (apicid != c->initial_apicid) { + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: %x\n", + cpu, apicid, c->initial_apicid); Should we not also 'fix' c->initial_apicid ? Since we have c->apicid and c->initial_apicid it seems reasonable to keep one set to the "correct" value. I don't think c->initial_apicid is used past this. I should have some tests on this patch later today.
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote: > Both ACPI and MP specifications require that the APIC id in the respective > tables must be the same as the APIC id in CPUID. > > The kernel retrieves the physical package id from the APIC id during the > ACPI/MP table scan and builds the physical to logical package map. > > There exist Virtualbox and Xen implementations which violate the spec. As a ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy virt stuff. > /* > + * The physical to logical package id mapping is initialized from the > + * acpi/mptables information. Make sure that CPUID actually agrees with > + * that. > + */ > +static void sanitize_package_id(struct cpuinfo_x86 *c) > +{ > +#ifdef CONFIG_SMP > + unsigned int pkg, apicid, cpu = smp_processor_id(); > + > + apicid = apic->cpu_present_to_apicid(cpu); > + pkg = apicid >> boot_cpu_data.x86_coreid_bits; > + > + if (apicid != c->initial_apicid) { > + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: > %x\n", > +cpu, apicid, c->initial_apicid); Should we not also 'fix' c->initial_apicid ? > + } > + if (pkg != c->phys_proc_id) { > + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of > %u\n", > +cpu, pkg, c->phys_proc_id); > + c->phys_proc_id = pkg; > + } > + c->logical_proc_id = topology_phys_to_logical_pkg(pkg); > +#else > + c->locical_proc_id = 0; UP FTW ;-) > +#endif > +}
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Wed, Nov 09, 2016 at 04:35:51PM +0100, Thomas Gleixner wrote: > Both ACPI and MP specifications require that the APIC id in the respective > tables must be the same as the APIC id in CPUID. > > The kernel retrieves the physical package id from the APIC id during the > ACPI/MP table scan and builds the physical to logical package map. > > There exist Virtualbox and Xen implementations which violate the spec. As a ISTR it was VMware, not VirtualBox, but whatever.. they're both crazy virt stuff. > /* > + * The physical to logical package id mapping is initialized from the > + * acpi/mptables information. Make sure that CPUID actually agrees with > + * that. > + */ > +static void sanitize_package_id(struct cpuinfo_x86 *c) > +{ > +#ifdef CONFIG_SMP > + unsigned int pkg, apicid, cpu = smp_processor_id(); > + > + apicid = apic->cpu_present_to_apicid(cpu); > + pkg = apicid >> boot_cpu_data.x86_coreid_bits; > + > + if (apicid != c->initial_apicid) { > + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: > %x\n", > +cpu, apicid, c->initial_apicid); Should we not also 'fix' c->initial_apicid ? > + } > + if (pkg != c->phys_proc_id) { > + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of > %u\n", > +cpu, pkg, c->phys_proc_id); > + c->phys_proc_id = pkg; > + } > + c->logical_proc_id = topology_phys_to_logical_pkg(pkg); > +#else > + c->locical_proc_id = 0; UP FTW ;-) > +#endif > +}
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Wed, 9 Nov 2016, Thomas Gleixner wrote: > + if (pkg != c->phys_proc_id) { > + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of > %u\n", > +cpu, pkg, c->phys_proc_id); > + c->phys_proc_id = pkg; > + } > + c->logical_proc_id = topology_phys_to_logical_pkg(pkg); > +#else > + c->locical_proc_id = 0; That want's to be logical_proc_id of course ... Stupid me.
Re: [PATCH] x86/cpuid: Deal with broken firmware once more
On Wed, 9 Nov 2016, Thomas Gleixner wrote: > + if (pkg != c->phys_proc_id) { > + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of > %u\n", > +cpu, pkg, c->phys_proc_id); > + c->phys_proc_id = pkg; > + } > + c->logical_proc_id = topology_phys_to_logical_pkg(pkg); > +#else > + c->locical_proc_id = 0; That want's to be logical_proc_id of course ... Stupid me.
[PATCH] x86/cpuid: Deal with broken firmware once more
Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings. The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. If the physical package IDs differ use the package information from the ACPI/MP tables so the existing logical package map just works. Reported-by: "Charles (Chas) Williams", Reported-by: M. Vefa Bicakci Signed-off-by: Thomas Gleixner --- arch/x86/kernel/cpu/common.c | 31 +-- 1 file changed, 29 insertions(+), 2 deletions(-) --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -979,6 +979,34 @@ static void x86_init_cache_qos(struct cp } /* + * The physical to logical package id mapping is initialized from the + * acpi/mptables information. Make sure that CPUID actually agrees with + * that. + */ +static void sanitize_package_id(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int pkg, apicid, cpu = smp_processor_id(); + + apicid = apic->cpu_present_to_apicid(cpu); + pkg = apicid >> boot_cpu_data.x86_coreid_bits; + + if (apicid != c->initial_apicid) { + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: %x\n", + cpu, apicid, c->initial_apicid); + } + if (pkg != c->phys_proc_id) { + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of %u\n", + cpu, pkg, c->phys_proc_id); + c->phys_proc_id = pkg; + } + c->logical_proc_id = topology_phys_to_logical_pkg(pkg); +#else + c->locical_proc_id = 0; +#endif +} + +/* * This does the hard work of actually picking apart the CPU stuff... */ static void identify_cpu(struct cpuinfo_x86 *c) @@ -1103,8 +1131,7 @@ static void identify_cpu(struct cpuinfo_ #ifdef CONFIG_NUMA numa_add_cpu(smp_processor_id()); #endif - /* The boot/hotplug time assigment got cleared, restore it */ - c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id); + sanitize_package_id(c); } /*
[PATCH] x86/cpuid: Deal with broken firmware once more
Both ACPI and MP specifications require that the APIC id in the respective tables must be the same as the APIC id in CPUID. The kernel retrieves the physical package id from the APIC id during the ACPI/MP table scan and builds the physical to logical package map. There exist Virtualbox and Xen implementations which violate the spec. As a result the physical to logical package map, which relies on the ACPI/MP tables does not work on those systems, because the CPUID initialized physical package id does not match the firmware id. This causes system crashes and malfunction due to invalid package mappings. The only way to cure this is to sanitize the physical package id after the CPUID enumeration and yell when the APIC ids are different. If the physical package IDs differ use the package information from the ACPI/MP tables so the existing logical package map just works. Reported-by: "Charles (Chas) Williams" , Reported-by: M. Vefa Bicakci Signed-off-by: Thomas Gleixner --- arch/x86/kernel/cpu/common.c | 31 +-- 1 file changed, 29 insertions(+), 2 deletions(-) --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -979,6 +979,34 @@ static void x86_init_cache_qos(struct cp } /* + * The physical to logical package id mapping is initialized from the + * acpi/mptables information. Make sure that CPUID actually agrees with + * that. + */ +static void sanitize_package_id(struct cpuinfo_x86 *c) +{ +#ifdef CONFIG_SMP + unsigned int pkg, apicid, cpu = smp_processor_id(); + + apicid = apic->cpu_present_to_apicid(cpu); + pkg = apicid >> boot_cpu_data.x86_coreid_bits; + + if (apicid != c->initial_apicid) { + pr_err(FW_BUG "CPU%u: APIC id mismatch. Firmware: %x CPUID: %x\n", + cpu, apicid, c->initial_apicid); + } + if (pkg != c->phys_proc_id) { + pr_err(FW_BUG "CPU%u: Using firmware package id %u instead of %u\n", + cpu, pkg, c->phys_proc_id); + c->phys_proc_id = pkg; + } + c->logical_proc_id = topology_phys_to_logical_pkg(pkg); +#else + c->locical_proc_id = 0; +#endif +} + +/* * This does the hard work of actually picking apart the CPU stuff... */ static void identify_cpu(struct cpuinfo_x86 *c) @@ -1103,8 +1131,7 @@ static void identify_cpu(struct cpuinfo_ #ifdef CONFIG_NUMA numa_add_cpu(smp_processor_id()); #endif - /* The boot/hotplug time assigment got cleared, restore it */ - c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id); + sanitize_package_id(c); } /*