On 11/2/25 10:14 AM, Timur Tabi wrote:
On Sat, 2025-11-01 at 18:36 -0700, John Hubbard wrote:
NVIDIA GPUs are moving away from using NV_PMC_BOOT_0 to contain
architecture and revision details, and will instead use NV_PMC_BOOT_42
in the future. NV_PMC_BOOT_0 will be zeroed out.

You missed this one.  Boot0 will not be completely zeroed out.


Thanks for catching that, I'll write it like the other case.


+impl TryFrom<regs::NV_PMC_BOOT_42> for Spec {
+    type Error = Error;
+
+    fn try_from(boot42: regs::NV_PMC_BOOT_42) -> Result<Self> {
+        Ok(Self {
+            chipset: boot42.chipset()?,
+            revision: boot42.revision(),
+        })
+    }
+}
+
  impl fmt::Display for Revision {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          write!(f, "{:x}.{:x}", self.major, self.minor)
@@ -169,9 +180,34 @@ pub(crate) struct Spec {
 impl Spec {
      fn new(bar: &Bar0) -> Result<Spec> {
+        // Some brief notes about boot0 and boot42, in chronological order:
+        //
+        // NV04 through Volta:
+        //
+        //    Not supported by Nova. boot0 is necessary and sufficient to identify these GPUs.
+        //    boot42 may not even exist on some of these GPUs.boot42

Did you intend to write more than just "boot42" at the end here?

Nope, that's just an odd typo fragment that I need to delete, thanks
for spotting it.

...
          let boot0 = regs::NV_PMC_BOOT_0::read(bar);
-        Spec::try_from(boot0)
+        if boot0.use_boot42_instead() {
+            Spec::try_from(regs::NV_PMC_BOOT_42::read(bar))
+        } else {
+            Spec::try_from(boot0)
+        }
      }
  }
diff --git a/drivers/gpu/nova-core/regs.rs b/drivers/gpu/nova-core/regs.rs
index 207b865335af..8b5ff3858210 100644
--- a/drivers/gpu/nova-core/regs.rs
+++ b/drivers/gpu/nova-core/regs.rs
@@ -25,6 +25,13 @@
  });
 impl NV_PMC_BOOT_0 {
+    pub(crate) fn use_boot42_instead(self) -> bool {
+        // "Future" GPUs (some time after Rubin) will set `architecture_0`
+        // to 0, and `architecture_1` to 1, and put the architecture details in
+        // boot42 instead.
+        self.architecture_0() == 0 && self.architecture_1() == 1
+    }

So this was the crux of my initial objection, and I just don't think this is truly "forward
looking".  The code is using boot42 only if boot0 is "zeroed out".  So sometimes Nova will use

To put it another way: the code is only using boot42 if boot0 is
encoded, by the HW team, to go read boot42. As you know, the future
ref manual literally says "go read boot42."

boot0 and sometimes it will use boot42, depending on the GPU.  It's this inconsistency that
bothers me.

Instead, I think Nova should use only boot42, so that we have consistent information across all
GPUs.  boot0 should only be used to avoid accidentally reading boot42 when it doesn't exist.

I am convinced that the most appropriate thing for a device driver
to do is to match what the HW configuration says. We should draw
the dividing line at the changeover point, which is in an upcoming
ref manual.

Once boot0 has the encoding set to "go read boot42", the driver
does that. Until then, HW promises that boot0 is correct.

It may look nice and neat to use "Nova is a new driver" as the reason to
pick the changeover point, but again, it's more accurate and
appropriate for a device driver to follow HW's lead, and use
what boot0 says to do.
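A minimal, self-contained sketch of the selection policy described above, under stated assumptions: `Boot0`, `Boot42`, `MockBar` and all field values are hypothetical stand-ins (not nova-core's register types or Bar0), and the patch's `Result`/`TryFrom` error handling for unknown chipsets is left out. It only illustrates that boot0 stays authoritative until it carries the "go read boot42" encoding, and that boot42 is never read before that.

```rust
use std::cell::Cell;

/// Hypothetical, already-decoded view of NV_PMC_BOOT_0: only the two fields
/// that `use_boot42_instead()` inspects, plus the chip identity it carries.
#[derive(Clone, Copy, Debug)]
struct Boot0 {
    architecture_0: u8,
    architecture_1: u8,
    chipset: u32,
}

impl Boot0 {
    /// Same test as in the patch: this encoding means "go read boot42".
    fn use_boot42_instead(self) -> bool {
        self.architecture_0 == 0 && self.architecture_1 == 1
    }
}

/// Hypothetical decoded view of NV_PMC_BOOT_42.
#[derive(Clone, Copy, Debug)]
struct Boot42 {
    chipset: u32,
}

/// Stand-in for Bar0: returns canned register values and counts boot42 reads.
struct MockBar {
    boot0: Boot0,
    boot42: Boot42,
    boot42_reads: Cell<u32>,
}

impl MockBar {
    fn read_boot0(&self) -> Boot0 {
        self.boot0
    }

    fn read_boot42(&self) -> Boot42 {
        self.boot42_reads.set(self.boot42_reads.get() + 1);
        self.boot42
    }
}

#[derive(Debug, PartialEq)]
struct Spec {
    chipset: u32,
}

impl Spec {
    /// Follow the hardware's lead: boot0 says which register is authoritative.
    fn new(bar: &MockBar) -> Spec {
        let boot0 = bar.read_boot0();
        if boot0.use_boot42_instead() {
            Spec { chipset: bar.read_boot42().chipset }
        } else {
            Spec { chipset: boot0.chipset }
        }
    }
}

fn main() {
    // Illustrative values only, not real chip IDs.
    let current = MockBar {
        boot0: Boot0 { architecture_0: 0x19, architecture_1: 0, chipset: 0x123 },
        boot42: Boot42 { chipset: 0x123 },
        boot42_reads: Cell::new(0),
    };
    let future = MockBar {
        boot0: Boot0 { architecture_0: 0, architecture_1: 1, chipset: 0 },
        boot42: Boot42 { chipset: 0x456 },
        boot42_reads: Cell::new(0),
    };

    let spec = Spec::new(&current);
    println!("{:?} (boot42 reads: {})", spec, current.boot42_reads.get());
    let spec = Spec::new(&future);
    println!("{:?} (boot42 reads: {})", spec, future.boot42_reads.get());
}
```

The dividing line lives entirely in `use_boot42_instead()`, so the changeover point is whatever the hardware encodes rather than a driver-chosen cutoff.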



Previously, Danilo said this:

I think you're indeed talking about the same thing, but thinking differently
about the implementation details.

A standalone is_ancient_gpu() function called from probe() like

        if is_ancient_gpu(bar) {
                return Err(ENODEV);
        }

is what we would probably do in C, but in Rust we should just call

        let spec = Spec::new()?;

from probe() and Spec::new() will return Err(ENODEV) when it runs into an ancient
GPU spec internally.

This I agree with.  The first thing that Spec::new() should do is check whether
we're on an ancient GPU that does not even have boot42.  If so, return
Err(ENODEV).  Otherwise, from that point onward, no code will ever look at
boot0 again.  boot0 should never be used to return the actual
architecture/GPU information.
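A rough sketch of this alternative, again with hypothetical stand-ins (`MockBar`, and `Error::NoDevice` in place of the kernel's ENODEV) and two loud assumptions: that Turing (architecture 0x16) is the first architecture guaranteed to implement boot42, and that the full architecture value is architecture_0 with architecture_1 as the next bit up. Here boot0 is only a gate; every accepted GPU takes its identity from boot42, and probe() just propagates the error with `?` instead of calling a separate is_ancient_gpu().

```rust
use std::cell::Cell;

/// Stand-in for the kernel's ENODEV error.
#[derive(Debug, PartialEq)]
enum Error {
    NoDevice,
}

/// Assumption: Turing (0x16) is the first architecture Nova supports and the
/// first one that is guaranteed to implement boot42.
const FIRST_SUPPORTED_ARCH: u8 = 0x16;

/// Hypothetical decoded view of NV_PMC_BOOT_0 (architecture fields only).
#[derive(Clone, Copy)]
struct Boot0 {
    architecture_0: u8,
    architecture_1: u8,
}

impl Boot0 {
    /// Assumption: architecture_0 is a 5-bit field and architecture_1 is the
    /// next bit up, so the "go read boot42" encoding (0, 1) decodes to 0x20.
    fn architecture(self) -> u8 {
        self.architecture_0 | (self.architecture_1 << 5)
    }
}

/// Hypothetical decoded view of NV_PMC_BOOT_42.
#[derive(Clone, Copy)]
struct Boot42 {
    chipset: u32,
}

/// Stand-in for Bar0: returns canned register values and counts boot42 reads.
struct MockBar {
    boot0: Boot0,
    boot42: Boot42,
    boot42_reads: Cell<u32>,
}

impl MockBar {
    fn read_boot0(&self) -> Boot0 {
        self.boot0
    }

    fn read_boot42(&self) -> Boot42 {
        self.boot42_reads.set(self.boot42_reads.get() + 1);
        self.boot42
    }
}

struct Spec {
    chipset: u32,
}

impl Spec {
    fn new(bar: &MockBar) -> Result<Spec, Error> {
        // Gate only: older parts may not implement boot42, so bail out
        // before ever reading it.
        if bar.read_boot0().architecture() < FIRST_SUPPORTED_ARCH {
            return Err(Error::NoDevice);
        }
        // From here on, boot42 is the single source of chip identity.
        Ok(Spec { chipset: bar.read_boot42().chipset })
    }
}

/// probe() propagates the rejection with `?`; no separate is_ancient_gpu().
fn probe(bar: &MockBar) -> Result<u32, Error> {
    let spec = Spec::new(bar)?;
    Ok(spec.chipset)
}

fn main() {
    // Illustrative values only, not real chip IDs.
    let volta_like = MockBar {
        boot0: Boot0 { architecture_0: 0x14, architecture_1: 0 },
        boot42: Boot42 { chipset: 0 },
        boot42_reads: Cell::new(0),
    };
    let future = MockBar {
        boot0: Boot0 { architecture_0: 0, architecture_1: 1 },
        boot42: Boot42 { chipset: 0x456 },
        boot42_reads: Cell::new(0),
    };
    println!("{:?}", probe(&volta_like));
    println!("{:?}", probe(&future));
}
```

Under the assumed bit layout, the future "go read boot42" encoding decodes to 0x20 and therefore passes the gate, so this variant and the one in the patch agree on future GPUs; they differ only on current GPUs, where this one reads boot42 and the patch reads boot0.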


I don't think we have a conflict on this point, if you read through how
the code works. The only difference is the point I wrote about above.

I'm hoping you'll allow me to proceed with that.

thanks,
--
John Hubbard
