On 1/11/2024 11:25, Kim, Jonathan wrote: >> This looks OK. The compiler must be warning about a potential problem >> here, not a definite one. >> >> Question for Jon, how does the firmware encode the error code in the >> context ID? I see these macros: >> >> #define KFD_DEBUG_CP_BAD_OP_ECODE_MASK 0x3fffc00 >> #define KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT 10 >> #define KFD_DEBUG_CP_BAD_OP_ECODE(ctxid0) (((ctxid0) & \ >> KFD_DEBUG_CP_BAD_OP_ECODE_MASK) \ >> >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT) >> >> It looks like we have 16 bits for the ECODE. That's enough to have a bit >> mask. Do we really need KFD_EC_MASK to convert an error number into a >> bitmask here? > > Added Jay for confirmation. > I could be wrong but IIRC (and I'm quite fuzzy on this ... probably should > document this), unlike the wave trap code interrupt mask (bit mask) the CP > bad op code is a single error code that directly points to one of the > exception code enums that we defined in the user API header. > If that's the case, the KFD_EC_MASK is convenient for the kfd debugger code > to mask the payload to send to the debugger or runtime. > If that's been wrong this whole time (i.e. the bad ops code is actually a > bitwise mask of ecodes), then I'm not sure how we were able to get away with > running the runtime negative tests for as long as we have and we'd need to > recheck those tests.
That's right. Queue errors are serialized. The error code is recorded directly. Wavefront errors may occur concurrently within a wavefront. Those are recorded as a bitmask. >> In the code above, if ecode is 0, that would lead to calling >> kfd_set_dbg_ev_from_interrupt with a event mask of 0. Not sure if that >> even makes sense. Jon, so we need special handling of cases where the >> error code is 0 or out of range, so we can warn about buggy firmware >> rather than creating nonsensical events for the debugger? > > That makes sense. Again, deferring to Jay if a NULL cp bad op code is > expected under any circumstances. > Either way, raising undefined events to the debugger or runtime isn't useful > so range checking to filter out non-encoded cp bad op interrupts would be > needed. On AQL queues this interrupt carries an error code beginning from 16.
