On 1/11/2024 11:25, Kim, Jonathan wrote:

>> This looks OK. The compiler must be warning about a potential problem
>> here, not a definite one.
>>
>> Question for Jon, how does the firmware encode the error code in the
>> context ID? I see these macros:
>>
>> #define KFD_DEBUG_CP_BAD_OP_ECODE_MASK          0x3fffc00
>> #define KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT         10
>> #define KFD_DEBUG_CP_BAD_OP_ECODE(ctxid0) (((ctxid0) &                  \
>>                                  KFD_DEBUG_CP_BAD_OP_ECODE_MASK)         \
>>                                  >> KFD_DEBUG_CP_BAD_OP_ECODE_SHIFT)
>>
>> It looks like we have 16 bits for the ECODE. That's enough to have a bit
>> mask. Do we really need KFD_EC_MASK to convert an error number into a
>> bitmask here?
> 
> Added Jay for confirmation.
> I could be wrong but IIRC (and I'm quite fuzzy on this ... probably should 
> document this), unlike the wave trap code interrupt mask (bit mask) the CP 
> bad op code is a single error code that directly points to one of the 
> exception code enums that we defined in the user API header.
> If that's the case, the KFD_EC_MASK is convenient for the kfd debugger code 
> to mask the payload to send to the debugger or runtime.
> If that's been wrong this whole time (i.e. the bad ops code is actually a 
> bitwise mask of ecodes), then I'm not sure how we were able to get away with 
> running the runtime negative tests for as long as we have and we'd need to 
> recheck those tests.

That's right. Queue errors are serialized. The error code is recorded directly.

Wavefront errors may occur concurrently within a wavefront. Those are recorded 
as a bitmask.

>> In the code above, if ecode is 0, that would lead to calling
>> kfd_set_dbg_ev_from_interrupt with a event mask of 0. Not sure if that
>> even makes sense. Jon, so we need special handling of cases where the
>> error code is 0 or out of range, so we can warn about buggy firmware
>> rather than creating nonsensical events for the debugger?
> 
> That makes sense.  Again, deferring to Jay if a NULL cp bad op code is 
> expected under any circumstances.
> Either way, raising undefined events to the debugger or runtime isn't useful 
> so range checking to filter out non-encoded cp bad op interrupts would be 
> needed.

On AQL queues this interrupt carries an error code beginning from 16.

Reply via email to