Re: More on AMD Athlon/AGP stability issue

Benjamin Scott Thu, 24 Jan 2002 14:48:59 -0800

On Thu, 24 Jan 2002, Steven W. Orr wrote:
> I'm not totally stoopid, but I really am not understanding this issue
> very well.


  You are not alone.  There has been a fair amount of discussion on LKML
(the Linux Kernel Mailing List) about the issue.  *This issue is not nailed
down yet*.  It is entirely possible that further investigation will discover
new information that invalidates previous conclusions.  The story continues
to unfold, as they say in the news business.

> I'd pay money to go to a GNHLUG meeting where someone could explain this
> to me with only a slightly restricted number of syllables per word.

  This is my understanding of the issue.  I *know* it is not complete, and
it may well be completely bogus.  But for lack of anything better:

  The Athlon processor engages in something called "speculative writes".  I
am not quite sure how the speculation works, but the result is that data in
the processor cache is written out to main memory "early".

  AGP has something called a GART (Graphics Address Remapping Table) that
lets the video card access main memory in a direct fashion, to increase
performance.  Or something like that.

  The kernel is responsible for mapping main memory to virtual memory.  It
also is responsible for marking data as cacheable by the various levels and
layers of memory caching in the system.

  The kernel is marking data being written to the AGP card as cacheable, and
so the Athlon processor is caching it.  However, the GART is not aware of
this caching, and is doing something not quite compatible.  The result is
that data in the processor cache does not match data in some other location.
When the Athlon does the speculative write to write the cache to main
memory, everything goes to hell.

  Apparently, these speculative writes are sane and allowed by the Athlon
design, and the GART behavior is allowed by the AGP design, and the problem
is that the kernel is marking memory as cacheable when it should not.
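
  To make the mismatch a bit more concrete, here is a toy model in Python.
It is only a sketch of my description above, not of how an Athlon or a GART
really works: ToyMemory, ToyCpuCache and demo are made-up names, and the
"uncached path" is just a direct write to the toy memory.

```python
# Toy model only: one memory location reached through two paths.
# All names here are hypothetical and exist purely for illustration.

class ToyMemory:
    """Main memory as a simple address -> value table."""
    def __init__(self):
        self.cells = {}

    def read(self, addr):
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.cells[addr] = value


class ToyCpuCache:
    """A write-back cache that keeps its own copy of each line."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # addr -> cached value

    def load(self, addr):
        # Fill from memory on first touch; afterwards serve the cached copy.
        if addr not in self.lines:
            self.lines[addr] = self.memory.read(addr)
        return self.lines[addr]

    def write_back(self, addr):
        # Push the cached copy out to memory, whatever memory holds now.
        if addr in self.lines:
            self.memory.write(addr, self.lines[addr])


def demo():
    mem = ToyMemory()
    cache = ToyCpuCache(mem)
    addr = 0x1000                # made-up address

    mem.write(addr, 1)
    cache.load(addr)             # CPU caches the value 1
    mem.write(addr, 2)           # uncached path updates memory behind the cache
    stale = cache.load(addr)     # CPU still sees the old value
    cache.write_back(addr)       # "early" write-back overwrites the newer value
    return stale, mem.read(addr)
```

  In this toy, demo() shows the cache still holding the old value after the
uncached write, and the write-back then replacing the newer value in
memory.  That is the flavor of "everything goes to hell" I mean, nothing
more.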

  I do not understand the details here, so if this seems like hand-waving,
that is because it is.  :-)

  As for why "mem=nopentium" and the memory page size would make a
difference, well, the kernel folks aren't too sure of that either.  It may
be an accident having to do with page alignment boundaries, or it may just
reduce (but not eliminate) the chance of the bug triggering, or who knows
what.
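
  For what it is worth, "mem=nopentium" tells the kernel not to use the
large 4 MB pages for kernel memory.  The sketch below only shows the
arithmetic of how much memory ends up sharing one page's attributes at each
page size.  The buffer address and length are invented, and whether this
has anything to do with the real cause is exactly the part nobody is sure
about.

```python
# Arithmetic sketch only.  The buffer address and size used with it are
# made up; nothing here is taken from the kernel source.

SMALL_PAGE = 4 * 1024           # 4 KB x86 page
LARGE_PAGE = 4 * 1024 * 1024    # 4 MB x86 large page


def pages_touching(start, length, page_size):
    """Base addresses of the pages that overlap [start, start + length)."""
    first = start // page_size
    last = (start + length - 1) // page_size
    return [n * page_size for n in range(first, last + 1)]


def bytes_mapped(start, length, page_size):
    """Total bytes covered by the pages that overlap the buffer."""
    return len(pages_touching(start, length, page_size)) * page_size


def extra_bytes(start, length, page_size):
    """Bytes outside the buffer that share its pages (and their attributes)."""
    return bytes_mapped(start, length, page_size) - length
```

  For a hypothetical 64 KB buffer at 0x01234000, calling
extra_bytes(0x01234000, 64 * 1024, SMALL_PAGE) and then the same with
LARGE_PAGE shows how much unrelated memory gets lumped in with the buffer
when the page is large.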

  This stuff is heavy wizardry [1].  :-)

  (It also underscores the concern many have with the 2.4 kernel: If almost
no one really understands the kernel's memory manager design and
implementation, how can we be sure it works? [2])

Footnotes
---------
[1] http://www.tuxedo.org/~esr/jargon/html/entry/heavy-wizardry.html
[2] See LKML postings last month complaining about this.

-- 
Ben Scott <[EMAIL PROTECTED]>
| The opinions expressed in this message are those of the author and do not |
| necessarily represent the views or policy of any other person, entity or  |
| organization.  All information is provided without warranty of any kind.  |


