Steve,
Should we commit Geoff's fix or is there a better way to fix it?

Ali


On Dec 29, 2007, at 1:39 PM, Steve Reinhardt wrote:

OK, great, glad you tracked that bug down.  Your fix is a pretty good
one, but I think the right answer is that CPU3 in your example should
not issue a ReadEx... if it knows that it's requesting the block for a
store conditional, and it sees that the block has been invalidated, it
should fail the store conditional without getting an exclusive copy.
In fact the current behavior is broken in that it can lead to
livelock; if there are a lot of CPUs doing what CPU3 is doing at the
same time, then they could prevent any cache from successfully
completing an ll/sc sequence.  Could this be what you're seeing at 16
CPUs?
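
A minimal sketch of the policy described above, written in Python rather than taken from the simulator source. The names CacheLine, load_locked, snoop_invalidate and store_conditional_request are hypothetical and only illustrate the idea: if the line was invalidated since the load-locked, the store conditional fails locally and no bus request (in particular no ReadEx) is issued.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheLine:
    valid: bool = False
    locked: bool = False  # armed by load-locked, cleared on invalidation


def load_locked(line: CacheLine) -> None:
    """Model a load-locked: the line becomes readable and the lock flag is armed."""
    line.valid = True
    line.locked = True


def snoop_invalidate(line: CacheLine) -> None:
    """Model another cache winning the bus: drop the line and the lock flag."""
    line.valid = False
    line.locked = False


def store_conditional_request(line: CacheLine) -> Optional[str]:
    """Return the bus request a store conditional should issue, or None.

    None means the store conditional fails locally without requesting
    an exclusive copy, so it cannot disturb another cache's ll/sc sequence.
    """
    if not (line.valid and line.locked):
        return None
    return "UpgradeReq"
```

In this sketch a CPU whose line was invalidated between the load-locked and the store conditional gets None back and simply retries the whole ll/sc sequence, rather than stealing the block from the CPU that is about to succeed.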

As far as the "allocating bonus target for snoop" messages, you
shouldn't worry about those; the best thing is probably just to up the
number of targets per MSHR and that should go away.  The issue is that
we use up an MSHR target when we save a request for a deferred snoop,
but since there's no way to nack a snoop, we really have no choice
once the MSHR's targets are full but to keep allocating them anyway.
So until/unless we come up with a way to nack snoops, which we
probably never will, then this really should be a warning that the
number of targets per MSHR is set too low.  There is an upper bound on
the number of targets that would be needed, basically the sum of the
max number of outstanding accesses from above (which is a function of
the CPU model for an L1 or the number of caches above for an L2+),
plus the max number of outstanding snoops for a single block (which
would be a function of the number of other caches in the system).
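
The bound above is just a sum of two terms, which can be written out as a small helper. This is an illustrative sketch only: the function name and the example numbers are assumptions, and how each term is derived from the CPU model or the number of caches is left to the caller because the message does not spell it out.

```python
def mshr_target_upper_bound(max_accesses_from_above: int,
                            max_snoops_per_block: int) -> int:
    """Upper bound on targets a single MSHR could need.

    max_accesses_from_above: max outstanding accesses from the CPU (for an
        L1) or from the caches above (for an L2+).
    max_snoops_per_block: max outstanding snoops for one block, which
        depends on the number of other caches in the system.
    """
    if max_accesses_from_above < 0 or max_snoops_per_block < 0:
        raise ValueError("counts must be non-negative")
    return max_accesses_from_above + max_snoops_per_block
```

For example, with a made-up 8 outstanding accesses from above and 3 possible snoops for one block, the helper returns 11, so setting the number of targets per MSHR to at least 11 would avoid the warning in that configuration.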

Let me know if there's anything else I can help with.

Steve

On Dec 29, 2007 6:41 AM, Geoffrey Blake <[EMAIL PROTECTED]> wrote:
Steve,

What you described below is exactly what was happening when I went
through the bus and cache traces. With more than 2 CPUs, you would get
into a condition where CPU1 would release its spin-lock, then CPU2 and
CPU3 would both read the line and try to do a store-conditional. At
this point there are 2 UpgradeReq's trying to get the bus. Say CPU2
gets the bus first, so it invalidates CPU1's and CPU3's cache lines.
CPU3 gets the bus next and issues a ReadExReq because its line was
invalidated; this then invalidates CPU2's pending cache fill. CPU3 will
fail the store-conditional and mark the line as exclusive only. If
another CPU tries to read the same line to get the spin lock, it will
get a stale value from a lower level of cache, ignoring the up-to-date
value. This makes the kernel do some bizarre things, as you would
imagine. For 16+ CPUs there is something else wrong, but that one gets
many of the "warn: bonus snoop allocated" messages, so I'm wondering
what could be happening there.

Geoff


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:m5-users- [EMAIL PROTECTED] On
Behalf Of Steve Reinhardt
Sent: Saturday, December 29, 2007 12:26 AM
To: M5 users mailing list
Subject: Re: [m5-users] full-system issue in m5 beta 4

Geoff,

Do you have any more information on what the problem was that this
patch fixes?  On the face of it, the patch doesn't make sense... the
original code only marks the block dirty if the block was written to,
while with your patch it will get marked as dirty even in the case of
a failed store conditional that doesn't actually modify the block. So
locally it seems wrong.

However I can imagine that there might be some global situation where
a block is dirty (owned) in cache A, and then cache B requests an
exclusive copy for a store conditional, but in the meantime something
else happens that causes the store conditional to fail. So then cache
B gets A's dirty copy, but fails to mark it as dirty, so then later it
doesn't get written back, and A's modification is lost.  Does this
sound like what's happening?  If so, then this may well be the right
fix, but I'd have to think about it a little more... the key issue is
that right now when a cache receives an exclusive copy of a block it
doesn't really pay attention to whether it's getting it from memory
(in which case it's OK not to mark it dirty) or from another cache (in
which case it must be marked dirty).
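
The distinction in the paragraph above can be sketched as follows. This is not the simulator's code: Block, fill_exclusive and the two flags are hypothetical names, and the sketch assumes the cache can tell whether the exclusive copy was supplied by another cache or by memory.

```python
from dataclasses import dataclass


@dataclass
class Block:
    writable: bool = False
    dirty: bool = False


def fill_exclusive(block: Block, supplied_by_cache: bool,
                   store_succeeded: bool) -> Block:
    """Install an exclusive copy and decide whether it must be dirty.

    A copy supplied by another cache may carry modifications that memory
    has not seen, so it is marked dirty even if the local store
    conditional failed. A copy from memory is dirty only if the local
    store actually wrote to it.
    """
    block.writable = True
    block.dirty = supplied_by_cache or store_succeeded
    return block
```

Calling fill_exclusive(Block(), supplied_by_cache=True, store_succeeded=False) leaves the block dirty, which is the case where the modification made by cache A would otherwise be lost.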

Though at this point I can't think of a reason it would be incorrect
to always mark the block dirty even if the store conditional fails...
you might suffer an extra writeback in some very rare circumstances,
but it should still be functionally correct. So perhaps your patch is
the right solution.

Steve

On Dec 28, 2007 9:16 PM, Nathan Binkert <[EMAIL PROTECTED]> wrote:
What are the implications of this diff? I'm not clear on the functions in
question, but if this was wrong for 4 and 8 cpus, it seems like it's just
fundamentally wrong.  Steve?

  Nate


To those looking for a fix to booting M5 in FS mode with more than 2
CPUs, I've attached a diff that fixes some of the problems. I have M5
booting with 4 and 8 CPUs using timing simple CPU and caches and the
l2cache.  With 16 CPUs and above, it's still getting stuck.

Geoff

Quoting Ali Saidi <[EMAIL PROTECTED]>:

Normally I add a -s to that command line because I want to create
checkpoints with the atomic cpu. I restore from the checkpoints
immediately into the timing cpu, where the caches are warmed up, and
then I switch to a detailed cpu model. The -w flag has no meaning
unless the -s (standard switch) flag is used.

You'll need to modify the scripts a little bit if you want to do
anything else. If you want to just transition into a timing cpu and
not into a detailed cpu you'll need to change line 64 in
Simulation.py from
    root.switch_cpus = switch_cpus
to
    testsys.switch_cpus = switch_cpus
and then add some code to alter the atomic warm up period.
Alternatively you could use the standard switch code and change the O3
cpu to another timing cpu if you wanted to end up with a simple cpu
model that would allow statistics to be collected on the other cpu
after the switch over.

Ali


On Dec 17, 2007, at 6:49 AM, abc def wrote:

I tried using the following command line:
build/ALPHA_FS/m5.opt configs/example/fs.py -n 4 -r 1 --timing --caches -w 50000000000
so that it switches to timing simple cpu only after warming up
caches with atomic simple cpu.
But nothing is happening in the console. It is not getting
restored from the checkpoint.

I am using system files from version b3.

Can you please forward me the command line you use
for booting up the timing simple cpu?


--- Ali Saidi <[EMAIL PROTECTED]> wrote:

It's another bug, but since we never really boot with timing and
caches it's not surprising that we haven't seen it before.

Ali

On Dec 16, 2007, at 11:43 PM, Nathan Binkert wrote:

This could honestly be just because it takes a long time.  With
timing and caches, the simulator is pretty slow.

This works if the caches option is not used.

But with L1 and L2 caches present and with multiple cpus it
is still getting stuck while booting.

command line used:
build/ALPHA_FS/m5.opt configs/example/fs.py -n 4 --timing --caches --l2cache

--- Ali Saidi <[EMAIL PROTECTED]> wrote:

There is an issue in b4 with when the CPU ids get assigned to CPUs
that can cause some weird behavior in all multi-processor
configurations (2,3,4, xxx cpus). The attached patch fixes those
problems.



Ali

On Dec 16, 2007, at 2:53 AM, Ali Saidi wrote:

Yea, you found a bug. I found the changeset that caused the problem,
and I'll try to figure out what is going on tomorrow and post a patch.

In the future please create a new topic on the mailing list by
sending a new message to m5-users@m5sim.org instead of replying to a
current topic and changing the subject. Replying to the same topic
and just changing the subject preserves the In-Reply-To mail header
and makes it more difficult to reconstruct threads of conversation
on the mailing list.

Ali

On Dec 15, 2007, at 7:57 PM, abc def wrote:

Timing simple cpu in full system mode in m5 beta 4 is
not booting up. In the console it is getting stuck
at "NET: Registered protocol family 2" and is not
proceeding forward.

System files are from:
http://www.m5sim.org/dist/current/m5_system_2.0b3.tar.bz2

This is happening if 4 cpus are used for booting. For
1 cpu it is ok.




_______________________________________________
m5-users mailing list
m5-users@m5sim.org


http://m5sim.org/cgi-bin/mailman/listinfo/m5-users



----- End forwarded message -----

