Hi Korey,
Yes, let's move this conversation back to m5-dev, since I think
others may be interested and could help.
I don't know exactly what the problem is, but at some point in time
(probably back in the early GEMS days) I seem to remember that the Set
code included an assertion check about the 31st bit in 32-bit mode.
Therefore, I think we knew about this problem and made sure that it
never happened. I believe that is why we used to have a restriction
that Ruby could support only 16 processors. I'm really fuzzy on the
details... maybe someone else can elaborate.
In the end, I just want to make sure we add something in the code
that makes sure we don't encounter this problem again. This is one of
those bugs that can take a while to track down if you don't catch it
with an assertion right when it happens.
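A guard of the kind described above could look roughly like the following. This is a minimal C++ sketch, not gem5 code: `kBitsPerWord`, `indexInRange`, and `addElement` are hypothetical names, and it assumes the set is stored as an array of `unsigned long` words.

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Hypothetical stand-in for the per-word bit count used by the Set loops.
// numeric_limits<unsigned long>::digits counts every bit of the word,
// whereas numeric_limits<long>::digits leaves out the sign bit.
constexpr int kBitsPerWord = std::numeric_limits<unsigned long>::digits;

// Compile-time guard: fail the build if the loop bound ever stops covering
// the whole word (for example, on a 32-bit target).
static_assert(kBitsPerWord ==
                  static_cast<int>(sizeof(unsigned long) * CHAR_BIT),
              "bit-count loop must cover every bit of the word");

// Run-time check: true if the word array can hold this element index.
inline bool
indexInRange(int index, int numWords)
{
    return index >= 0 && index < numWords * kBitsPerWord;
}

// Sets the bit for `index`, asserting first that the index is in range so
// that a bad index is caught where it happens.
inline void
addElement(unsigned long *words, int numWords, int index)
{
    assert(indexInRange(index, numWords));
    words[index / kBitsPerWord] |= 1UL << (index % kBitsPerWord);
}
```

The compile-time check costs nothing at run time, and the run-time assertion fires at the point where an out-of-range element is added rather than later, when a count comes back wrong.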
Brad
From: koreylsew...@gmail.com [mailto:koreylsew...@gmail.com] On
Behalf Of Korey Sewell
Sent: Tuesday, April 05, 2011 7:14 AM
To: Beckmann, Brad
Subject: Re: [m5-dev] Running Ruby w/32 Cores
Hi again Brad,
I looked this over again and although my 32-bit patch "fixes" things,
I'm not convinced that I actually fixed the cause of the bug rather
than just its symptom.
Do you happen to know what the problems are with the 32-bit Set
counts?
Sorry for prolonging the issue; I thought I had put this to bed, but
maybe not. Finally, it may not matter whether this works on 32-bit
machines, but it'd be nice if it did. (Let me know if I should move
this convo to the m5-dev list.)
I ended up checking the last bit manually in the count function (the
code is as follows):
int
Set::count() const
{
    int counter = 0;
    long mask;
    for (int i = 0; i < m_nArrayLen; i++) {
        mask = (long)0x01;
        for (int j = 0; j < LONG_BITS; j++) {
            // FIXME - significant performance loss when array
            // population << LONG_BITS
            if ((m_p_nArray[i] & mask) != 0) {
                counter++;
            }
            mask = mask << 1;
        }
#ifndef _LP64
        long msb_mask = 0x80000000;
        if ((m_p_nArray[i] & msb_mask) != 0) {
            counter++;
        }
#endif
    }
    return counter;
}
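The count above could also be written without a separate most-significant-bit check. The following is a minimal sketch under stated assumptions, not the patch that was applied: `countSetBits` is a hypothetical free function, and it assumes the words can be treated as `unsigned long`, so the top bit is an ordinary value bit on both 32-bit and 64-bit builds. It also touches on the FIXME, since the inner loop runs once per set bit instead of once per bit position.

```cpp
#include <vector>

// Hypothetical helper, not gem5 code: counts the set bits across an array
// of unsigned words.
inline int
countSetBits(const std::vector<unsigned long> &words)
{
    int counter = 0;
    for (unsigned long word : words) {
        // Kernighan's trick: word & (word - 1) clears the lowest set bit,
        // so this loop iterates once for each bit that is set.
        while (word != 0) {
            word &= word - 1;
            counter++;
        }
    }
    return counter;
}
```

For example, a single word with only bit 31 set (`0x80000000UL`) yields a count of 1 with no `#ifndef _LP64` special case.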
On Tue, Apr 5, 2011 at 1:30 AM, Korey Sewell <ksew...@umich.edu> wrote:
Brad, it looks like you were right on the money here. I found the
spot where it was returning the wrong value, via a SLICC function to
"count sharers for everyone except the owner".
I realized that the machine I use for testing is just a 32-bit
machine, and as you warned, there look to be issues with the Set type
there. I ran FFT with 32 cores on a 64-bit machine and it seems to
work correctly. I'll be running the full splash/parsec suites soon,
and that should stress Ruby a good bit :).
I have a patch that checks whether _LP64 is defined and, if it is
not, checks that last bit when doing the Set count function.
Thanks for being helpful in debugging. It was a "relatively" easy
bug, but as always, going through the code and becoming more
proficient at getting around it while trying to solve a bug is really
helpful.
On Fri, Apr 1, 2011 at 7:28 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:
Ok for the first trace, the critical line is the following:
348523 0 L2Cache L1_GETX ILOSX>IFLXO [0x16180,
line 0x16180] [NetDest (4) 0 - 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 - 0 0 - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - ]30
The L2Cache identifies that 31 caches have a shared copy and that L1
cache 9 (L1-9) is the owner.
When L1Cache 0 (L1-0) issues a GETX, the L2Cache issues 30 Inv
probes, forwards the GETX to L1-9, and sends an ack to L1-0 itself.
However, the L2 cache tells L1-0 to expect only 30 acks instead of
31. Could something be wrong with the NetDest::count() function or
the Set::count() function? I slightly modified my previous patch to
isolate the value that the NetDest::count() function is returning.
If it is returning 30 instead of 31, then it must be a problem with
NetDest. You are compiling gem5 as a 64-bit binary, right?
The second problem is essentially the same issue. L2Cache 31 (L2-31)
is the owner of the block, but I suspect NetDest is not counting bit
31 and thus is returning a count of 0, which causes the error.
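Both symptoms can be reproduced in isolation with a short sketch. This is a hedged C++ illustration, not gem5 code: `countLowBits`, `kSharers`, and `kValueBits` are hypothetical names. It ASSUMES that the count loop covers only the 31 non-sign bits of a 32-bit word (the thread does not state the loop bound) and that the NetDest printout lists L1-0 first.

```cpp
#include <cstdint>
#include <limits>

// Sharer vector from the trace: L1-0 through L1-31 are all set except
// bit 9, the owner (L1-9). That is 31 set bits in total.
constexpr std::uint32_t kSharers = 0xFFFFFFFFu & ~(1u << 9);

// ASSUMPTION: the buggy loop bound excludes the sign bit, as
// numeric_limits<int32_t>::digits does (31 rather than 32).
constexpr int kValueBits = std::numeric_limits<std::int32_t>::digits;

// Counts the set bits among the low `bits` positions of `word`, mirroring
// the shape of the inner loop in a per-bit count function.
inline int
countLowBits(std::uint32_t word, int bits)
{
    int counter = 0;
    for (int j = 0; j < bits; j++) {
        if ((word >> j) & 1u) {
            counter++;
        }
    }
    return counter;
}
```

Under these assumptions, `countLowBits(kSharers, kValueBits)` gives 30 where `countLowBits(kSharers, 32)` gives 31, matching the missing ack in the first trace. A word with only bit 31 set gives 0 instead of 1, matching the second.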
Overall, concentrate on the NetDest::count() function or, more
importantly, the Set::count() function. Once you find the problem,
please let me know.
Thanks,
Brad
From: koreylsew...@gmail.com [mailto:koreylsew...@gmail.com] On
Behalf Of Korey Sewell
Sent: Friday, April 01, 2011 12:00 PM
To: Beckmann, Brad
Subject: Re: [m5-dev] Running Ruby w/32 Cores
Brad,
attached are the protocol traces grep'd for the offending addresses.
I'm going to spend the weekend digging through Ruby code so hopefully
I'm pretty close to generating the fixes myself.
In the first trace, fft.l2-1.trace, the offending address is 0x16b80;
in the second trace, the address is 0x1fc0.
Please let me know if anything catches your eye when you get some
time.
On Thu, Mar 31, 2011 at 5:01 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:
Yes, there should have been an attached patch. Here it is again.
Brad
-----Original Message-----
From: m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org]
On Behalf Of Korey Sewell
Sent: Thursday, March 31, 2011 1:10 PM
To: M5 Developer List
Subject: Re: [m5-dev] Running Ruby w/32 Cores
Is there an attached patch I should be running, or did it get bounced
by m5-dev? If so, can you send it directly to me rather than through
m5-dev?
On Wed, Mar 30, 2011 at 8:26 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:
> Hi Korey,
>
> For the first trace, it looks like the L2 cache is either
> miscounting the number of valid L1 copies, or there is an error with
> the ack arithmetic. We are going to need a bit more information to
> figure out where the exact problem is. Could you apply the attached
> patch and reply with the new protocol trace? Thanks.
>
> For the second trace, you should be able to get the offending
> address by simply attaching GDB to the aborted process. Without
> knowing which address to zero in on, it is the proverbial "finding a
> needle in a haystack".
>
> Thanks,
>
> Brad
>
>
>
>> -----Original Message-----
>> From: m5-dev-boun...@m5sim.org<mailto:m5-dev-boun...@m5sim.org>
>> [mailto:m5-dev-boun...@m5sim.org] On
>> Behalf Of Korey Sewell
>> Sent: Tuesday, March 29, 2011 10:15 AM
>> To: M5 Developer List
>> Subject: Re: [m5-dev] Running Ruby w/32 Cores
>>
>> Thanks for the response Brad.
>>
>> The 1st trace has 1 L2 and the 2nd has 1 L2 (i had a typo in the
>> original email).
>>
>> For each trace, I attach the stdout/stderr (*.out) and then the
>> protocol trace (*.prottrace).
>>
>> Also, in the 1st trace, the offending address is clear and I
>> isolate that in the protocol trace file provided. However, in the
>> 2nd trace, it's unclear (currently) which access caused it to fail,
>> so I took the whole protocol trace file and gzip'd it.
>>
>> My current lack of expertise in SLICC limits me a bit, but I'd
>> like to be more helpful in debugging, so if there is anything that I
>> can look into (or run) on my end to expedite the process, please
>> advise. In the interim, I'll try to locate the exact address that's
>> breaking trace 2 and then hopefully repost that.
>>
>> Thanks!
>>
>> -Korey
>>
>> On Tue, Mar 29, 2011 at 12:02 PM, Beckmann, Brad
>> <brad.beckm...@amd.com<mailto:brad.beckm...@amd.com>> wrote:
>> > Hi Korey,
>> >
>> > I believe both of these issues should be easy to solve once we
>> > have a protocol trace leading up to the error. If you could create
>> > such a trace and send it to the list, that would be great. Just
>> > zero in on the offending address.
>> >
>> > Thanks,
>> >
>> > Brad
>> >
>> >
>> >> -----Original Message-----
>> >> From:
>> >> m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org] On
>> >> Behalf Of Korey Sewell
>> >> Sent: Tuesday, March 29, 2011 8:11 AM
>> >> To: M5 Developer List
>> >> Subject: [m5-dev] Running Ruby w/32 Cores
>> >>
>> >> Hi All,
>> >> I'm still having a bit of trouble running Ruby with 32+ cores.
>> >> I am experimenting w/configs varying the l2-caches. The runs seem
>> >> to generate various errors in the SLICC.
>> >>
>> >> Has anybody seen these or have any insight to how to start
>> >> solving these types of issues (posted below)?
>> >> =========
>> >> The command line and errors are as follows:
>> >> (1) 32 Cores and 32 L2s
>> >> build/ALPHA_FS_MOESI_CMP_directory/m5.opt
>> >> configs/example/ruby_fs.py -b FftBase32 -n 32 --num-dirs=32
>> >> --num-l2caches=32 ...
>> >> info: Entering event queue @ 0. Starting simulation...
>> >> Runtime Error at MOESI_CMP_directory-dir.sm:155, Ruby Time:
>> >> 38279: assert failure, PID: 5990
>> >> press return to continue.
>> >>
>> >> Program aborted at cycle 19139500
>> >> Aborted
>> >>
>> >> (2) 32 Cores and 1 L2
>> >> build/ALPHA_FS_MOESI_CMP_directory/m5.opt
>> >> configs/example/ruby_fs.py -b FftBase32 -n 32 --num-dirs=32
>> >> --num-l2caches=32 ...
>> >> fatal: Invalid transition
>> >> system.l1_cntrl0 time: 349075 addr: [0x16180, line 0x16180]
>> >> event: Ack state: MM @ cycle 174537500
>> >>
>>
>> >> [doTransitionWorker:build/ALPHA_FS_MOESI_CMP_directory/mem/protocol/L1Cache_Transitions.cc,
>> >> line 477]
>> >> Memory Usage: 2316756 KBytes
>> >> For more information see: http://www.m5sim.org/fatal/23f196b2
>> >> ========
>> >>
>> >> Please let me know if you do...Thanks!
>> >>
>> >> --
>> >> - Korey
>> >> _______________________________________________
>> >> m5-dev mailing list
>> >> m5-dev@m5sim.org<mailto:m5-dev@m5sim.org>
>> >> http://m5sim.org/mailman/listinfo/m5-dev
>> >
>> >
>> > _______________________________________________
>> > m5-dev mailing list
>> > m5-dev@m5sim.org<mailto:m5-dev@m5sim.org>
>> > http://m5sim.org/mailman/listinfo/m5-dev
>> >
>>
>>
>>
>> --
>> - Korey
>
> _______________________________________________
> m5-dev mailing list
> m5-dev@m5sim.org<mailto:m5-dev@m5sim.org>
> http://m5sim.org/mailman/listinfo/m5-dev
>
>
--
- Korey
--
- Korey