Hi Korey, Yes, let's move this conversation back to m5-dev, since I think others may be interested and could help.
I don't know exactly what the problem is, but at some point (probably back in the early GEMS days) I seem to remember the Set code included an assertion check about the 31st bit in 32-bit mode. So I think we knew about this problem and made sure it never happened. I believe that is why we used to have a restriction that Ruby could only support 16 processors. I'm really fuzzy on the details...maybe someone else can elaborate.

In the end, I just want to make sure we add something to the code that guarantees we don't hit this problem again. This is one of those bugs that can take a while to track down if you don't catch it with an assertion right when it happens.

Brad

From: koreylsew...@gmail.com [mailto:koreylsew...@gmail.com] On Behalf Of Korey Sewell
Sent: Tuesday, April 05, 2011 7:14 AM
To: Beckmann, Brad
Subject: Re: [m5-dev] Running Ruby w/32 Cores

Hi again Brad,

I looked this over again, and although my 32-bit patch "fixes" things, I'm not convinced that I actually fixed the cause of the bug rather than just its symptom. Do you happen to know what the problems are with the 32-bit Set counts?

Sorry for prolonging the issue; I thought I had put this to bed, but maybe not. Finally, it may not matter whether this works on 32-bit machines, but it'd be nice if it did.
(Let me know if I should move this convo to the m5-dev list.)

I ended up checking the last bit in the count function manually. The code is as follows:

int Set::count() const
{
    int counter = 0;
    long mask;

    for (int i = 0; i < m_nArrayLen; i++) {
        mask = (long)0x01;

        for (int j = 0; j < LONG_BITS; j++) {
            // FIXME - significant performance loss when array
            // population << LONG_BITS
            if ((m_p_nArray[i] & mask) != 0) {
                counter++;
            }
            mask = mask << 1;
        }

#ifndef _LP64
        long msb_mask = 0x80000000;
        if ((m_p_nArray[i] & msb_mask) != 0) {
            counter++;
        }
#endif
    }

    return counter;
}

On Tue, Apr 5, 2011 at 1:30 AM, Korey Sewell <ksew...@umich.edu> wrote:

Brad, it looks like you were right on the money here. I found the spot where it was returning the wrong value, via a SLICC function to "count sharers for everyone except the owner".

I realized that the machine I use for testing is just a 32-bit machine, and as you warned, there look to be issues with the Set type there. I ran the 32-core Fft on a 64-bit machine and it seems to work correctly. I'll be running the full splash/parsec suites soon and that should stress Ruby a good bit :).

I have a patch that checks whether _LP64 is defined and, if not, checks that last bit in the Set count function.

Thanks for being helpful in debugging. It was a "relatively" easy bug, but as always, going through code and becoming more proficient at getting around while trying to solve a bug is really helpful.

On Fri, Apr 1, 2011 at 7:28 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:

Ok, for the first trace, the critical line is the following:

348523 0 L2Cache L1_GETX ILOSX>IFLXO [0x16180, line 0x16180] [NetDest (4) 0 - 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 - 0 0 - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - ]30

L2Cache identifies that 31 caches have a shared copy and that L1 cache 9 (L1-9) is the owner.
When L1Cache 0 (L1-0) issues a GETX, the L2Cache issues 30 Inv probes, forwards the GETX to L1-9, and sends an ack to L1-0 itself. However, the L2 cache tells L1-0 to expect only 30 acks instead of 31. It could be something wrong with the NetDest::count() function, or the Set::count() function? I slightly modified my previous patch to isolate what value the NetDest::count() function is returning. If it is returning 30 instead of 31, then it must be a problem with NetDest. You are compiling gem5 as a 64-bit binary, right?

The second problem is essentially the same issue. L2Cache 31 (L2-31) is the owner of the block, but I suspect NetDest is not counting bit 31 and thus it is returning a count of 0...causing the error.

Overall, concentrate on that NetDest::count() function, or more importantly the Set::count() function. Once you find the problem, please let me know.

Thanks,

Brad

From: koreylsew...@gmail.com [mailto:koreylsew...@gmail.com] On Behalf Of Korey Sewell
Sent: Friday, April 01, 2011 12:00 PM
To: Beckmann, Brad
Subject: Re: [m5-dev] Running Ruby w/32 Cores

Brad,

attached are the protocol traces grep'd for the offending addresses. I'm going to spend the weekend digging through Ruby code, so hopefully I'm pretty close to generating the fixes myself.

For the first trace, fft.l2-1.trace, the offending address is 0x16b80, and for the second trace the address is 0x1fc0. Please let me know if anything catches your eye when you get some time.

On Thu, Mar 31, 2011 at 5:01 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:

Yes, there should have been an attached patch. Here it is again.
Brad

> -----Original Message-----
> From: m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org]
> On Behalf Of Korey Sewell
> Sent: Thursday, March 31, 2011 1:10 PM
> To: M5 Developer List
> Subject: Re: [m5-dev] Running Ruby w/32 Cores
>
> Is there an attached patch I should be running or did it get bounced by
> m5-dev? If so, can you send it directly to me rather than through m5-dev?
>
> On Wed, Mar 30, 2011 at 8:26 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:
> > Hi Korey,
> >
> > For the first trace, it looks like the L2 cache is either miscounting the
> > number of valid L1 copies, or there is an error with the ack arithmetic. We
> > are going to need a bit more information to figure out where the exact
> > problem is. Could you apply the attached patch and reply with the new
> > protocol trace? Thanks.
> >
> > For the second trace, you should be able to get the offending address by
> > simply attaching GDB to the aborted process. Without knowing which
> > address to zero in on, it is the proverbial "finding a needle in a haystack".
> >
> > Thanks,
> >
> > Brad
> >
> >> -----Original Message-----
> >> From: m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org] On
> >> Behalf Of Korey Sewell
> >> Sent: Tuesday, March 29, 2011 10:15 AM
> >> To: M5 Developer List
> >> Subject: Re: [m5-dev] Running Ruby w/32 Cores
> >>
> >> Thanks for the response Brad.
> >>
> >> The 1st trace has 1 L2 and the 2nd has 1 L2 (i had a typo in the original
> >> email).
> >>
> >> For each trace, I attach the stdout/stderr (*.out) and then the
> >> protocol trace (*.prottrace).
> >>
> >> Also, in the 1st trace, the offending address is clear and I isolate
> >> that in the protocol trace file provided.
> >> However, in the 2nd trace, it's unclear (currently) which access caused
> >> it to fail so I took the whole protocol trace file and gzip'd it.
> >>
> >> My current lack of expertise in SLICC limits me a bit, but I'd like
> >> to be more helpful in debugging so if there is anything that I can
> >> look into (or run) on my end to expedite the process, please advise.
> >> In the interim, I'll try to locate the exact address that's breaking
> >> trace 2 and then hopefully repost that.
> >>
> >> Thanks!
> >>
> >> -Korey
> >>
> >> On Tue, Mar 29, 2011 at 12:02 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:
> >> > Hi Korey,
> >> >
> >> > I believe both of these issues should be easy to solve once we have
> >> > a protocol trace leading up to the error. If you could create such a
> >> > trace and send it to the list, that would be great. Just zero in on the
> >> > offending address.
> >> >
> >> > Thanks,
> >> >
> >> > Brad
> >> >
> >> >> -----Original Message-----
> >> >> From: m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org] On
> >> >> Behalf Of Korey Sewell
> >> >> Sent: Tuesday, March 29, 2011 8:11 AM
> >> >> To: M5 Developer List
> >> >> Subject: [m5-dev] Running Ruby w/32 Cores
> >> >>
> >> >> Hi All,
> >> >> I'm still having a bit of trouble running Ruby with 32+ cores. I
> >> >> am experimenting w/configs varying the l2-caches. The runs seem
> >> >> to generate various errors in the SLICC.
> >> >>
> >> >> Has anybody seen these or have any insight into how to start solving
> >> >> these types of issues (posted below)?
> >> >> =========
> >> >> The command line and errors are as follows:
> >> >> (1) 32 Cores and 32 L2s
> >> >> build/ALPHA_FS_MOESI_CMP_directory/m5.opt
> >> >> configs/example/ruby_fs.py -b FftBase32 -n 32 --num-dirs=32
> >> >> --num-l2caches=32 ...
> >> >> info: Entering event queue @ 0.
> >> >> Starting simulation...
> >> >> Runtime Error at MOESI_CMP_directory-dir.sm:155, Ruby Time: 38279:
> >> >> assert failure, PID: 5990
> >> >> press return to continue.
> >> >>
> >> >> Program aborted at cycle 19139500
> >> >> Aborted
> >> >>
> >> >> (2) 32 Cores and 1 L2
> >> >> build/ALPHA_FS_MOESI_CMP_directory/m5.opt
> >> >> configs/example/ruby_fs.py -b FftBase32 -n 32 --num-dirs=32
> >> >> --num-l2caches=32 ...
> >> >> fatal: Invalid transition
> >> >> system.l1_cntrl0 time: 349075 addr: [0x16180, line 0x16180] event: Ack
> >> >> state: MM @ cycle 174537500
> >> >> [doTransitionWorker:build/ALPHA_FS_MOESI_CMP_directory/mem/protocol/L1Cache_Transitions.cc,
> >> >> line 477]
> >> >> Memory Usage: 2316756 KBytes
> >> >> For more information see: http://www.m5sim.org/fatal/23f196b2
> >> >> ========
> >> >>
> >> >> Please let me know if you do...Thanks!
> >> >>
> >> >> --
> >> >> - Korey
> >> >> _______________________________________________
> >> >> m5-dev mailing list
> >> >> m5-dev@m5sim.org
> >> >> http://m5sim.org/mailman/listinfo/m5-dev
> >>
> >> --
> >> - Korey
>
> --
> - Korey

--
- Korey
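
As a follow-up to the Set::count() discussion in this thread, here is a minimal C++ sketch of the suspected failure mode. It is not the gem5/Ruby Set code: Word, kWordBits, countLowBits, countBits, countBitsSkippingMsb and sharersExcept are hypothetical names made up for illustration, and the idea that the faulty count scans every bit except the most significant one is only what the thread suspects, not something confirmed here. The sketch uses a fixed-width unsigned word so that the top bit is an ordinary data bit regardless of whether the build is 32-bit or 64-bit.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for the Set storage: one bit per node, packed into
// fixed-width unsigned words (an assumption, not the actual Ruby layout).
using Word = std::uint32_t;
constexpr int kWordBits = std::numeric_limits<Word>::digits;
static_assert(kWordBits == 32, "sketch assumes 32 data bits per word");

// Count set bits in the lowest `bitsScanned` positions of every word.
int countLowBits(const std::vector<Word>& words, int bitsScanned)
{
    assert(bitsScanned >= 0 && bitsScanned <= kWordBits);
    int counter = 0;
    for (Word w : words) {
        for (int j = 0; j < bitsScanned; j++) {
            if ((w >> j) & Word{1})
                counter++;
        }
    }
    return counter;
}

// Full count: visits every bit position, including the most significant one.
int countBits(const std::vector<Word>& words)
{
    return countLowBits(words, kWordBits);
}

// Model of the suspected bug: the top bit of each word is never examined.
int countBitsSkippingMsb(const std::vector<Word>& words)
{
    return countLowBits(words, kWordBits - 1);
}

// Build a sharer vector with nodes 0..numNodes-1 set, except `owner`.
std::vector<Word> sharersExcept(int numNodes, int owner)
{
    std::vector<Word> words((numNodes + kWordBits - 1) / kWordBits, 0);
    for (int n = 0; n < numNodes; n++) {
        if (n != owner)
            words[n / kWordBits] |= Word{1} << (n % kWordBits);
    }
    return words;
}
```

Under these assumptions, sharersExcept(32, 9) mirrors the first trace (32 L1s, owner at L1-9): countBits returns 31 while countBitsSkippingMsb returns 30, the same off-by-one as the ack count. A vector with only node 31 set counts as 1 versus 0, which would match the second failure. An assert comparing a count against an independent full-width count is one way to catch this right when it happens.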