Hi Nilay,

On the deadlock issue with MESI_CMP_directory:
Yes, this can happen, as ruby_tester and the Sequencer only report *possible* deadlocks. With a higher number of processors there is more contention (and thus higher latency), and the check can mistakenly report a deadlock. I generally look at the protocol trace to figure out whether there is actually any deadlock or not. You can also try doubling the Sequencer deadlock threshold and see if the problem goes away. If it's a true deadlock, it will break again.
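For reference, here is a minimal sketch of how the threshold could be doubled from a config script. It assumes the RubySequencer parameter is named deadlock_threshold (see src/mem/ruby/system/Sequencer.py), with what I believe is a 500000-cycle default; note that the difference in the panic below (506240) only just crosses that presumed default, which is consistent with contention rather than a true deadlock. The cache parameter names here are placeholders.

    # Minimal sketch, assuming a 'deadlock_threshold' parameter on
    # RubySequencer; 'i', 'l1i_cache', and 'l1d_cache' are placeholder
    # names from the surrounding config loop.
    seq = RubySequencer(version = i,
                        icache = l1i_cache,
                        dcache = l1d_cache,
                        deadlock_threshold = 1000000)  # 2x the 500000 default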

On a related note, as Brad has pointed out, MESI_CMP_directory has its share of issues. Recently one of Prof. Sarita Adve's students e-mailed us (Multifacet) about six bugs he found while model checking MESI_CMP_directory, including a major one. I took some time to look at them, and it seems MESI_CMP_directory is now fixed (hopefully). The modified protocol is now passing 1M checks with 16 processors and multiple random seeds. I can coordinate with you locally on this, if you want.

Thanks
Arka

On 01/04/2011 11:43 AM, Nilay Vaish wrote:

On 2011-01-03 15:31:20, Brad Beckmann wrote:
Hi Nilay,
First, I must say this is an impressive amount of work.  You definitely got a 
lot done over holiday break. :)

Overall, this set of patches is definitely close, but I want to see if we can 
take them a step forward.  Also I have a few suggestions that may make things 
easier.  Finally, I have a bunch of minor questions/suggestions on individual 
lines, but I’ll hold off on those until you can respond to my higher-level 
questions.

The main thing I would like to see improved is not having to differentiate 
between “entry” and “entry_ptr” in the .sm files.  Am I correct that the only 
functions in the .sm files that are passed an “entry_ptr” are “is_valid_ptr”, 
“getCacheEntry”, and “set_cache_entry”?  If so, it seems that all three 
functions are generated with unique python code, either in an AST file or 
StateMachine.py.  Therefore, could we just pass these functions “entry” and 
rely on the underlying python code to generate the correct references?  This 
would make things more readable, “is_valid_ptr()” becomes “is_valid”, and it 
doesn’t require the slicc programmer to understand which functions take an 
entry pointer versus the entry itself.  If we can’t make such a change, I worry 
about how much extra complexity this change pushes on the slicc programmer.

Also another suggestion to make things more readable, please replace the name 
L1IcacheMemory_entry with L1I_entry.  Do the same for L1D_entry and L2_entry.  
That will shorten many of your lines.

So am I correct that hammer’s simultaneous usage of valid L1 and L2 cache 
entries in certain transitions is the only reason that within all actions, the 
getCacheEntry calls take multiple cache entries?  If so, I think it would be 
fairly trivial to use a tbe entry as an intermediary between the L1 and L2 for 
those particular hammer transitions.  That way only one cache entry is valid at 
any particular time, and we can simply use the variable cache_entry in the 
actions.  That should clean things up a lot.

By the way, once you check in these patches, the MESI_CMP_directory protocol 
will be deprecated, correct?  If so, make sure you include a patch that removes 
it from the regression tester.

Brad

The main thing I would like to see improved is not having to differentiate
between “entry” and “entry_ptr” in the .sm files.  Am I correct
that the only functions in the .sm files that are passed an
“entry_ptr” are “is_valid_ptr”, “getCacheEntry”, and
“set_cache_entry”?  If so, it seems that all three functions are
generated with unique python code, either in an AST file or
StateMachine.py.  Therefore, could we just pass these functions
“entry” and rely on the underlying python code to generate the correct
references?  This would make things more readable, “is_valid_ptr()”
becomes “is_valid”, and it doesn’t require the slicc programmer to
understand which functions take an entry pointer versus the entry itself.
If we can’t make such a change, I worry about how much extra complexity
this change pushes on the slicc programmer.
There are functions that are passed a cache entry or a transaction buffer entry as 
an argument. Currently, I assume that these arguments are passed as pointers.
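To make the suggestion concrete, here is a toy sketch (plain Python, not actual slicc code) of the idea: the generator decides per parameter type whether the generated C++ takes a pointer, and dereferences it behind the scenes, so the .sm writer only ever writes "entry". The set of entry-like types is illustrative.

    # Toy sketch: entry-like types are passed as pointers in the generated
    # C++ and referenced through a dereferenced alias, invisibly to the
    # .sm programmer.
    ENTRY_TYPES = {"Entry", "TBE"}  # illustrative, not the real slicc check

    def gen_formal_param(type_name, ident):
        param = "param_%s" % ident
        if type_name in ENTRY_TYPES:
            # (C++ declaration, alias to use inside the function body)
            return "%s* %s" % (type_name, param), "(*%s)" % param
        return "%s %s" % (type_name, param), param

    print(gen_formal_param("Entry", "cache_entry"))
    # -> ('Entry* param_cache_entry', '(*param_cache_entry)')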

Also another suggestion to make things more readable, please replace the
name L1IcacheMemory_entry with L1I_entry.  Do the same for L1D_entry and
L2_entry.  That will shorten many of your lines.
The names of the cache entry variables are currently tied to the names of the 
cache memory variables belonging to the machine. If the name of the cache 
memory variable is A, then the corresponding cache entry variable is named 
A_entry.
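As a toy illustration of the current rule, and of the extra mapping Brad's shorter names would require (the alias table below is hypothetical):

    # Current rule: the entry variable name is derived mechanically from
    # the cache memory variable name.
    def entry_var_name(cache_var):
        return "%s_entry" % cache_var

    assert entry_var_name("L1IcacheMemory") == "L1IcacheMemory_entry"

    # Brad's shorter names would need an explicit alias map (hypothetical):
    SHORT_NAMES = {"L1IcacheMemory": "L1I",
                   "L1DcacheMemory": "L1D",
                   "L2cacheMemory":  "L2"}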

So am I correct that hammer’s simultaneous usage of valid L1 and L2
cache entries in certain transitions is the only reason that within all
actions, the getCacheEntry calls take multiple cache entries?  If so, I
think it would be fairly trivial to use a tbe entry as an intermediary
between the L1 and L2 for those particular hammer transitions.  That way
only one cache entry is valid at any particular time, and we can simply
use the variable cache_entry in the actions.  That should clean things up
a lot.
Oops! I should have thought of that before making all those changes. But can we 
assume that we will always have only one valid cache entry pointer at any 
given time? If that's true, I would probably revert to the previous version of the 
patch. This should also resolve the naming issue.
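A toy sketch (Python rather than SLICC) of the invariant being discussed: during an L2-to-L1 move the block sits in a TBE, so at most one cache entry is valid at any instant.

    # Toy model: caches and the TBE table as dicts keyed by address.
    l1, l2, tbes = {}, {}, {}

    def move_l2_to_l1(addr):
        tbes[addr] = l2[addr]    # stash the block in a TBE...
        del l2[addr]             # ...and deallocate the L2 entry
        assert addr not in l1 and addr not in l2   # no valid cache entry
        l1[addr] = tbes[addr]    # allocate the L1 entry from the TBE
        del tbes[addr]           # free the TBE

    l2[0x40] = {"state": "M", "data": 0xdead}
    move_l2_to_l1(0x40)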

By the way, once you check in these patches, the MESI_CMP_directory
protocol will be deprecated, correct?  If so, make sure you include a
patch that removes it from the regression tester.
I have a patch for the protocol, but I need to discuss it. Do you think it is 
possible that a protocol is not actually deadlocked, but the random tester reports 
a deadlock because the underlying memory system is taking too much time? The 
patch works for 1, 2, and 4 processors for 10,000,000 loads. I have tested these 
processor configurations with 40 different seed values. But for 8 processors, the 
random tester outputs something like this --

panic: Possible Deadlock detected. Aborting!
version: 6 request.paddr: 12779 m_writeRequestTable: 15 current time: 369500011 issue_time: 368993771 difference: 506240
  @ cycle 369500011
[wakeup:build/ALPHA_SE_MESI_CMP_directory/mem/ruby/system/Sequencer.cc, line 123]


- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/358/#review596
-----------------------------------------------------------


On 2011-01-03 14:18:13, Nilay Vaish wrote:
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/358/
-----------------------------------------------------------

(Updated 2011-01-03 14:18:13)


Review request for Default.


Summary
-------

The purpose of this patch is to change the way CacheMemory interfaces with 
coherence protocols. Currently, whenever a cache controller (defined in the 
protocol under consideration) needs to carry out any operation on a cache 
block, it looks up the tag hash map and figures out whether or not the block 
exists in the cache. If the block does exist, the operation is carried out 
(which requires another lookup). Profiling of different protocols shows that 
multiple such lookups take place over a single clock cycle, and that the tag 
lookup takes anywhere from 10% to 20% of the simulation time. This patch is 
being posted in order to reduce this time. The CacheMemory class will now 
have a function that returns a pointer to a cache block entry, instead of a 
reference (though the function that returns the reference has been retained 
for now). Functions have been introduced for setting/unsetting this pointer 
and checking its validity. Similar changes have been carried out for 
Transaction Buffer entries as well.
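To make the intent concrete, here is a toy Python model (not the actual C++): the tag map is probed once per transition and the returned handle is reused for every subsequent operation.

    # Toy model: lookup() probes the tag map once and returns a handle
    # (or None); later operations go through the cached handle.
    class CacheMemory:
        def __init__(self):
            self.tag_map = {}
        def lookup(self, addr):
            return self.tag_map.get(addr)   # single hash lookup

    cache = CacheMemory()
    cache.tag_map[0x100] = {"state": "S", "data": 0}
    entry = cache.lookup(0x100)   # probe the tag map once...
    if entry is not None:         # ...the is_valid check on the handle
        entry["state"] = "M"      # every subsequent operation reuses the
        entry["data"] = 42        # handle; no further tag-map lookups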

Note that changes are required to both SLICC and the protocol files. This patch 
carries out the changes to SLICC, and committing it alone, I believe, will 
___break___ all the protocols. I am working on patching the protocols as well. 
This patch is being posted to get feedback from other developers.


Diffs
-----

   src/mem/protocol/RubySlicc_Types.sm UNKNOWN
   src/mem/ruby/slicc_interface/AbstractCacheEntry.hh UNKNOWN
   src/mem/ruby/slicc_interface/AbstractCacheEntry.cc UNKNOWN
   src/mem/ruby/system/CacheMemory.hh UNKNOWN
   src/mem/ruby/system/CacheMemory.cc UNKNOWN
   src/mem/ruby/system/TBETable.hh UNKNOWN
   src/mem/slicc/ast/ActionDeclAST.py UNKNOWN
   src/mem/slicc/ast/FormalParamAST.py UNKNOWN
   src/mem/slicc/ast/FuncCallExprAST.py UNKNOWN
   src/mem/slicc/ast/InPortDeclAST.py UNKNOWN
   src/mem/slicc/ast/MethodCallExprAST.py UNKNOWN
   src/mem/slicc/ast/TypeDeclAST.py UNKNOWN
   src/mem/slicc/ast/__init__.py UNKNOWN
   src/mem/slicc/parser.py UNKNOWN
   src/mem/slicc/symbols/StateMachine.py UNKNOWN

Diff: http://reviews.m5sim.org/r/358/diff


Testing
-------

I have tested these changes using the MOESI_CMP_directory and MI protocols.


Thanks,

Nilay




_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev