Ok, will do. That should be easy enough.

Gabe

On 12/13/10 16:47, Ali Saidi wrote:
> I've got a patch that gets closer to supporting caches between the TLB and L2
> cache, but it doesn't work yet. Since we don't have a way to invalidate
> addresses out of the cache when a memory region is switched to uncacheable,
> the address remains in the cache and causes all sorts of havoc. If you want to
> make some changes, please make them x86-only for now. We'll need to implement
> some form of cache cleaning or invalidation before this can be supported on ARM.
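>
> To make that concrete, here is a minimal toy sketch (plain Python, not the
> real M5 cache classes; every name here is illustrative) of the region
> invalidation the ARM case would need when memory attributes change:
>
>     BLOCK_SIZE = 64
>
>     class ToyCache:
>         def __init__(self):
>             self.blocks = {}  # block-aligned address -> data
>
>         def invalidate_region(self, base, size):
>             """Drop every cached block overlapping [base, base + size)."""
>             addr = base - (base % BLOCK_SIZE)    # align down to a block
>             while addr < base + size:
>                 self.blocks.pop(addr, None)      # dirty writeback omitted
>                 addr += BLOCK_SIZE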
>
> Thanks,
> Ali
>
> On Dec 13, 2010, at 4:56 AM, Gabe Black wrote:
>
>> I finally got around to trying this out (patch attached) and it seemed
>> to fix x86. This change seems to break ARM_FS, though. It faults when it
>> tries to execute code at the fault vector because the page table entry
>> is apparently marked no-execute (I think). That makes the timing CPU
>> spin around and around: it keeps getting a fault, invoking it, and
>> attempting to fetch again. The call stack never hits a point where it
>> has to wait for an event, so it never collapses back down; it recurses
>> until the stack is too big and M5 segfaults. The atomic CPU seems to
>> just get lost, and I'm not entirely sure what's going on there, although
>> I suspect the atomic CPU is structured differently and doesn't recurse
>> infinitely.
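>>
>> For illustration, a toy Python sketch (not M5 code; all names invented)
>> of that failure mode -- nothing on the timing fetch path yields to the
>> event queue, so a persistent fault grows the call stack without bound:
>>
>>     FAULT_VECTOR = 0xFFFF0000
>>
>>     def translate(pc):
>>         # Stand-in for the walker: every fetch faults, as happens when
>>         # the vector page is wrongly marked no-execute.
>>         return "prefetch abort"
>>
>>     def fetch(pc, depth=0):
>>         if depth > 50:  # guard so the sketch terminates
>>             print("stack keeps growing; in M5 this ends in a segfault")
>>             return
>>         if translate(pc):
>>             # Invoking the fault redirects to the vector and retries the
>>             # fetch immediately; no event boundary, so we just recurse.
>>             fetch(FAULT_VECTOR, depth + 1)
>>
>>     fetch(0x8000)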
>>
>> I wanted to ask the ARM folks if they knew what was going on here. Is
>> something about the page table walk supposed to be uncached but isn't?
>> This seems to work without that cache added in, so I suspect the walker
>> is picking up stale data or something.
>>
>> Gabe
>>
>> Gabe Black wrote:
>>> Of these, I think the walker cache sounds better for two reasons. First,
>>> it avoids the L1 pollution Ali was talking about, and second, a new bus
>>> would add mostly inert stuff on the way to memory and would involve
>>> looking up which port to use even though it'd always be the same one.
>>> I'll give that a try.
>>>
>>> Gabe
>>>
>>> Steve Reinhardt wrote:
>>>
>>>> I think the two easy (python-only) solutions are sharing the existing
>>>> L1 via a bus and tacking on a small L1 to the walker.  Which one is
>>>> more realistic would depend on what you're trying to model.
>>>>
>>>> Steve
>>>>
>>>> On Tue, Nov 23, 2010 at 8:23 AM, Ali Saidi <sa...@umich.edu
>>>> <mailto:sa...@umich.edu>> wrote:
>>>>
>>>>    So what is the relatively good way to make this work in the short
>>>>    term? A bus? What about the slightly better version? I suppose a
>>>>    small cache might be ok and probably somewhat realistic.
>>>>
>>>>
>>>>
>>>>    Thanks,
>>>>
>>>>    Ali
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>    On Tue, 23 Nov 2010 08:15:01 -0800, Steve Reinhardt
>>>>    <ste...@gmail.com <mailto:ste...@gmail.com>> wrote:
>>>>
>>>>
>>>>>    And even though I do think it could be made to work, I'm not sure
>>>>>    it would be easy or a good idea.  There are a lot of corner cases
>>>>>    to worry about, especially for writes, since you'd have to
>>>>>    actually buffer the write data somewhere as opposed to just
>>>>>    remembering that so-and-so has requested an exclusive copy.
>>>>>
>>>>>    Actually as I think about it, that might be the case that's
>>>>>    breaking now... if the L1 has an exclusive copy and then it
>>>>>    snoops a write (and not a read-exclusive), I'm guessing it will
>>>>>    just invalidate its copy, losing the modifications.  I wouldn't
>>>>>    be terribly surprised if reads are working OK (the L1 should
>>>>>    snoop those and respond if it's the owner), and of course it's
>>>>>    all OK if the L1 doesn't have a copy of the block.
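>>>>>
>>>>>    To make that corner case concrete, here's a toy snoop handler
>>>>>    (plain Python, purely illustrative -- not M5's real cache code):
>>>>>    on a snooped plain write, an owning cache has to write back or
>>>>>    merge its dirty copy before invalidating, or the data is lost.
>>>>>
>>>>>        class Blk:
>>>>>            def __init__(self, data, dirty=False):
>>>>>                self.data, self.dirty = data, dirty
>>>>>
>>>>>        def snoop(blocks, kind, addr, respond, writeback):
>>>>>            blk = blocks.get(addr)
>>>>>            if blk is None:
>>>>>                return                     # no copy: nothing to do
>>>>>            if kind == "read" and blk.dirty:
>>>>>                respond(blk.data)          # owner supplies the data
>>>>>            elif kind == "write":
>>>>>                # The suspected bug: invalidating *without* this
>>>>>                # writeback silently loses the modifications.
>>>>>                if blk.dirty:
>>>>>                    writeback(addr, blk.data)
>>>>>                del blocks[addr]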
>>>>>
>>>>>    So maybe there is a relatively easy way to make this work, but
>>>>>    figuring out whether that's true and then testing it is still a
>>>>>    non-trivial amount of effort.
>>>>>
>>>>>    Steve
>>>>>
>>>>>    On Tue, Nov 23, 2010 at 7:57 AM, Steve Reinhardt
>>>>>    <ste...@gmail.com <mailto:ste...@gmail.com>> wrote:
>>>>>
>>>>>        No, when the L2 receives a request it assumes the L1s above
>>>>>        it have already been snooped, which is true since the request
>>>>>        came in on the bus that the L1s snoop.  The issue is that
>>>>>        caches don't necessarily behave correctly when
>>>>>        non-cache-block requests come in through their mem-side
>>>>>        (snoop) port and not through their cpu-side (request) port. 
>>>>>        I'm guessing this could be made to work, I'd just be very
>>>>>        surprised if it does right now, since the caches weren't
>>>>>        designed to deal with this case and aren't tested this way.
>>>>>
>>>>>        Steve
>>>>>
>>>>>
>>>>>        On Tue, Nov 23, 2010 at 7:50 AM, Ali Saidi <sa...@umich.edu
>>>>>        <mailto:sa...@umich.edu>> wrote:
>>>>>
>>>>>            Does it? Shouldn't the l2 receive the request, ask for
>>>>>            the block and end up snooping the l1s?
>>>>>
>>>>>
>>>>>
>>>>>            Ali
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            On Tue, 23 Nov 2010 07:30:00 -0800, Steve Reinhardt
>>>>>            <ste...@gmail.com <mailto:ste...@gmail.com>> wrote:
>>>>>
>>>>>                The point is that connecting between the L1 and L2
>>>>>                induces the same problems wrt the L1 that connecting
>>>>>                directly to memory induces wrt the whole cache
>>>>>                hierarchy.  You're just statistically more likely to
>>>>>                get away with it in the former case because the L1 is
>>>>>                smaller.
>>>>>
>>>>>                Steve
>>>>>
>>>>>                On Tue, Nov 23, 2010 at 7:16 AM, Ali Saidi
>>>>>                <sa...@umich.edu <mailto:sa...@umich.edu>> wrote:
>>>>>
>>>>>
>>>>>                    Where are you connecting the table walker? If
>>>>>                    it's between the l1 and l2, my guess is that it
>>>>>                    will work. If it is to the memory bus, yes,
>>>>>                    memory is just responding without the help of a
>>>>>                    cache, and this could be the reason.
>>>>>
>>>>>                    Ali
>>>>>
>>>>>
>>>>>
>>>>>                    On Tue, 23 Nov 2010 06:29:20 -0500, Gabe Black
>>>>>                    <gbl...@eecs.umich.edu
>>>>>                    <mailto:gbl...@eecs.umich.edu>> wrote:
>>>>>
>>>>>                        I think I may have just now. I've fixed a few
>>>>>                        issues, and am now getting
>>>>>                        to the point where something that should be
>>>>>                        in the pagetables is causing
>>>>>                        a page fault. I found where the table walker
>>>>>                        is walking the tables for
>>>>>                        this particular access, and the last level
>>>>>                        entry is all 0s. There could
>>>>>                        be a number of reasons this is all 0s, but
>>>>>                        since the main difference
>>>>>                        other than timing between this and a working
>>>>>                        configuration is the
>>>>>                        presence of caches and we've identified a
>>>>>                        potential issue there, I'm
>>>>>                        inclined to suspect the actual page table
>>>>>                        entry is still in the L1 and
>>>>>                        hasn't been evicted out to memory yet.
>>>>>
>>>>>                        To fix this, is the best solution to add a
>>>>>                        bus below the CPU for all the
>>>>>                        connections that need to go to the L1? I'm
>>>>>                        assuming they'd all go into
>>>>>                        the dcache since they're more data-ey and
>>>>>                        that keeps the icache read
>>>>>                        only (ignoring SMC issues), and the dcache is
>>>>>                        probably servicing lower
>>>>>                        bandwidth normally. It also seems a little
>>>>>                        strange that this type of
>>>>>                        configuration is going on in the BaseCPU.py
>>>>>                        SimObject python file and
>>>>>                        not a configuration file, but I could be
>>>>>                        convinced there's a reason.
>>>>>                        Even if this isn't really a "fix" or the
>>>>>                        "right thing" to do, I'd still
>>>>>                        like to try it temporarily at least to see if
>>>>>                        it corrects the problem
>>>>>                        I'm seeing.
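>>>>>
>>>>>                        As a sketch of what I mean (hypothetical and
>>>>>                        untested, in the style of the BaseCPU.py
>>>>>                        config code; the dcache_port name is an
>>>>>                        assumption):
>>>>>
>>>>>                            # Bus below the CPU feeding the dcache.
>>>>>                            cpu.l1_bus = Bus()
>>>>>                            cpu.dcache_port = cpu.l1_bus.port
>>>>>                            cpu.itb.walker.port = cpu.l1_bus.port
>>>>>                            cpu.dtb.walker.port = cpu.l1_bus.port
>>>>>                            cpu.l1_bus.port = cpu.dcache.cpu_side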
>>>>>
>>>>>                        Gabe
>>>>>
>>>>>                        Ali Saidi wrote:
>>>>>
>>>>>
>>>>>                            I haven't seen any strange behavior yet.
>>>>>                            That isn't to say it's not
>>>>>                            going to cause an issue in the future,
>>>>>                            but we've taken many a tlb miss
>>>>>                            and it hasn't fallen over yet.
>>>>>
>>>>>                            Ali
>>>>>
>>>>>                            On Mon, 22 Nov 2010 13:08:13 -0800, Steve
>>>>>                            Reinhardt <ste...@gmail.com
>>>>>                            <mailto:ste...@gmail.com>>
>>>>>                            wrote:
>>>>>
>>>>>                                Yea, I just got around to reading
>>>>>                                this thread and that was the point
>>>>>                                I was going to make... the L1 cache
>>>>>                                effectively serves as a
>>>>>                                translator between the CPU's
>>>>>                                word-size read & write requests and the
>>>>>                                coherent block-level requests that
>>>>>                                get snooped.  If you attach a
>>>>>                                CPU-like device (such as the table
>>>>>                                walker) directly to an L2, the
>>>>>                                CPU-like accesses that go to the L2
>>>>>                                will get sent to the L1s but I'm
>>>>>                                not sure they'll be handled
>>>>>                                correctly.  Not that they fundamentally
>>>>>                                couldn't, this just isn't a
>>>>>                                configuration we test so it's likely that
>>>>>                                there are problems... for example,
>>>>>                                the L1 may try to hand ownership
>>>>>                                to the requester but the requester
>>>>>                                won't recognize that and things
>>>>>                                will break.
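>>>>>
>>>>>                                As a toy illustration of that
>>>>>                                translator role (not M5 code): an L1
>>>>>                                miss turns a word-sized access into
>>>>>                                a block-aligned coherent fill, which
>>>>>                                a bare walker never issues itself.
>>>>>
>>>>>                                    BLOCK = 64
>>>>>                                    def block_fill(addr):
>>>>>                                        # align down to the block
>>>>>                                        base = addr & ~(BLOCK - 1)
>>>>>                                        return (base, BLOCK)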
>>>>>
>>>>>                                Steve
>>>>>
>>>>>                                On Mon, Nov 22, 2010 at 12:00 PM,
>>>>>                                Gabe Black
>>>>>                                <gbl...@eecs.umich.edu> wrote:
>>>>>
>>>>>                                   What happens if an entry is in the
>>>>>                                L1 but not the L2?
>>>>>
>>>>>                                   Gabe
>>>>>
>>>>>                                   Ali Saidi wrote:
>>>>>> Between the l1 and l2 caches seems like a good place to me. The
>>>>>> caches can cache page table entries, otherwise a tlb miss would be
>>>>>> even more expensive than it is. The l1 isn't normally used for such
>>>>>> things since it would get polluted (look at why sparc has a "load
>>>>>> 128 bits from l2, do not allocate into l1" instruction).
>>>>>>
>>>>>> Ali
>>>>>>
>>>>>> On Nov 22, 2010, at 4:27 AM, Gabe Black wrote:
>>>>>>
>>>>>>> For anybody waiting for an x86 FS regression (yes, I know, you can
>>>>>>> all hardly wait, but don't let this spoil your Thanksgiving) I'm
>>>>>>> getting closer to having it working, but I've discovered some
>>>>>>> issues with the mechanisms behind the --caches flag with fs.py and
>>>>>>> x86. I'm surprised I never thought to try it before. It also
>>>>>>> brings up some questions about where the table walkers should be
>>>>>>> hooked up in x86 and ARM. Currently it's after the L1, if any, but
>>>>>>> before the L2, if any, which seems wrong to me. Also caches don't
>>>>>>> seem to propagate requests upwards to the CPUs, which may or may
>>>>>>> not be an issue. I'm still looking into that.
>>>>>>>
>>>>>>> Gabe
>>
>> X86, ARM: Add L1 caches for the TLB walkers.
>>
>> Small L1 caches are connected to the TLB walkers when caches are used. This
>> allows them to participate in the coherence protocol properly.
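>>
>> (For context: this path would be exercised along the lines of the
>> following, where --caches selects the CacheConfig.py branch below.
>> The command is for illustration only and assumes an X86_FS build.)
>>
>>     ./build/X86_FS/m5.opt configs/example/fs.py --caches --l2cache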
>>
>> diff --git a/configs/common/CacheConfig.py b/configs/common/CacheConfig.py
>> --- a/configs/common/CacheConfig.py
>> +++ b/configs/common/CacheConfig.py
>> @@ -43,8 +43,14 @@
>>
>>     for i in xrange(options.num_cpus):
>>         if options.caches:
>> -            system.cpu[i].addPrivateSplitL1Caches(L1Cache(size = '32kB'),
>> -                                                  L1Cache(size = '64kB'))
>> +            if buildEnv['TARGET_ISA'] in ['x86', 'arm']:
>> +                system.cpu[i].addPrivateSplitL1Caches(L1Cache(size = '32kB'),
>> +                                                      L1Cache(size = '64kB'),
>> +                                                      PageTableWalkerCache(),
>> +                                                      PageTableWalkerCache())
>> +            else:
>> +                system.cpu[i].addPrivateSplitL1Caches(L1Cache(size = '32kB'),
>> +                                                      L1Cache(size = '64kB'))
>>         if options.l2cache:
>>             system.cpu[i].connectMemPorts(system.tol2bus)
>>         else:
>> diff --git a/configs/common/Caches.py b/configs/common/Caches.py
>> --- a/configs/common/Caches.py
>> +++ b/configs/common/Caches.py
>> @@ -42,6 +42,14 @@
>>     mshrs = 20
>>     tgts_per_mshr = 12
>>
>> +class PageTableWalkerCache(BaseCache):
>> +    assoc = 2
>> +    block_size = 64
>> +    latency = '1ns'
>> +    mshrs = 10
>> +    size = '1kB'
>> +    tgts_per_mshr = 12
>> +
>> class IOCache(BaseCache):
>>     assoc = 8
>>     block_size = 64
>> diff --git a/src/cpu/BaseCPU.py b/src/cpu/BaseCPU.py
>> --- a/src/cpu/BaseCPU.py
>> +++ b/src/cpu/BaseCPU.py
>> @@ -166,7 +166,7 @@
>>             if p != 'physmem_port':
>>                 exec('self.%s = bus.port' % p)
>>
>> -    def addPrivateSplitL1Caches(self, ic, dc):
>> +    def addPrivateSplitL1Caches(self, ic, dc, iwc = None, dwc = None):
>>         assert(len(self._mem_ports) < 8)
>>         self.icache = ic
>>         self.dcache = dc
>> @@ -175,12 +175,17 @@
>>         self._mem_ports = ['icache.mem_side', 'dcache.mem_side']
>>         if buildEnv['FULL_SYSTEM']:
>>             if buildEnv['TARGET_ISA'] in ['x86', 'arm']:
>> -                self._mem_ports += ["itb.walker.port", "dtb.walker.port"]
>> +                self.itb_walker_cache = iwc
>> +                self.dtb_walker_cache = dwc
>> +                self.itb.walker.port = iwc.cpu_side
>> +                self.dtb.walker.port = dwc.cpu_side
>> +                self._mem_ports += ["itb_walker_cache.mem_side", \
>> +                                    "dtb_walker_cache.mem_side"]
>>             if buildEnv['TARGET_ISA'] == 'x86':
>>                 self._mem_ports += ["interrupts.pio", "interrupts.int_port"]
>>
>> -    def addTwoLevelCacheHierarchy(self, ic, dc, l2c):
>> -        self.addPrivateSplitL1Caches(ic, dc)
>> +    def addTwoLevelCacheHierarchy(self, ic, dc, l2c, iwc = None, dwc = None):
>> +        self.addPrivateSplitL1Caches(ic, dc, iwc, dwc)
>>         self.toL2Bus = Bus()
>>         self.connectMemPorts(self.toL2Bus)
>>         self.l2cache = l2c
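>>
>> (Aside: a hypothetical caller of the extended signature, assuming some
>> L2Cache class is in scope -- illustration only, not part of the patch.)
>>
>>     cpu.addTwoLevelCacheHierarchy(L1Cache(size = '32kB'),
>>                                   L1Cache(size = '64kB'),
>>                                   L2Cache(size = '2MB'),
>>                                   PageTableWalkerCache(),
>>                                   PageTableWalkerCache())
>>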
>> diff --git a/src/cpu/o3/O3CPU.py b/src/cpu/o3/O3CPU.py
>> --- a/src/cpu/o3/O3CPU.py
>> +++ b/src/cpu/o3/O3CPU.py
>> @@ -141,7 +141,7 @@
>>     smtROBThreshold = Param.Int(100, "SMT ROB Threshold Sharing Parameter")
>>     smtCommitPolicy = Param.String('RoundRobin', "SMT Commit Policy")
>>
>> -    def addPrivateSplitL1Caches(self, ic, dc):
>> -        BaseCPU.addPrivateSplitL1Caches(self, ic, dc)
>> +    def addPrivateSplitL1Caches(self, ic, dc, iwc = None, dwc = None):
>> +        BaseCPU.addPrivateSplitL1Caches(self, ic, dc, iwc, dwc)
>>         self.icache.tgts_per_mshr = 20
>>         self.dcache.tgts_per_mshr = 20
>> diff --git a/tests/configs/realview-simple-atomic.py b/tests/configs/realview-simple-atomic.py
>> --- a/tests/configs/realview-simple-atomic.py
>> +++ b/tests/configs/realview-simple-atomic.py
>> @@ -53,6 +53,17 @@
>>     write_buffers = 8
>>
>> # ---------------------
>> +# Page table walker cache
>> +# ---------------------
>> +class PageTableWalkerCache(BaseCache):
>> +    assoc = 2
>> +    block_size = 64
>> +    latency = '1ns'
>> +    mshrs = 10
>> +    size = '1kB'
>> +    tgts_per_mshr = 12
>> +
>> +# ---------------------
>> # I/O Cache
>> # ---------------------
>> class IOCache(BaseCache):
>> @@ -86,7 +97,9 @@
>>
>> #connect up the cpu and l1s
>> cpu.addPrivateSplitL1Caches(L1(size = '32kB', assoc = 1),
>> -                            L1(size = '32kB', assoc = 4))
>> +                            L1(size = '32kB', assoc = 4),
>> +                            PageTableWalkerCache(),
>> +                            PageTableWalkerCache())
>> # connect cpu level-1 caches to shared level-2 cache
>> cpu.connectMemPorts(system.toL2Bus)
>> cpu.clock = '2GHz'
>> diff --git a/tests/configs/realview-simple-timing.py b/tests/configs/realview-simple-timing.py
>> --- a/tests/configs/realview-simple-timing.py
>> +++ b/tests/configs/realview-simple-timing.py
>> @@ -54,6 +54,17 @@
>>     write_buffers = 8
>>
>> # ---------------------
>> +# Page table walker cache
>> +# ---------------------
>> +class PageTableWalkerCache(BaseCache):
>> +    assoc = 2
>> +    block_size = 64
>> +    latency = '1ns'
>> +    mshrs = 10
>> +    size = '1kB'
>> +    tgts_per_mshr = 12
>> +
>> +# ---------------------
>> # I/O Cache
>> # ---------------------
>> class IOCache(BaseCache):
>> @@ -88,7 +99,9 @@
>>
>> #connect up the cpu and l1s
>> cpu.addPrivateSplitL1Caches(L1(size = '32kB', assoc = 1),
>> -                            L1(size = '32kB', assoc = 4))
>> +                            L1(size = '32kB', assoc = 4),
>> +                            PageTableWalkerCache(),
>> +                            PageTableWalkerCache())
>> # connect cpu level-1 caches to shared level-2 cache
>> cpu.connectMemPorts(system.toL2Bus)
>> cpu.clock = '2GHz'
