Hi Ali, Lisa, and Steve,

Thanks a lot for your help.

Following Ali's advice, I have tried two methods to implement the time measurement
in the guest system, but both still give the same result I described in my first
mail: the I/O access time in CacheCPU is still much larger than in DetailedCPU.

Here are my methods.
Method 1. As I did before, I added an m5 instruction to the simulator. The
following code fragments come from the files isa_desc and pseudo_inst.cc in my
m5 code.
In isa_desc:
...
            0x24: gettick({{
                // (Rb & 0) adds a fake dependency on Rb, as in the rpcc case
                Ra = AlphaPseudo::gettick(xc->xcBase()) + (Rb & 0);
            }}, No_OpClass, IsNonSpeculative);
...
In pseudo_inst.cc:
...
    uint64_t
    gettick (ExecContext *xc)
    {
        return curTick;
    }
...
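(For reference, here is a rough sketch of the kind of guest-side wrapper the
gettick() calls in the driver code below go through. The instruction encoding
is an assumption, written in the style of m5's m5op.S stubs and assuming the
0x24 slot above sits under the reserved pseudo-op opcode; only the register
roles come from the decode fragment above.)

/*
 * Sketch of a guest-side wrapper for the new pseudo-op.  The .long
 * encoding is an assumption (reserved opcode 0x01, Ra = $0, Rb = $1,
 * function code 0x24, in the style of m5op.S); only the register roles
 * (Ra = result, Rb = fake dependency) come from the decode fragment above.
 */
static inline int64_t gettick(int64_t dep)
{
    register int64_t ra asm("$0");        /* Ra: receives curTick */
    register int64_t rb asm("$1") = dep;  /* Rb: fake dependency  */

    asm volatile (".long 0x04010024"      /* assumed gettick encoding */
                  : "=r"(ra)
                  : "r"(rb)
                  : "memory");
    return ra;
}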
And I wrapped the readl and writel calls in the Linux driver with the following
functions:
static void __stat_writel(u32 v, volatile void __iomem *addr)
{
    int64_t before, after;

    after = 0;
    before = gettick(after);          /* depends on 'after' (just initialized) */
    writel(v, addr);
    after = gettick((int64_t)addr);   /* fake dependency on the store address  */

    if (enable_stat) {
        iow_cycles += (after - before);
        iow_count++;
    }
}

static u32 __stat_readl(const volatile void __iomem *addr)
{
    int64_t before, after;
    u32 ret;

    after = 0;
    before = gettick(after);          /* depends on 'after' (just initialized) */
    ret = readl(addr);
    after = gettick(ret);             /* depends on the loaded value           */

    if (enable_stat) {
        ior_cycles += (after - before);
        ior_count++;
    }

    return ret;
}
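(For completeness, the counters above are dumped at the end of the interrupt
routine roughly as in the sketch below; the irq_count/irq_cycles names and the
exact printk format are assumptions reconstructed from the console log further
down.)

/* Sketch of the per-interrupt report; irq_count and irq_cycles are assumed
 * names, while the ior_ and iow_ counters are the ones updated in the
 * wrappers above. */
printk("XXXXXXXisrXXXXXXX\n");
printk("irq count:%d, used %ld cycles\n", irq_count, (long)irq_cycles);
printk("\tIn this irq:\n");
printk("\tio read count:%ld, used %ld cycles\n",
       (long)ior_count, (long)ior_cycles);
printk("\tio write count:%ld, used %ld cycles\n",
       (long)iow_count, (long)iow_cycles);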
Method 2. I use the rpcc instruction to get the tick value instead of my pseudo
instruction. The only difference in the driver code is that the gettick function
is replaced by the _rpcc function below, which is a wrapper around the rpcc
instruction.
static __inline int64_t _rpcc(int64_t dep)
{
    int64_t res;
    /* 'dep' is only there to force an ordering dependency */
    asm volatile ("rpcc %0, %1" : "=r"(res) : "r"(dep) : "memory");
    return res;
}
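(One aside that may or may not matter here: architecturally, rpcc returns the
free-running cycle counter in the low 32 bits and a PALcode-maintained offset
in the high 32 bits, so differences should normally be taken on the low
longword only; Ali's cycleCounter below returns uint32_t, which already does
this. A minimal sketch of what I mean:)

/* Hypothetical helper: rpcc's high longword is a PALcode offset, not
 * cycles, so the delta is taken on the low 32 bits (wraps mod 2^32). */
static __inline u32 rpcc_delta(int64_t start, int64_t end)
{
    return (u32)end - (u32)start;
}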

I'm confused by this problem. I don't think an I/O register access can be as
fast as a cache access; it looks as if the data came from the cache rather than
from the device. Since the instruction dependency has now been added to my
measurement code, the out-of-order model should not be the real reason. I also
found something strange: in the DetailedCPU model, the first I/O access time is
similar to the CacheCPU model. The text below comes from the console log; the
strings in it are printed by printk in the Linux driver.

XXXXXXXisrXXXXXXX
irq count:16, used 3973 cycles
        In this irq:
        io read count:1, used 1556 cycles
        io write count:1, used 1604 cycles
XXXXXXXisrXXXXXXX
irq count:17, used 4736 cycles
        In this irq:
        io read count:2, used 3240 cycles
        io write count:0, used 0 cycles
        
...from here, the simulator entered the detailed mode...        
XXXXXXXisrXXXXXXX
irq count:18, used 4169 cycles
        In this irq:
        io read count:2, used 1623 cycles
        io write count:0, used 0 cycles
XXXXXXXisrXXXXXXX
irq count:19, used 3818 cycles
        In this irq:
        io read count:1, used 9 cycles
        io write count:1, used 11 cycles

Could anyone give me some more advice? Thanks a lot!

Richard R. Zhang
2006-05-12



From: Ali Saidi
Sent: 2006-04-27 12:14:54
To: Steve Reinhardt
Cc: Richard R. Zhang; Lisa Hsu; m5sim-users
Subject: Re: [m5sim-users] Is there something wrong with the io access latency?

I believe Steve is exactly correct: the out-of-order model is not enforcing a
dependency between your two instructions. The way to fix it is to force a
dependency on a register (for example, the result of the load). You need to do
this both in the decoder and in the code that executes the instruction.

For example, for the rpcc instruction (this code may be a little bit newer than
yours, but same idea):

   /* Rb is a fake dependency so here is a fun way to get the parser
      to understand that. */
   Ra = xc->readMiscRegWithEffect(AlphaISA::IPR_CC, fault) + (Rb & 0);

and in some code:

inline uint32_t cycleCounter(uint32_t dep)
{
    uint32_t res;
    asm volatile ("rpcc %0, %1" : "=r"(res) : "r"(dep) : "memory");
    return res;
}

   t1 = cycleCounter(trash);
   for (x = 0; x < count; x++) {
       trash = readl(addr);
       t2 = cycleCounter(trash);
   }


Ali

On Apr 26, 2006, at 9:59 PM, Steve Reinhardt wrote:

>
> My guess would be that it has to do with the out-of-order scheduling in the
> detailed CPU.  If the instruction that reads curTick has no dependence on
> the read or write instructions, then it will get executed out-of-order while
> the read or write is still stalled.
>
> I remember that we ran into this problem ourselves but I don't remember the
> details of how we solved it... Ali or Nate, can you help here?
>
> Steve
>
> Richard R. Zhang wrote:
> > Hi Lisa and all M5 users,
> >
> > I found something strange with the I/O access latency. Could you give me a
> > hint about it?
> >
> > I have added a new instruction to the Alpha ISA. This instruction can get
> > the value of curTick in M5, and it seems to work correctly, so I plan to
> > use it to measure time in the guest OS. I then added some statements to
> > the ns83820 driver. These statements compute the time used by the driver's
> > irq routine and by the I/O accesses (just the time spent in writel and
> > readl). But the results below puzzle me, and I can't explain them. They
> > come from a netperf maerts test under Sampler mode, and the memory
> > configuration is STE.
> >
> >                      | CacheCPU mode | DetailedCPU mode
> >  --------------------+---------------+-----------------
> >  avg. io read time   | 1581 cycles   | 40 cycles
> >  avg. io write time  | 1561 cycles   | 9 cycles
> >
> > I don't know why the I/O access time in CacheCPU is so much bigger than in
> > DetailedCPU. I think the time in CacheCPU mode should be less than, or at
> > least equal to, that in DetailedCPU mode. This is strange to me. Could
> > anybody give me an explanation? Thanks a lot.
> >
> > Best wishes,
> > Richard R. Zhang
> > 2006-04-26
