Re: [Open-graphics] Re: PCI and flash

Timothy Miller Sat, 20 Aug 2005 21:51:00 -0700

On 8/20/05, Brian Magnuson <[EMAIL PROTECTED]> wrote:

> > >
> > > sclk will be a 1/2 version of clock which I'm taking to be the PCI clock
> > > (66 MHz?)  At this speed the steady state read bandwidth is 4.125MB/s so
> > > the FPGA program will take about 0.33s which should be fine.
> >
> > PCI will be 33, 66, or 133.  And for a PCI-X to PCIe bridge, I'd like
> > to see if we can't do even faster.
> >
> > > This also means that
> > > between each read data word from PCI there will be 64 wait states.  Can
> > > we do this?
> >
> > Sort of.  Most systems should handle this, but what will happen is that
> > after 16 wait states, we'll terminate the transaction with retry.
> > That'll happen a few times before the data arrives, and then we'll
> > grab it when it arrives.  As such, I have some suggestions:
> >
> > (1) Don't try to allow a stream of accesses.  Make them atomic.  If we
> > could do one per clock, that would be one thing, but since it takes so
> > long, there's no point in having the logic necessary to pipeline
> > anything or whatever.  Just don't accept a new request while one is in
> > progress, etc.  If you notice, my assumed interface has "busy" and
> > "read valid" signals.  The busy means it's busy doing anything and
> > that requests will be ignored.
> >
> 
> There wasn't really any pipelining going on.  A read would simply continue
> from the starting address until told to stop.  Busy also exists, although I
> called it ready.  I rather like this mode for programming.  There's a lot of
> overhead in starting and finishing a command, but once started a read gives
> data on every clock.  If the other client (PCI) just wants one word it can
> assert done along with the command.


Ok, I see what you're talking about.  In that case, we should do
something like cache 16 words.  (16 words because of the nature of
distributed RAMs in these FPGAs.)  Either we give it its own cache, or
we combine that with the cache in pci_send.  This, I think, is a good
compromise between speed and cache size.  The only problem is if we
get REALLY random access.  In that case, we should detect when a read
is a cache miss, abort the current read transaction, invalidate the
cache, and start again.  Oh, and to complicate things further, we
should start at whatever word was requested, and then wrap around at
the end of the cache line.  The logic for reading from a cache is
pretty straightforward and can be copied from pci_send.
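
To make the wrap-around concrete, here is a rough behavioral sketch in
Python (not Verilog, and not taken from pci_send).  The names
CACHE_WORDS, fill_order, and is_hit are made up for illustration, and
it assumes word addresses and a line aligned to 16 words:

```python
CACHE_WORDS = 16  # one cache line, matching the 16-word figure above


def fill_order(requested_word_addr):
    """Return word addresses in the order a line fill would fetch them.

    The fill starts at the requested word and wraps around at the end
    of the (assumed 16-word-aligned) cache line.
    """
    start = requested_word_addr % CACHE_WORDS
    base = requested_word_addr - start
    return [base + (start + i) % CACHE_WORDS for i in range(CACHE_WORDS)]


def is_hit(line_base, valid, word_addr):
    """Check one request against the line.

    line_base is the address of word 0 of the cached line (None if the
    cache is invalid); valid holds one flag per word filled so far.
    """
    if line_base is None:
        return False
    offset = word_addr - line_base
    return 0 <= offset < CACHE_WORDS and valid[offset]
```

With this, a request for word 0x2E would fill 0x2E, 0x2F, then 0x20
through 0x2D, so the word the host is actually waiting for arrives
first.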

It's also important to check the address every time.  I don't want to
just assume that what looked like a read from a given address is
what's going to be requested again.

There may be PCI spec rules against this, but let's say one master
requests a read at address 0x0020, and the bus transaction times out
(abort with retry).  Then another bus master comes in trying to read
0x0030.  If we assume that the next read is 0x0020, then we might
hand the other master the wrong data.  Mind you, such a situation IS a
mess, but I also don't want some spurious thing to hand back bad data
either.  If a request for 0x0020 comes in, and the transaction times
out, I want to treat the next transaction as a separate one, even
though it's most likely going to be the same address.
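
Here is a toy Python model of that rule, just to pin down the behavior.
Everything in it is an assumption for illustration: the ReadPort name,
the fixed latency (which must be at least 1 clock), and the
simplification to a single cached word.  A request only gets data when
its address matches what was actually fetched; anything else is
answered with retry and treated as a fresh read:

```python
class ReadPort:
    """One outstanding flash read, checked by address on every request."""

    def __init__(self, flash, latency):
        self.flash = flash        # dict: address -> data word
        self.latency = latency    # clocks until a started read completes
        self.pending_addr = None  # address of the read in flight, if any
        self.clocks_left = 0
        self.cached = None        # (address, data) once a read finishes

    def tick(self, n=1):
        """Advance the model by n clocks."""
        for _ in range(n):
            if self.pending_addr is not None:
                self.clocks_left -= 1
                if self.clocks_left == 0:
                    addr = self.pending_addr
                    self.cached = (addr, self.flash[addr])
                    self.pending_addr = None

    def request(self, addr):
        """Return data on an address match, or None meaning 'retry'."""
        if self.cached is not None and self.cached[0] == addr:
            return self.cached[1]
        if self.pending_addr != addr:
            # Miss: abort whatever is in flight, invalidate, start over.
            self.cached = None
            self.pending_addr = addr
            self.clocks_left = self.latency
        return None
```

In the two-master case above, the second master asking for 0x0030
after the 0x0020 data has arrived gets a retry and its own read, never
the 0x0020 word.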

> 
> >
> > (2) When a read request arrives, start the state machine.  When the
> > read is completed, cache it (one word cache), and compare the address
> > we're requesting.  When the address matches, assert the "data valid",
> > asynchronously, unless we have to register it for speed, which I'll
> > figure out and hack later.  Also, have a timer that invalidates the
> > word after a while, say 128 cycles after it's arrived.  See my
> > pci_send block's cache logic for an example of that.
> 
> Not sure that caching is helpful here, especially a single word cache.
> Wouldn't the typical access pattern be a streaming read?  This just seems sort
> of complicated.

Yeah.  Above, you're saying that the number of clocks for one read is
X+Y, while multiple is X+N*Y.  If X is large, it's good to read
multiple words and cache them.  Typical PATTERNS are streams, but you
can't assume that it will ALWAYS be a stream, lest you make an
incorrect assumption and return bad data.
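
To put rough numbers on that, a small Python sketch.  Y = 64 comes
from the 64 wait states per word mentioned earlier; X = 80 is a
made-up command overhead, purely for illustration:

```python
def clocks_separate(n_words, x, y):
    """N single-word reads, each paying the full command overhead X."""
    return n_words * (x + y)


def clocks_streamed(n_words, x, y):
    """One streaming read of N words: overhead X paid once."""
    return x + n_words * y


X = 80   # assumed command start/finish overhead, in PCI clocks
Y = 64   # clocks per 32-bit word
print(clocks_separate(16, X, Y))  # 2304
print(clocks_streamed(16, X, Y))  # 1104
```

The larger X is relative to Y, the more the 16-word fetch pays off.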

> 
> > >
> > > 21 state, one-hot state machine.  Lots of states but the next state logic
> > > for the most part is dead simple.
> > >
> > > I don't have much in the way of a verification environment yet.
> > > Apparently SST does not provide a model of this part.  You have to go
> > > through Denali and use their commercial tool which even if we could get
> > > it for free probably won't play all that well with iverilog.  So somebody
> > > needs to get a BFM together to see if this thing really works.
> >
> > I might be coaxed into writing a simulation model of the PROM chip.
> > Can you point me to the appropriate reference materials?
> 
> http://www.sst.com/downloads/datasheet/S71271.pdf

I hope to hop on that next week then.  :)

> > > Speaking of verification have you given any thought to a generalized test
> > > framework?  Unit tests, regressions, C test interface, etc...
> >
> > I have thought that it should happen.  :)  Actually, I kinda started
> > on it for PCI.  I have some tasks that manipulate PCI signals, and I
> > thought we could expand that.  We could even write another state
> > machine to act as the host so we can simulate DMA and stuff,
> > eventually.
> 
> Heh.  I don't have any grand plans either yet.  Just wondering if maybe you
> did. :)

Nothing specific.  But I figure we could start with some simple tasks
like pci_pio_write and pci_pio_read that a test bench could call to
access the bus.  We did that with TROZ and made something that queued
them, detected bursts, and handled them appropriately.
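
As a sketch of the queue-and-detect-bursts idea (in Python rather than
as Verilog tasks, and not the TROZ code), something like the following
could sit behind pci_pio_write and pci_pio_read.  The coalesce name,
the tuple format, and the 4-byte word size are all assumptions:

```python
def coalesce(ops, word_bytes=4):
    """Group queued PIO operations into bursts.

    ops is a list of ("read", addr) or ("write", addr, data) tuples in
    issue order.  Consecutive operations of the same kind at
    consecutive word addresses are merged into one burst.
    """
    bursts = []
    for op in ops:
        kind, addr = op[0], op[1]
        data = list(op[2:])  # empty for reads
        last = bursts[-1] if bursts else None
        if (last is not None and last["kind"] == kind
                and addr == last["addr"] + word_bytes * last["count"]):
            last["count"] += 1
            last["data"] += data
        else:
            bursts.append(
                {"kind": kind, "addr": addr, "count": 1, "data": data})
    return bursts
```

Two writes to 0x100 and 0x104 would come out as a single two-word
write burst, while a following write to 0x200 starts a new one.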

> > > There will be two clients (PCI and FPGA) but I haven't built any
> > > arbitration into the module.  The basic idea is that the FPGA gets
> > > priority and the flash is unavailable to PCI while programming is in
> > > progress.  Can't see how this would be a problem since it's probably bad
> > > form to be programming the flash as it's written into the FPGA. :)
> >
> > Well, I thought about that a bit too, and you're not going to like the
> > answer.  I think PCI MUST get priority, because the host may try to
> > POST the device and map it (being a graphics device) before the FPGA
> > is programmed.  As such, we need to be able to read BIOS while
> > programming the FPGA.  Sucks, neh?
> 
> Yeah, a little more complicated, but that logic belongs outside of this
> module.  I'm thinking in the arbiter, which will need a way to tell the
> requestor (the FPGA programming logic) that it's being preempted.  The FPGA
> programmer will then need the smarts to start its next request at the
> appropriate address.  Not a problem.

Ok, so perhaps we can have your word-oriented block that caches N
words, and then we have another arbiter block that's responsible for
programming the FPGA and marshalling requests from the host.
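
A minimal Python sketch of that arbitration policy, assuming the flash
is granted in fixed "slots" and that the host's requests are known per
slot (both simplifications; the arbitrate name and its arguments are
hypothetical).  PCI always wins, and the FPGA programmer picks up at
the address where it was preempted:

```python
def arbitrate(pci_requests, fpga_start, fpga_words):
    """Yield (client, word_address) grants, one per flash access slot.

    pci_requests maps slot number -> word address the host wants in
    that slot.  The FPGA programmer streams fpga_words words starting
    at fpga_start and is preempted whenever the host has a request.
    """
    next_fpga = fpga_start
    end = fpga_start + fpga_words
    last_pci = max(pci_requests, default=-1)
    slot = 0
    while next_fpga < end or slot <= last_pci:
        if slot in pci_requests:
            # Host gets priority; the programmer's address is kept.
            yield ("pci", pci_requests[slot])
        elif next_fpga < end:
            yield ("fpga", next_fpga)
            next_fpga += 1
        slot += 1
```

For example, list(arbitrate({1: 0x500}, 0, 3)) interleaves a host read
of 0x500 between FPGA words 0 and 1, and the programmer still sees
words 0, 1, 2 in order.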

> > > There's a single, 8 bit status/control register in the flash.  We could
> > > make this available at an address just above the top of the flash.
> >
> > We could map it to a few different places, such as PCI cfg space.
> > Another thing we should consider is to map the first so many registers
> > of the engine space to the Lattice.
> 
> I'll defer to you here.

It's always better to memory-map.  For debugging purposes, it would be
good to be able to access this stuff from user space.

_______________________________________________
Open-graphics mailing list
[email protected]
http://lists.duskglow.com/mailman/listinfo/open-graphics
List service provided by Duskglow Consulting, LLC (www.duskglow.com)
