The PCI core for OGP is a Moore state machine, which means that the
outputs (PCI control signals, etc.) are a function of the current state
of the machine only.  There is no direct connection between the inputs
and the outputs.  However, the NEXT state of the machine is affected by
the inputs, and those inputs come in three varieties:

- The current state
- Inputs from other logic inside of the chip
- Inputs from OUTSIDE of the chip

It's the last set that's a serious problem.  While the first two are
"instantly" available from registers in the logic of the chip, the
last group suffers LONG propagation delays through the logic in another
device on the bus, the output driver in that device, the bus wires,
and finally, the input buffers in our chip.  This can take many
precious nanoseconds away from the time we have to combine all the
inputs in our logic to compute the next state.

To cope with this, you need to move the logic that uses the slow
inputs as close to the end of the combinatorial logic as possible.  
Unfortunately, logic synthesizers aren't always so smart about
rearranging your logic to take into account the added delay on those
signals.  Some are better than others, but the solution I found that
works well is to carefully construct the logic so that the synthesizer
has no choice about it.

Consider the PCI target.  For that, there are really only two slow
inputs that matter:  frame and irdy.  Given those two signals, there
are four possible combinations:

irdy=0, frame=0
irdy=0, frame=1
irdy=1, frame=0
irdy=1, frame=1

So in my state machine, what I decided to do was generate four
possible next states, one for each combination.  If you look through
older revisions of what I checked into SVN, you'll see variables like
"next_state_f0i0".  The structure of the logic computes all four
(which does increase the size a bit), and then finally at the end
muxes them together based on the slow inputs.  This results in plenty
of slack on the inputs.
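
A minimal sketch of that structure, in Python purely for illustration
(the real thing is HDL).  The three states, the "fast" input, and the
transitions are all invented for the example; only the naming pattern
"next_state_f0i0" comes from the description above.

```python
from itertools import product

def next_state_direct(state, fast, frame, irdy):
    """Toy next-state function; states and transitions are invented."""
    if state == 0:
        return 1 if frame == 0 else 0
    if state == 1:
        return 2 if fast else 0
    # state == 2
    return 0 if (frame == 1 and irdy == 0) else 2

def next_state_late_mux(state, fast, frame, irdy):
    # Each candidate sees only the fast signals; the slow inputs are
    # replaced by constants, one candidate per combination.
    next_state_f0i0 = next_state_direct(state, fast, 0, 0)
    next_state_f0i1 = next_state_direct(state, fast, 0, 1)
    next_state_f1i0 = next_state_direct(state, fast, 1, 0)
    next_state_f1i1 = next_state_direct(state, fast, 1, 1)
    # Only this final 4:1 select depends on the slow inputs.
    mux = ((next_state_f0i0, next_state_f0i1),
           (next_state_f1i0, next_state_f1i1))
    return mux[frame][irdy]

def equivalent():
    """Check that both formulations agree on every input combination."""
    return all(
        next_state_direct(s, fa, f, i) == next_state_late_mux(s, fa, f, i)
        for s, fa, f, i in product(range(3), (0, 1), (0, 1), (0, 1))
    )
```

Both functions compute the same thing; the difference is only in where
frame and irdy enter, which is what matters for timing in hardware.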

Sounds great, right?  Until I decided to integrate in the Master
logic.  Now, in addition to irdy and frame, I have to pay attention to
four more input signals:  gnt, trdy, stop, devsel.

Now we have a problem.  If I were to continue the theme, I would have
to generate 64 different possible next states for each state before
muxing them at the end.  Not only would it blow up the required amount
of logic, but it would require so much repetitiveness in the logic
that no one would be able to write it or debug it in any reasonable
amount of time.  With something like 18 different states, times 64
possible next states for each, that's over a thousand next-state
expressions and a hell of a lot of lines of code.
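
The arithmetic behind that, as a quick check (the state count of 18 is
the rough figure quoted above, not an exact one):

```python
slow_inputs = ("frame", "irdy", "gnt", "trdy", "stop", "devsel")

# Every slow input doubles the number of candidate next states.
combinations = 2 ** len(slow_inputs)

states = 18  # "something like 18", per the text
candidates = states * combinations

print(combinations, candidates)  # prints: 64 1152
```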

As a result, I've dragged my feet a bit since the last time I worked
on it.  One solution I came up with was to generate separate sets of
target and master states.  This reduced the number of combinations
down to 20 from 64 (4 for the target plus 16 for the master) and even
saved me from having to write actual code for the set that wasn't
going to be considered.  But that's still too many for me to want to
deal with.
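
Where the 20 comes from, assuming the target states look only at frame
and irdy and the master states look only at gnt, trdy, stop and devsel
(that split is inferred from the counts, not stated outright):

```python
from itertools import product

target_inputs = ("frame", "irdy")                  # assumed target-only set
master_inputs = ("gnt", "trdy", "stop", "devsel")  # assumed master-only set

target_combos = list(product((0, 1), repeat=len(target_inputs)))
master_combos = list(product((0, 1), repeat=len(master_inputs)))

# A given state belongs to one set or the other, so the sizes add
# instead of multiplying.
split_total = len(target_combos) + len(master_combos)
print(len(target_combos), len(master_combos), split_total)  # prints: 4 16 20
```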

So, here's what dawned on me tonight:  Of those 64 combinations, only
a handful are meaningful.  For instance, any time you're in a master
state and gnt=1 (deasserted), no other signals matter.  gnt=1 means
that this is your last cycle as a master.  As such, 32 of your 64
combinations (or 8 of your 16) are all the same thing and there's
absolutely no reason to bother with any of them separately.
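
To make the gnt example concrete, here is a small enumeration of the 16
master combinations.  The only rule applied is the gnt=1 one from
above; the label "grant_lost" is a made-up name, and the remaining
combinations are left untouched even though more of them may turn out
to be redundant.

```python
from itertools import product

def master_class(gnt, trdy, stop, devsel):
    """Collapse every gnt=1 combination into a single class."""
    if gnt == 1:
        return "grant_lost"        # hypothetical label
    return (trdy, stop, devsel)    # not reduced any further here

master_combos = list(product((0, 1), repeat=4))  # (gnt, trdy, stop, devsel)
collapsed = [c for c in master_combos if c[0] == 1]
classes = {master_class(*c) for c in master_combos}

print(len(collapsed), len(classes))  # prints: 8 9
```

So this one rule alone takes the master side from 16 cases to 9.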

So, here's the solution:  Of all 64 combinations, go through and
identify which ones are meaningful and which ones are meaningless or
redundant.  Add a thin layer of logic that reduces those combinations
to a much smaller number.  Now, the state machine only has to compute
that number of possible "next states".  (Whether those states are
numbered in binary or one-hot is something to be determined as a
result of trying to synthesize it, but I'm going to start with
binary.)
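
A sketch of how the two layers might fit together for the master side,
again in Python for illustration only.  The condition numbering, the
names, and the toy candidate table are assumptions; the only reduction
rule used is the gnt=1 one described earlier.

```python
GRANT_LOST = 0       # hypothetical condition number for gnt=1
NUM_CONDITIONS = 9   # GRANT_LOST plus 8 combinations of trdy/stop/devsel

def reduce_master_inputs(gnt, trdy, stop, devsel):
    """Thin layer: fold the slow inputs into one small binary number."""
    if gnt == 1:
        return GRANT_LOST
    return 1 + (trdy << 2 | stop << 1 | devsel)   # 1..8

def toy_candidates(state):
    """One candidate next state per condition, from fast signals only.

    Invented behaviour: losing the grant returns to state 0, anything
    else stays in the current state.
    """
    return [0] + [state] * (NUM_CONDITIONS - 1)

def next_state(state, candidates_for, gnt, trdy, stop, devsel):
    candidates = candidates_for(state)                    # fast path
    cond = reduce_master_inputs(gnt, trdy, stop, devsel)  # slow path, shallow
    return candidates[cond]                               # late mux
```

The state machine proper only ever writes NUM_CONDITIONS candidates per
state, and the slow inputs pass through nothing deeper than the
reduction layer and the final select.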

For completeness, since there's a dissociation between target and
master states, there are two more inputs for selecting this number:
idle and master-state.  A target state is when neither is true.
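
In sketch form, the mode decode could look like this.  The function
name, the priority given to idle if both flags were ever set, and the
per-mode input lists are assumptions; which slow inputs the idle state
consults is not spelled out here, so it is left unspecified.

```python
def select_mode(idle, master_state):
    """Decode the two fast flags into the active mode."""
    if idle:
        return "idle"      # assumed to win if both flags are set
    if master_state:
        return "master"
    return "target"        # neither flag true

# Slow inputs assumed relevant in each mode (inferred, not confirmed).
SLOW_INPUTS_BY_MODE = {
    "target": ("frame", "irdy"),
    "master": ("gnt", "trdy", "stop", "devsel"),
    "idle": None,          # not specified in the text
}
```

Both flags come straight from the state register, so this decode sits
on the fast side and costs the slow inputs nothing.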

Hopefully, partitioning the logic this way will keep the distance
between the slow inputs and the state registers to a minimum.  Not
partitioning, where inputs are considered directly in the next-state
computation, would leave the slack time for slow inputs uncontrolled
and severely limit the speed.
