Re: [Open-graphics] Sample VGA translation code, for nanocontroller

Petter Urkedal Sun, 09 Sep 2007 04:03:43 -0700

On 2007-09-08, Timothy Normand Miller wrote:
> > First, a simplified suboptimal solution without continuous
> > thread-switching:  When a thread requests reading from the source, one
> > of two things can happen:
> >
> >   * The source is nonempty and the next data belongs to the current
> >     thread.
> >
> >   * The source is empty or the next data belongs to another thread.
> >
> > In the former case the instruction is executed.  In the latter case, the
> > read instruction will be propagated down the pipeline as a noop, the PC
> > is reset to point to the same read instruction, and a context switch
> > happens.  Another case which could trigger a context switch, would be
> > interrupts.
> 
> We already have a branch-delay architecture.  We do this so we don't
> have to have any flow control in the pipeline.  By the time this read
> gets down to the MEMIO stage, a few other instructions have already
> been fetched and partially processed.  We would have to flush the
> pipeline, ensuring that there are no effects of those instructions.
> What if one of those following instructions is itself a branch?  What
> then?  If we're going to do this, we might as well have a design with
> flow control, exceptions, interrupts, and the like.  Think about the
> complexity necessary to do this.


I should have pointed out that I assumed register-like IO, and that the
scheduler will make its decision in parallel with the register-fetch stage,
in other words based on the output registered by the instruction-fetch
stage.  The decision is whether to turn the instruction into a noop at
the output of the register-fetch and reset the PC for the thread.  Then
I don't think flushing the pipeline is needed.
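
To make the decision above concrete, here is a minimal behavioural sketch
in Python, not RTL.  The names (Fetched, schedule, NOOP) and the model of
the source as a queue of owner thread ids are assumptions made for
illustration only; they are not part of any existing design.

```python
from collections import deque
from dataclasses import dataclass

NOOP = "noop"  # hypothetical encoding of a squashed instruction


@dataclass
class Fetched:
    thread: int  # thread that owns the fetched instruction
    pc: int      # address the instruction was fetched from
    op: str      # "read" for a source read, anything else otherwise


def schedule(fetched, source):
    """Decide, alongside register fetch, what to pass down the pipeline.

    `source` is a deque of owner thread ids, one per pending data word.
    Returns (op_to_issue, next_pc_for_thread, context_switch).
    """
    if fetched.op != "read":
        return fetched.op, fetched.pc + 1, False
    if source and source[0] == fetched.thread:
        source.popleft()  # the data belongs to this thread: consume it
        return fetched.op, fetched.pc + 1, False
    # Source empty or data owned by another thread: squash the read into
    # a noop, keep the PC on the read so it is retried, switch context.
    return NOOP, fetched.pc, True
```

Since the squash happens before the instruction has had any effect,
nothing further down the pipeline needs to be undone in this model.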

> > However, if we can multi-task the most critical code, we should try to
> > guarantee that
> >
> >     I1. The same thread never runs in two consecutive cycles.
> >
> > That way we
> >   * get rid of the delay slot,
> >   * can rip out half of the register forwarding, and
> >   * can pipeline the adder over two stages without imposing
> >     data-dependency constraints in the machine code.
> 
> Those are all nice benefits.  I'm just wondering if a half-speed
> thread will keep up with some of the chores?

Only if we could parallelize the most critical task, which I guess is DMA
command packet decoding.  That is really the thing which will determine
if this makes any sense at all.  Nevertheless, I think I can see that
this will have an impact on both the command stream and the FIFOs which
we'll be happy to avoid, at least until some time when we have a
complete working design and feel very ambitious.

> > First, with only 2 contexts, I1 predetermines that the two threads must
> > be run alternately.  That could easily mean that half of the cycles
> > will process inserted noops when both threads try to read.  Therefore,
> > with 2 threads we should abandon I1, or use a smarter read-data FIFO which
> > splits the replies into two sources, one for each thread.
> 
> We already have a proliferation of FIFOs.  I'm already suggesting too
> many.  Adding more.... ugh.

Point taken.  Muxing between only the next two entries on the source
end would go a long way.  But it may very well be that the resulting
scheduling decision would be a timing bottleneck, considering that it
needs to feed back to the next-PC of the instruction fetch; further
down the pipeline we have more time. [1]

[Insert:] Observing that we never run the same thread two cycles in a
row due to I1, we actually have one cycle to fix up the next-PC of the
thread which blocked, though I suspect it'll take a fair amount of
gates.
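
A rough Python model of that fix-up window, assuming two threads in
strict alternation.  The function name run_alternating, the arrivals
map, and the wrap-around of each program are assumptions made for this
sketch.  The fetch stage always increments the PC; a blocked read
records a fix-up which is applied during the following cycle, while the
other thread is running.

```python
from collections import deque


def run_alternating(programs, arrivals, cycles):
    """Run two threads in strict alternation (I1).

    programs: two lists of ops ("read" or anything else), wrapping around.
    arrivals: dict mapping cycle -> thread whose read data arrives then.
    Returns the list of (thread, op) issued per cycle.
    """
    pcs = [0, 0]
    fifo = deque()   # owners of pending read replies
    fixup = None     # (thread, pc) to restore, applied one cycle late
    issued = []
    for cycle in range(cycles):
        if cycle in arrivals:
            fifo.append(arrivals[cycle])
        if fixup is not None:
            # The blocked thread is not running now, so its next-PC can
            # be repaired without racing its own fetch.
            blocked, pc = fixup
            pcs[blocked] = pc
            fixup = None
        thread = cycle % 2
        pc = pcs[thread]
        op = programs[thread][pc % len(programs[thread])]
        pcs[thread] = pc + 1  # fetch increments unconditionally
        if op == "read":
            if fifo and fifo[0] == thread:
                fifo.popleft()
            else:
                op = "noop"
                fixup = (thread, pc)
        issued.append((thread, op))
    return issued
```

With programs [["read", "add"], ["mul"]] and data for thread 0 arriving
in cycle 2, thread 0 issues a noop in cycle 0, retries the read
successfully in cycle 2, and goes on to the add in cycle 4.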

> > With 4 contexts the scheduling gets more interesting.  One thing we can
> > decide right away is that
> >
> >     S1.  If there is data in the read-reply FIFO and the owner was not
> >     run on the previous cycles, then we run the owner.
> >
> > Otherwise, we know that any thread which tries to read can not proceed.
> > Further heuristic is difficult since we don't know if the next
> > instruction for a thread is a read before we fetch it.  The easiest
> > solution is probably to make an attempt, if we fail we set a
> > "pending-read" bit for that thread, and we don't re-schedule it before
> > S1 applies.  A more predictive heuristic is to keep track of how many
> 
> See, all of this makes me think about things like branch prediction
> and the like.  Branch prediction, of course, requires that we be able
> to flush the pipeline and restart.

As noted above, I'm counting on doing the scheduling decision right
after instruction fetch, so as long as we ban jumps in the delay slot, we
can just turn the wrongly fetched instruction into a harmless one.
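
For the 4-context case, the thread selection with S1 and the
"pending-read" bits might look roughly like the Python sketch below.
The function name pick_thread and the round-robin fallback are
assumptions for illustration; only S1, I1 and the pending-read bit come
from the discussion.

```python
def pick_thread(fifo_owner, prev, pending_read, n_threads=4):
    """Choose the thread to run in the next cycle.

    fifo_owner:   owner of the head of the read-reply FIFO, or None if empty.
    prev:         thread that ran in the previous cycle, or None.
    pending_read: one bool per thread, set when its read attempt failed.
    Returns a thread id, or None if no thread may run (idle cycle).
    """
    # S1: data is waiting and its owner did not run last cycle.
    if fifo_owner is not None and fifo_owner != prev:
        return fifo_owner
    # Fallback (assumed): round-robin over the other threads, skipping
    # the previous one (I1) and any thread still waiting on a read.
    start = 0 if prev is None else prev + 1
    for i in range(n_threads):
        t = (start + i) % n_threads
        if t != prev and not pending_read[t]:
            return t
    return None
```

The caller is expected to set pending_read[t] when thread t's read gets
turned into a noop, and to clear it once S1 has picked t again.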

> > If we have more than one async source to deal with, scheduling decisions
> > become more complex; however, in such a case we could probably exploit
> > the independence of the resources, so that only selected threads have
> > access to each resource.
> 
> I just don't like the idea of having to have any sort of instruction
> scheduling (besides the most trivial) in a design that we're trying to
> keep as small as possible.

I completely agree.  I'm only considering what we could do with a simple
context-switching mechanism.  The problems I see are

  * Difficult to make good use of the parallelism where it matters most
    (DMA interception).

  * Feedback from scheduling to instruction-fetch may become a
    bottleneck.  [Insert:] Maybe we can avoid this, as noted above.

  * Either we'll have a proliferation of FIFOs, we'll have imperfect
    scheduling decisions, or we have to give up on thread-alternation
    and thus the pipelining opportunities (which was the main point).