Re: OT? Upper limits of FSB

2019-01-09 Thread Paul Koning via cctalk



> On Jan 9, 2019, at 2:54 PM, dwight via cctalk  wrote:
> 
> ...
> Of course in an embedded processor you can run in kernel mode and busy wait
> if you want.

Yes, and that has a number of advantages.  You get well-defined latencies,
and everything the program does gets done within bounded time (if you do
the simple thing of putting work limits on all your work loops).  Knowing
that your real-time system can't ever block a task is a very good property.
By contrast, with interrupts, never mind priority schedulers, it's much
harder to ensure this.
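
For example, the shape of such a loop, as a sketch (the device registers
and helper functions here are hypothetical):

    #include <stdint.h>

    #define MAX_PKTS_PER_PASS 8    /* work limit: bounds time spent here */

    /* Hypothetical memory-mapped device registers and handlers. */
    extern volatile uint32_t *rx_status;    /* bit 0 = packet ready */
    extern volatile uint32_t *rx_data;
    void handle_packet(uint32_t w);
    void run_other_tasks(void);             /* each with its own work limit */

    void main_loop(void)
    {
        for (;;) {
            /* Drain at most MAX_PKTS_PER_PASS packets, then move on, so
               every other task still gets the CPU within a bounded time. */
            for (int n = 0; n < MAX_PKTS_PER_PASS && (*rx_status & 1); n++)
                handle_packet(*rx_data);
            run_other_tasks();
        }
    }

Since no task can be preempted, the worst-case pass time is just the sum
of the per-task work limits, which you can compute by inspection.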

I've built network switches whose data paths were coded this way, and they are 
both simple and reliable.

paul



Re: OT? Upper limits of FSB

2019-01-09 Thread dwight via cctalk
As long as things stay in the pipe, instruction decode and execution appear
to complete in one cycle; pipe flushes are the penalty.  That is where
speculative execution pays off (it is also food for Meltdown- and
Spectre-type security holes).  Such loops are quite fast if the prediction
was right.
Unrolling small loops only gives you a small advantage when the predictor
is working well for your code, and large unrolled loops likewise only gain
a small amount percentage-wise.  If one unrolls by a large amount, one may
end up taking cache misses, which can easily eat up any benefit of the
unrolling.  Before speculative execution, unrolling had a clear advantage.
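
A sketch of the trade (assumes n is a multiple of 4):

    /* Rolled: one predicted-taken branch per element. */
    long sum(const int *a, int n)
    {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Unrolled 4x: one branch per four elements, but 4x the code size.
       With a good predictor the rolled version is nearly as fast, and
       unrolling too far means the loop no longer fits in the caches. */
    long sum4(const int *a, int n)    /* assumes n % 4 == 0 */
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }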
Dwight


From: cctalk  on behalf of Eric Korpela via 
cctalk 
Sent: Wednesday, January 9, 2019 11:06 AM
To: ben; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB

On Tue, Jan 8, 2019 at 3:01 PM ben via cctalk  wrote:

> I bet I/O loops throw everything off.
>

Even worse than you might think.  For user-mode code you've got at least
two context switches, which typically cost thousands of CPU cycles.  On the
plus side, when you start waiting for the I/O the CPU will execute another
context switch to resume running something else while the I/O completes.
By the time you get back to your process, its memory has likely been
evicted to L3 or back to main memory.  Depending on what else is going on,
that can add 1 to 50 microseconds per I/O just for context switching and
reloading caches.

Of course in an embedded processor you can run in kernel mode and busy wait
if you want.

Even fast memory-mapped I/O (e.g. a PCIe graphics card) that doesn't
trigger a page fault is going to have variable latency and will probably
have special cache handling.


--
Eric Korpela
korp...@ssl.berkeley.edu
AST:7731^29u18e3


Re: OT? Upper limits of FSB

2019-01-09 Thread Eric Korpela via cctalk
On Tue, Jan 8, 2019 at 3:01 PM ben via cctalk  wrote:

> I bet I/O loops throw everything off.
>

Even worse than you might think.  For user-mode code you've got at least
two context switches, which typically cost thousands of CPU cycles.  On the
plus side, when you start waiting for the I/O the CPU will execute another
context switch to resume running something else while the I/O completes.
By the time you get back to your process, its memory has likely been
evicted to L3 or back to main memory.  Depending on what else is going on,
that can add 1 to 50 microseconds per I/O just for context switching and
reloading caches.
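
You can get a feel for the fixed cost by timing a read that never blocks
(a sketch, assuming Linux; /dev/zero returns immediately, so this measures
only the kernel entry/exit, not the scheduler or cache-refill costs on
top):

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        enum { N = 100000 };
        char buf;
        int fd = open("/dev/zero", O_RDONLY);
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++)
            read(fd, &buf, 1);    /* one user->kernel->user round trip */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per read()\n", ns / N);
        close(fd);
        return 0;
    }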

Of course in an embedded processor you can run in kernel mode and busy wait
if you want.

Even fast memory-mapped I/O (e.g. a PCIe graphics card) that doesn't
trigger a page fault is going to have variable latency and will probably
have special cache handling.


-- 
Eric Korpela
korp...@ssl.berkeley.edu
AST:7731^29u18e3


Re: OT? Upper limits of FSB

2019-01-08 Thread ben via cctalk

On 1/8/2019 3:51 PM, Guy Sotomayor Jr via cctalk wrote:

> Some architectures (I’m thinking of the latest Intel CPUs) have a small
> loop cache whose aim is to keep a loop entirely within that cache.  That
> cache operates at the full speed of instruction fetch/execute (you can’t
> go faster); I think it actually holds the decoded uops, so both the L1
> cache penalty and the instruction decode time are avoided.
>
> TTFN - Guy


I bet I/O loops throw everything off.
Ben.



Re: OT? Upper limits of FSB

2019-01-08 Thread Guy Sotomayor Jr via cctalk
Some architectures (I’m thinking of the latest Intel CPUs) have a small
loop cache whose aim is to keep a loop entirely within that cache.  That
cache operates at the full speed of instruction fetch/execute (you can’t
go faster); I think it actually holds the decoded uops, so both the L1
cache penalty and the instruction decode time are avoided.

TTFN - Guy

> On Jan 8, 2019, at 2:43 PM, Chuck Guzis via cctalk  
> wrote:
> 
> On 1/8/19 1:23 PM, Tapley, Mark via cctalk wrote:
> 
>> Why so (why surprising, I mean)? Understood an unrolled loop executes
>> faster...
> 
> That can't always be true, can it?
> 
> I'm thinking of an architecture where the instruction cache is slow to
> fill, multiple overlapping operations are involved, and branch prediction
> assumes the branch taken.  I'd say it was very close in that case.
> 
> --Chuck
> 



Re: OT? Upper limits of FSB

2019-01-08 Thread Chuck Guzis via cctalk
On 1/8/19 1:23 PM, Tapley, Mark via cctalk wrote:

> Why so (why surprising, I mean)? Understood an unrolled loop executes
> faster...

That can't always be true, can it?

I'm thinking of an architecture where the instruction cache is slow to
fill, multiple overlapping operations are involved, and branch prediction
assumes the branch taken.  I'd say it was very close in that case.

--Chuck



Re: OT? Upper limits of FSB

2019-01-08 Thread Tapley, Mark via cctalk
> On Jan 6, 2019, at 1:31 PM, dwight via cctalk  wrote:
> 
> Surprisingly, this is actually good for older languages like Forth that are 
> frugal with RAM.

Why so (why surprising, I mean)? Understood that an unrolled loop executes
faster, that RISC instruction sets have lower information density than CISC
instruction sets (and therefore a bigger RAM footprint), and that look-up
tables are faster than long division (or than working an infinite series
for a transcendental function…). But I’ve been worried for a while that the
lesson many software engineers are learning is

(more RAM usage) == (faster execution)

and I don’t think that’s a valid lesson. Dwight, thanks for pointing out the 
counter-example!
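
(For concreteness, the table-versus-compute trade I have in mind, as a
sketch; the table size and nearest-entry lookup are purely illustrative:)

    #include <math.h>

    #define TBL 256
    static double sin_table[TBL];    /* 2 KB of RAM traded for speed */

    void init_table(void)
    {
        for (int i = 0; i < TBL; i++)
            sin_table[i] = sin(i * 2.0 * M_PI / TBL);
    }

    /* Coarse nearest-entry lookup for x in [0, 2*pi).  It only beats
       computing sin() while the table stays in cache, which is exactly
       the assumption that breaks down when everything is coded this way. */
    double fast_sin(double x)
    {
        unsigned idx = (unsigned)(x * (TBL / (2.0 * M_PI))) & (TBL - 1);
        return sin_table[idx];
    }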

Re: OT? Upper limits of FSB

2019-01-06 Thread dwight via cctalk
Probably the factor that most think limits things is the turn-around time.
If we were still limited to requesting one byte and waiting for that data
to return, the limits of the wires would be a wall.  Today's serial RAMs
send a burst of data rather than a word or byte at a time.  These blocks of
data can use multiple serial lanes at once, where the data bits don't even
arrive at exactly the same time; FIFOs and deserializers bring things back
together.  The latency of the first fetch is higher than it used to be for
traditional fetches, but after that things are quite quick.  Surprisingly,
this is actually good for older languages like Forth that are frugal with
RAM: entire applications (less data, in some cases) can sit in the CPU's
cache for immediate use.
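
The arithmetic, with purely illustrative numbers (not from any datasheet):

    #include <stdio.h>

    int main(void)
    {
        /* Assume ~50 ns to the first data, then 64-byte bursts arriving
           at ~25.6 GB/s (2.5 ns per burst). */
        double first_word_ns = 50.0;
        double per_burst_ns  = 2.5;
        double line_bytes    = 64.0;

        for (int bursts = 1; bursts <= 64; bursts *= 4) {
            double ns  = first_word_ns + bursts * per_burst_ns;
            double gbs = bursts * line_bytes / ns;    /* bytes/ns = GB/s */
            printf("%2d bursts: %6.1f ns total, %5.1f GB/s effective\n",
                   bursts, ns, gbs);
        }
        return 0;
    }

The longer the transfer, the more that first-fetch latency is amortized.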
Dwight

From: cctalk  on behalf of Curious Marc via 
cctalk 
Sent: Saturday, January 5, 2019 9:40 PM
To: Jeffrey S. Worley; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB

Interconnects at 28 Gb/s/lane have been out for a while now, supported by
quite a few chips.  56 Gb/s PAM4 is around the corner, and we run 100 Gb/s
in the lab right now.  Just sayin’ ;-).  That said, we throw in about every
equalization trick we know of, PCB materials are getting quite exotic, and
connectors are pretty interesting.  We have to hand-hold our customers to
design their interconnect traces and connector breakouts.  And you can’t go
very far on copper traces, hence the increasing reliance on micro-twinax or
on-board optics for longer distances and backplanes.
Marc

> On Jan 4, 2019, at 11:02 PM, Jeffrey S. Worley via cctalk 
>  wrote:
>
> [...] So here's the question.  Is the maximum FSB speed on a standard,
> non-optical bus still limited to a couple of hundred megahertz, or did
> something happen in the last decade or two that changed things
> dramatically? [...]


Re: OT? Upper limits of FSB

2019-01-05 Thread Curious Marc via cctalk
Interconnects at 28 Gb/s/lane have been out for a while now, supported by
quite a few chips.  56 Gb/s PAM4 is around the corner, and we run 100 Gb/s
in the lab right now.  Just sayin’ ;-).  That said, we throw in about every
equalization trick we know of, PCB materials are getting quite exotic, and
connectors are pretty interesting.  We have to hand-hold our customers to
design their interconnect traces and connector breakouts.  And you can’t go
very far on copper traces, hence the increasing reliance on micro-twinax or
on-board optics for longer distances and backplanes.
Marc

> On Jan 4, 2019, at 11:02 PM, Jeffrey S. Worley via cctalk 
>  wrote:
> 
> [...] So here's the question.  Is the maximum FSB speed on a standard,
> non-optical bus still limited to a couple of hundred megahertz, or did
> something happen in the last decade or two that changed things
> dramatically? [...]


Re: OT? Upper limits of FSB

2019-01-05 Thread Peter Corlett via cctalk
On Sat, Jan 05, 2019 at 02:02:35AM -0500, Jeffrey S. Worley via cctalk wrote:
> [...] So here's the question. Is maximum fsb on standard, non-optical bus
> still limited to a maximum of a couple of hundred megahertz, or did something
> happen in the last decade or two that changed things dramatically? [...]

Yes to both questions.

High-speed computer systems no longer resemble the simple diagrams in computer
science textbooks where there is a CPU with a parallel bus attached to memory
and I/O devices like it's still the 1970s. Sadly, the speed of light has
stubbornly failed to increase in line with Moore's Law, so we've had to reduce
the length of busses instead.

The PC's front-side bus *was* such a 1970s-style bus, but by the 2000s it
had withered from an 8MHz bus snaking all over the board and into and out
of ISA cards to a few hundred MHz between just the CPU and the northbridge.
To go faster still, the northbridge's functionality moved on-die, and the
FSB is now ancient history. (If antiquity means "before 2010".)

In general, we don't bother with parallel busses any more, just point-to-point
self-clocked serial links which can run into the gigahertz range. The bandwidth
is increased further if necessary by adding more links, but this is not the
same as a parallel bus as each link has its own independent clock and that adds
a lot of extra complexity to the receiver.



Re: OT? Upper limits of FSB

2019-01-05 Thread alan--- via cctalk



I'll assume you've read:

https://en.wikipedia.org/wiki/Front-side_bus

Even though the base synchronization clocks have remained low, parallel 
buses can run up into the low-GHz range (sub-4) in terms of data-line 
transitions per second, with as many as 128 parallel wires in sync.  It's 
not just FSBs; memory is the same way.  Parasitic effects do affect the 
limit, but routing 192, 288, etc. wires in parallel with matched trace 
lengths on a PCB, to get arrival times in phase, has been a large problem 
as well.  So the trend is that rather than running more and more wires in 
parallel with synchronized transfers across the entire span, spans are 
being broken up into smaller and smaller units that run either 
unsynchronized or with their own timing delays.  Even with memory, 
starting with DDR3, each byte group is 'trained' separately by the 
controller and can run at a different phase offset to match its trace 
group's routing.  And parallel FSBs have been replaced with a number of 
differential pairs running independently, with the data queued and 
reassembled on the receiving end (QPI & HyperTransport).


Same trend in I/O buses starting with PCIe.  Instead of 64 wires @ 66 
MHz in PCI-X, a dual lane PCIe gen 1.0 link can handle a similar load 
with just 6 wires.
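
The peak-rate arithmetic behind that comparison, as a sketch (ignoring
protocol overhead, and taking PCI-X at the 66 MHz grade given above):

    #include <stdio.h>

    int main(void)
    {
        /* PCI-X: 64-bit parallel bus at 66 MHz, shared by both directions. */
        double pcix_mbs = 64.0 / 8.0 * 66.0;           /* 528 MB/s */

        /* PCIe 1.0: 2.5 GT/s per lane per direction, 8b/10b coded
           (10 bits on the wire per byte of payload). */
        double pcie_lane_mbs = 2500.0 / 10.0;          /* 250 MB/s */
        double pcie_x2_mbs   = 2.0 * pcie_lane_mbs;    /* 500 MB/s each way */

        printf("PCI-X 64 @ 66 MHz: %.0f MB/s, half duplex\n", pcix_mbs);
        printf("PCIe 1.0 x2:       %.0f MB/s each direction\n", pcie_x2_mbs);
        return 0;
    }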


-Alan


On 2019-01-05 02:02, Jeffrey S. Worley via cctalk wrote:

> [...] So here's the question.  Is the maximum FSB speed on a standard,
> non-optical bus still limited to a couple of hundred megahertz, or did
> something happen in the last decade or two that changed things
> dramatically? [...]


Re: OT? Upper limits of FSB

2019-01-05 Thread Eric Smith via cctalk
On Sat, Jan 5, 2019, 00:02 Jeffrey S. Worley via cctalk <cctalk@classiccmp.org> wrote:

> Apropos of nothing, I've been confused for some time regarding maximum
> clock rates for a local bus.
>
> My admittedly old information, which comes from the 3rd ed. of "High
> Performance Computer Architecture", a course I audited, indicates a
> maximum speed on the order of 1 GHz for very, very short trace lengths.
>
> Late-model computers boast multi-hundred-megahertz to multi-gigahertz
> FSBs.  Am I wrong in thinking this is an aggregate of several serial
> lines running at 1 to 200 MHz?  No straight answer has presented itself
> in searches online.
>

Each individual lane of PCIe gen 3 has one transmit and one receive
differential pair, each operating at a serial rate of 8 Gbps.  Gen 4 will
be 16 Gbps.  About 1.5% of that is taken up by the 128b/130b line-code
overhead.  There is additional overhead from transaction framing, which
means that long burst transfers get much higher performance than individual
64-bit-or-smaller reads and writes.
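
The line-code arithmetic, as a quick sketch:

    #include <stdio.h>

    int main(void)
    {
        double raw_gbps = 8.0;                       /* gen 3, per lane/direction */
        double coded    = raw_gbps * 128.0 / 130.0;  /* 128b/130b framing */

        printf("line-code overhead: %.2f%%\n", 100.0 * 2.0 / 130.0);
        printf("usable: %.3f Gb/s = %.1f MB/s per lane\n",
               coded, coded * 1000.0 / 8.0);
        return 0;
    }

That prints an overhead of about 1.54% and roughly 985 MB/s per lane,
before the transaction-layer framing is subtracted.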

Doing those data rates with a multi-drop bus like legacy ISA or PCI would
be almost impossible. Parallel multi-drop busses maxed out below 200 MHz.
There are two tricks that make PCIe work:

1) PCIe is not a bus. It consists of strictly point-to-point links, with
tightly controlled impedance.

2) PCIe multi-lane logical links don't assume any phase relationships
between the lanes as a parallel bus would, so there is little or no problem
with timing skew between lanes. The lanes are serialized and deserialized
separately at the endpoints, and higher-level logic in the endpoints is
responsible for distributing the data across the lanes of multi-lane links.
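
Conceptually the distribution is just round-robin striping, something like
this sketch (not the actual link-layer logic, which also handles per-lane
deskew, scrambling, and framing in hardware):

    #include <stddef.h>
    #include <stdint.h>

    #define LANES 4

    /* Stripe a payload across LANES independent serial links, byte by
       byte; the receiver runs the mirror image to reassemble. */
    void stripe(const uint8_t *data, size_t len,
                uint8_t *lane_buf[LANES], size_t lane_len[LANES])
    {
        for (int l = 0; l < LANES; l++)
            lane_len[l] = 0;
        for (size_t i = 0; i < len; i++) {
            int l = (int)(i % LANES);
            lane_buf[l][lane_len[l]++] = data[i];
        }
    }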

If you have a motherboard with a 16 lane PCIe slot, two 4 lane slots, and
two 1 lane slots, every one of the 26 lanes is an electrically separate set
of point-to-point receive and transmit differential pairs.

The way the chipset is wired makes all of these electrically independent
PCIe lanes collectively act like a "bus" as viewed by the processor (or
north bridge) and PCIe devices.


OT? Upper limits of FSB

2019-01-04 Thread Jeffrey S. Worley via cctalk
Apropos of nothing, I've been confused for some time regarding maximum
clock rates for a local bus.

My admittedly old information, which comes from the 3rd ed. of "High
Performance Computer Architecture", a course I audited, indicates a
maximum speed on the order of 1 GHz for very, very short trace lengths.

Late-model computers boast multi-hundred-megahertz to multi-gigahertz
FSBs.  Am I wrong in thinking this is an aggregate of several serial
lines running at 1 to 200 MHz?  No straight answer has presented itself
in searches online.

So here's the question.  Is the maximum FSB speed on a standard,
non-optical bus still limited to a couple of hundred megahertz, or did
something happen in the last decade or two that changed things
dramatically?  I understand, or at least think I do, that such
ridiculously high frequency claims would not survive capacitance and RFI
issues.  When my brother claimed a 3.2 GHz bus speed for his machine, I
just told him that was wrong, impossible for practical purposes; that it
had to be an aggregate figure, a 'Pentium rating' sort of number rather
than the actual clock speed.  I envision switched bus tech akin to
present networking, paralleled to sidestep the limit while keeping pin
and trace counts low.  Something like the PCIe 'lane' scheme in present
use?  This is surmise based on my own experience.

When I was current, the way out of this limitation was fiber optics for
the bus.  This was used in supercomputing and allowed interconnects of
longer length at ridiculous speeds.

Thanks for allowing me to entertain this question.  Though it is not
specifically a classic-computer question, it does relate to development
and history.



Best,

Technoid Mutant (Jeff Worley)