Re: OT? Upper limits of FSB
> On Jan 9, 2019, at 2:54 PM, dwight via cctalk wrote:
>
> ...
> Of course in an embedded processor you can run in kernel mode and busy wait
> if you want.

Yes, and that has a number of advantages. You get well-defined latencies, and everything the program does gets done within bounded time (if you do the simple thing of putting work limits on all your work loops). Knowing that your real-time system can't ever block a task is a very good property. By contrast, with interrupts, never mind priority schedulers, it's much harder to ensure this. I've built network switches whose data paths were coded this way, and they are both simple and reliable.

	paul
Re: OT? Upper limits of FSB
As long as things stay in the pipe, instruction decode and execution appear to take one cycle; pipe flushes are the penalty. That is where speculative execution pays off (and is also food for Meltdown- and Spectre-type security holes). Such loops are quite fast if the prediction was right. Unrolling small loops gains you only a little if the predictor is already working well for your code, and unrolling large loops, as always, gains only a small percentage. If one unrolls by a large amount, one may end up taking a cache miss, which can easily eat up any benefit of the unrolling. Before speculative execution, unrolling had a clear advantage.

Dwight

From: cctalk on behalf of Eric Korpela via cctalk
Sent: Wednesday, January 9, 2019 11:06 AM
To: ben; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB

> [quoted message trimmed; it appears in full as Eric's reply below]
Re: OT? Upper limits of FSB
On Tue, Jan 8, 2019 at 3:01 PM ben via cctalk wrote:
> I bet I/O loops throw everything off.

Even worse than you might think. For user-mode code you've got at least two context switches, which typically cost thousands of CPU cycles. On the plus side, when you start waiting for I/O the CPU will execute another context switch to resume running something else while the I/O completes. By the time you get back to your process, it's likely that its memory is out at L3 or back in main memory. Depending upon what else is going on, that can add 1 to 50 microseconds per I/O just for context switching and reloading caches. Of course, in an embedded processor you can run in kernel mode and busy-wait if you want. Even fast memory-mapped I/O (e.g. a PCIe graphics card) that doesn't trigger a page fault is going to have variable latency and will probably have special cache handling.

--
Eric Korpela
korp...@ssl.berkeley.edu
AST:7731^29u18e3
Re: OT? Upper limits of FSB
On 1/8/2019 3:51 PM, Guy Sotomayor Jr via cctalk wrote:
> Some architectures (I'm thinking of the latest Intel CPUs) have a small
> loop cache whose aim is to keep a loop entirely within that cache. [...]

I bet I/O loops throw everything off.

Ben.
Re: OT? Upper limits of FSB
Some architectures (I'm thinking of the latest Intel CPUs) have a small loop cache whose aim is to keep a loop entirely within that cache. That cache (which I believe actually holds the decoded uOps) operates at the full speed of the instruction fetch/execute cycle, i.e. you can't go faster. Both the L1 cache penalty and the instruction decode time are avoided.

TTFN - Guy

> On Jan 8, 2019, at 2:43 PM, Chuck Guzis via cctalk wrote:
> [quoted message trimmed; it appears in full as Chuck's reply below]
Re: OT? Upper limits of FSB
On 1/8/19 1:23 PM, Tapley, Mark via cctalk wrote:
> Why so (why surprising, I mean)? Understood an unrolled loop executes
> faster...

That can't always be true, can it? I'm thinking of an architecture where the instruction cache is slow to fill, multiple overlapping operations are in flight, and branch prediction assumes the branch taken. I'd say it would be very close in that case.

--Chuck
Re: OT? Upper limits of FSB
> On Jan 6, 2019, at 1:31 PM, dwight via cctalk wrote:
>
> Surprisingly, this is actually good for older languages like Forth that
> are frugal with RAM.

Why so (why surprising, I mean)? Understood: an unrolled loop executes faster, RISC instruction sets have lower information density than CISC instruction sets and therefore a bigger RAM footprint, and look-up tables are faster than long division (or working an infinite series for a transcendental function…). But I've been worried for a while that the lesson many software engineers are learning is (more RAM usage) == (faster execution), and I don't think that's a valid lesson. Dwight, thanks for pointing out the counter-example!
Re: OT? Upper limits of FSB
Probably the factor that most think limits things is the turn-around time. If we were limited to requesting one byte and waiting for that data to return, the limits of wires would be a wall. Today's serial RAMs send a burst of data rather than a word or byte at a time. These blocks of data can use multiple serial lanes at once, and the data bits don't even arrive at exactly the same time; FIFOs and parallelizers bring things back together. The latency of the first fetch is slower than it used to be for traditional fetches, but after that things are quite quick.

Surprisingly, this is actually good for older languages like Forth that are frugal with RAM. Entire applications (less data, in some cases) can sit in the CPU's cache for immediate use.

Dwight

From: cctalk on behalf of Curious Marc via cctalk
Sent: Saturday, January 5, 2019 9:40 PM
To: Jeffrey S. Worley; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB

> [quoted message trimmed; it appears in full as Marc's reply below]
Re: OT? Upper limits of FSB
Interconnects at 28 Gb/s/lane have been out for a while now, supported by quite a few chips. 56 Gb/s PAM4 is around the corner, and we run 100 Gb/s in the lab right now. Just sayin' ;-). That said, we throw in about every equalization trick we know of, PCB materials are getting quite exotic, and connectors are pretty interesting. We have to hand-hold our customers through designing their interconnect traces and connector breakouts. And you can't go too far, with increasing reliance on micro-twinax or on-board optics for longer distances and backplanes.

Marc

> On Jan 4, 2019, at 11:02 PM, Jeffrey S. Worley via cctalk wrote:
> [quoted message trimmed; it appears in full as the original post below]
Re: OT? Upper limits of FSB
On Sat, Jan 05, 2019 at 02:02:35AM -0500, Jeffrey S. Worley via cctalk wrote:
> [...] So here's the question. Is maximum fsb on standard, non-optical bus
> still limited to a maximum of a couple of hundred megahertz, or did something
> happen in the last decade or two that changed things dramatically? [...]

Yes to both questions. High-speed computer systems no longer resemble the simple diagrams in computer science textbooks, where there is a CPU with a parallel bus attached to memory and I/O devices like it's still the 1970s. Sadly, the speed of light has stubbornly failed to increase in line with Moore's Law, so we've had to reduce the length of busses instead.

The PC's front-side bus *was* such a 1970s-style bus; by the 2000s, however, it had withered from an 8 MHz bus snaking all over the board and into and out of ISA cards to a few hundred MHz between just the CPU and the northbridge. To go faster still, the northbridge's functionality moved on-die, and the FSB is now ancient history. (If antiquity means "before 2010".)

In general, we don't bother with parallel busses any more, just point-to-point self-clocked serial links which can run into the gigahertz range. The bandwidth is increased further, if necessary, by adding more links, but this is not the same as a parallel bus: each link has its own independent clock, and that adds a lot of extra complexity to the receiver.
Re: OT? Upper limits of FSB
I'll assume you've read: https://en.wikipedia.org/wiki/Front-side_bus

Even though synchronization base clocks have remained low, parallel buses can run up into the low-GHz range (sub-4) in terms of data-line transitions per second, with as many as 128 parallel wires in sync. It's not just FSBs; memory is the same way. Parasitic effects do limit this, but routing 192, 288, etc. wires in parallel with matched trace lengths on a PCB, so that arrival times stay in phase, has been a large problem as well.

So the trend is that rather than running more and more wires in parallel with synchronized transfers across the entire span, spans are being broken up into smaller and smaller units that run either unsynchronized or with their own timing delays. Even with memory, starting with DDR3, each byte group is 'trained' separately by the controller and can run at a different phase offset to match its trace-group routing. And parallel FSBs have been replaced with a number of differential pairs running independently, with the data queued and reassembled on the receiving end (QPI and HyperTransport). The same trend appears in I/O buses starting with PCIe: instead of 64 wires at 66 MHz in PCI-X, a dual-lane PCIe gen 1.0 link can handle a similar load with just 6 wires.

-Alan

On 2019-01-05 02:02, Jeffrey S. Worley via cctalk wrote:
> [quoted message trimmed; it appears in full as the original post below]
Re: OT? Upper limits of FSB
On Sat, Jan 5, 2019, 00:02 Jeffrey S. Worley via cctalk <cctalk@classiccmp.org> wrote:
> Apropos of nothing, I've been confused for some time regarding maximum
> clock rates for local bus.
> [...]
> Late model computers boast multi-hundred to multi gigahertz fsb's. Am
> I wrong in thinking this is an aggregate of several serial lines
> running at 1 to 200mhz? No straight answer has presented on searches
> online.

Each individual lane of PCIe gen 3 has one transmit and one receive differential pair, each operating at a serial rate of 8 Gbps. Gen 4 will be 16 Gbps. About 1.5% of that gets taken up by the 128b/130b line-code overhead. There is additional overhead consumed by transaction framing, which means that long burst transfers get much higher performance than individual 64-bit or smaller reads and writes.

Doing those data rates with a multi-drop bus like legacy ISA or PCI would be almost impossible; parallel multi-drop buses maxed out below 200 MHz. There are two tricks that make PCIe work:

1) PCIe is not a bus. It consists of strictly point-to-point links with tightly controlled impedance.

2) PCIe multi-lane logical links don't assume any phase relationships between the lanes as a parallel bus would, so there is little or no problem with timing skew between lanes. The lanes are serialized and deserialized separately at the endpoints, and higher-level logic in the endpoints is responsible for distributing the data across the lanes of multi-lane links.

If you have a motherboard with a 16-lane PCIe slot, two 4-lane slots, and two 1-lane slots, every one of the 26 lanes is an electrically separate set of point-to-point receive and transmit differential pairs. The way the chipset is wired makes all of these electrically independent PCIe lanes collectively act like a "bus" as viewed by the processor (or north bridge) and PCIe devices.
OT? Upper limits of FSB
Apropos of nothing, I've been confused for some time regarding maximum clock rates for local bus.

My admittedly old information, which comes from the 3rd ed. of "High Performance Computer Architecture", a course I audited, indicates a maximum speed on the order of 1 GHz for very, very short trace lengths. Late-model computers boast multi-hundred-megahertz to multi-gigahertz FSBs. Am I wrong in thinking this is an aggregate of several serial lines running at 1 to 200 MHz? No straight answer has presented itself in online searches.

So here's the question. Is the maximum FSB on a standard, non-optical bus still limited to a couple of hundred megahertz, or did something happen in the last decade or two that changed things dramatically? I understand, or at least think I do, that these ridiculously high frequency claims would not survive capacitance and RFI issues. When my brother claimed a 3.2 GHz bus speed for his machine, I told him that was wrong, impossible for practical purposes; that it had to be an aggregate figure, a 'Pentium rating' sort of number rather than the actual clock speed. I envision switching bus tech akin to present networking, paralleled to sidestep the limit while keeping pin and trace counts low. Something like the PCIe 'lane' scheme in present use? This is surmise based on my own experience.

When I was current, the way out of this limitation was fiber optics for the bus. This was used in supercomputing and allowed interconnects of longer length at ridiculous speeds.

Thanks for allowing me to entertain this question. Though it is not specifically a classic computer question, it does relate to development and history.

Best,

Technoid Mutant (Jeff Worley)