Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Wed, 07 Jun 2017 00:35:04 PDT (-0700), Arnd Bergmann wrote: > On Tue, Jun 6, 2017 at 10:53 PM, Palmer Dabbeltwrote: >> On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote: >>> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote: On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: > On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt > wrote: >> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt >>> wrote: > >>> Also, it would be good to replace the multiply+div64 >>> with a single multiplication here, see how x86 and arm do it >>> (for the tsc/__timer_delay case). >> >> Makes sense. I think this should do it >> >> >> https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 >> >> but I'm finding this hard to test as this only works for 2ms sleeps. It >> seems >> at least in the right ballpark > > + if (usecs > MAX_UDELAY_US) { > + __delay((u64)usecs * riscv_timebase / 100ULL); > + return; > + } > > You still do the 64-bit division here. What I meant is to completely > avoid the division and use a multiply+shift. The goal here was to avoid the error case that ARM has on overflow and instead just delay for the requested time. This should only divide when the delay is >=2ms, so the division won't cost much in comparison. The normal case should have no division in it. I can copy ARM's error handling if you think that's better, but it seemed more complicated than just computing the correct answer. >>> >>> I think the intention originally was to avoid overflowing the 32-bit >>> argument in >>> >>> void __delay(unsigned long cycles) >>> >>> If you need to delay for more than 4 billion clocksource cycles, >>> your code is still broken. >> >> Maybe I'm crazy, but I thought the goal was to avoid overflowing on the >> multiply. Specifically, the code looks like >> >> udelay(long input) { >> long a = input * MUL_VAL; >> long b = a >> SHIFT_VAL; >> __delay(b); >> } >> >> so the place there's extra overflow is at computing a, not b (the input to >> __delay). When I modified the ARM code I went and recalculated the point at >> which the multiply would overflow and it matched the value from the ARM code, >> which is 2000us. > > Ah, that's right. But then you might run into the next overflow at a larger > delay interval. Well, that requires a huge delay as there's a (u64) cast in there. The timebase is a u32 which puts the delay at an hour, which seems unlikely :). > >> While I can buy the argument that 2000us is still too long, the real reason I >> wrote the code this way is because I thought it was easier than having an >> error >> case. If you think the error is better then I'll do it that way. > > It's probably ok as long as the complexity is not in the inline wrapper, and > you > avoid the division for small delays. I put an unlikely() around the comparison and it looks like the complier does well enough: there are no references to riscv_timebase outside of udelay (so it's not inline anywhere), and there's just one branch on the hot path to test for overflow (which is probably free anyway, as it's in the shadow of a mul and never taken).
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Wed, 07 Jun 2017 00:35:04 PDT (-0700), Arnd Bergmann wrote: > On Tue, Jun 6, 2017 at 10:53 PM, Palmer Dabbelt wrote: >> On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote: >>> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote: On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: > On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt > wrote: >> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt >>> wrote: > >>> Also, it would be good to replace the multiply+div64 >>> with a single multiplication here, see how x86 and arm do it >>> (for the tsc/__timer_delay case). >> >> Makes sense. I think this should do it >> >> >> https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 >> >> but I'm finding this hard to test as this only works for 2ms sleeps. It >> seems >> at least in the right ballpark > > + if (usecs > MAX_UDELAY_US) { > + __delay((u64)usecs * riscv_timebase / 100ULL); > + return; > + } > > You still do the 64-bit division here. What I meant is to completely > avoid the division and use a multiply+shift. The goal here was to avoid the error case that ARM has on overflow and instead just delay for the requested time. This should only divide when the delay is >=2ms, so the division won't cost much in comparison. The normal case should have no division in it. I can copy ARM's error handling if you think that's better, but it seemed more complicated than just computing the correct answer. >>> >>> I think the intention originally was to avoid overflowing the 32-bit >>> argument in >>> >>> void __delay(unsigned long cycles) >>> >>> If you need to delay for more than 4 billion clocksource cycles, >>> your code is still broken. >> >> Maybe I'm crazy, but I thought the goal was to avoid overflowing on the >> multiply. Specifically, the code looks like >> >> udelay(long input) { >> long a = input * MUL_VAL; >> long b = a >> SHIFT_VAL; >> __delay(b); >> } >> >> so the place there's extra overflow is at computing a, not b (the input to >> __delay). When I modified the ARM code I went and recalculated the point at >> which the multiply would overflow and it matched the value from the ARM code, >> which is 2000us. > > Ah, that's right. But then you might run into the next overflow at a larger > delay interval. Well, that requires a huge delay as there's a (u64) cast in there. The timebase is a u32 which puts the delay at an hour, which seems unlikely :). > >> While I can buy the argument that 2000us is still too long, the real reason I >> wrote the code this way is because I thought it was easier than having an >> error >> case. If you think the error is better then I'll do it that way. > > It's probably ok as long as the complexity is not in the inline wrapper, and > you > avoid the division for small delays. I put an unlikely() around the comparison and it looks like the complier does well enough: there are no references to riscv_timebase outside of udelay (so it's not inline anywhere), and there's just one branch on the hot path to test for overflow (which is probably free anyway, as it's in the shadow of a mul and never taken).
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, Jun 6, 2017 at 10:53 PM, Palmer Dabbeltwrote: > On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote: >> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote: >>> On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: > On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt >> wrote: >> Also, it would be good to replace the multiply+div64 >> with a single multiplication here, see how x86 and arm do it >> (for the tsc/__timer_delay case). > > Makes sense. I think this should do it > > > https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 > > but I'm finding this hard to test as this only works for 2ms sleeps. It > seems > at least in the right ballpark + if (usecs > MAX_UDELAY_US) { + __delay((u64)usecs * riscv_timebase / 100ULL); + return; + } You still do the 64-bit division here. What I meant is to completely avoid the division and use a multiply+shift. >>> >>> The goal here was to avoid the error case that ARM has on overflow and >>> instead >>> just delay for the requested time. This should only divide when the delay >>> is =2ms, so the division won't cost much in comparison. >>> >>> The normal case should have no division in it. >>> >>> I can copy ARM's error handling if you think that's better, but it seemed >>> more >>> complicated than just computing the correct answer. >> >> I think the intention originally was to avoid overflowing the 32-bit >> argument in >> >> void __delay(unsigned long cycles) >> >> If you need to delay for more than 4 billion clocksource cycles, >> your code is still broken. > > Maybe I'm crazy, but I thought the goal was to avoid overflowing on the > multiply. Specifically, the code looks like > > udelay(long input) { > long a = input * MUL_VAL; > long b = a >> SHIFT_VAL; > __delay(b); > } > > so the place there's extra overflow is at computing a, not b (the input to > __delay). When I modified the ARM code I went and recalculated the point at > which the multiply would overflow and it matched the value from the ARM code, > which is 2000us. Ah, that's right. But then you might run into the next overflow at a larger delay interval. > While I can buy the argument that 2000us is still too long, the real reason I > wrote the code this way is because I thought it was easier than having an > error > case. If you think the error is better then I'll do it that way. It's probably ok as long as the complexity is not in the inline wrapper, and you avoid the division for small delays. Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, Jun 6, 2017 at 10:53 PM, Palmer Dabbelt wrote: > On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote: >> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote: >>> On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: > On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt >> wrote: >> Also, it would be good to replace the multiply+div64 >> with a single multiplication here, see how x86 and arm do it >> (for the tsc/__timer_delay case). > > Makes sense. I think this should do it > > > https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 > > but I'm finding this hard to test as this only works for 2ms sleeps. It > seems > at least in the right ballpark + if (usecs > MAX_UDELAY_US) { + __delay((u64)usecs * riscv_timebase / 100ULL); + return; + } You still do the 64-bit division here. What I meant is to completely avoid the division and use a multiply+shift. >>> >>> The goal here was to avoid the error case that ARM has on overflow and >>> instead >>> just delay for the requested time. This should only divide when the delay >>> is =2ms, so the division won't cost much in comparison. >>> >>> The normal case should have no division in it. >>> >>> I can copy ARM's error handling if you think that's better, but it seemed >>> more >>> complicated than just computing the correct answer. >> >> I think the intention originally was to avoid overflowing the 32-bit >> argument in >> >> void __delay(unsigned long cycles) >> >> If you need to delay for more than 4 billion clocksource cycles, >> your code is still broken. > > Maybe I'm crazy, but I thought the goal was to avoid overflowing on the > multiply. Specifically, the code looks like > > udelay(long input) { > long a = input * MUL_VAL; > long b = a >> SHIFT_VAL; > __delay(b); > } > > so the place there's extra overflow is at computing a, not b (the input to > __delay). When I modified the ARM code I went and recalculated the point at > which the multiply would overflow and it matched the value from the ARM code, > which is 2000us. Ah, that's right. But then you might run into the next overflow at a larger delay interval. > While I can buy the argument that 2000us is still too long, the real reason I > wrote the code this way is because I thought it was easier than having an > error > case. If you think the error is better then I'll do it that way. It's probably ok as long as the complexity is not in the inline wrapper, and you avoid the division for small delays. Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote: > On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbeltwrote: >> On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: >>> On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: > On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt > wrote: >>> > Also, it would be good to replace the multiply+div64 > with a single multiplication here, see how x86 and arm do it > (for the tsc/__timer_delay case). Makes sense. I think this should do it https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 but I'm finding this hard to test as this only works for 2ms sleeps. It seems at least in the right ballpark >>> >>> + if (usecs > MAX_UDELAY_US) { >>> + __delay((u64)usecs * riscv_timebase / 100ULL); >>> + return; >>> + } >>> >>> You still do the 64-bit division here. What I meant is to completely >>> avoid the division and use a multiply+shift. >> >> The goal here was to avoid the error case that ARM has on overflow and >> instead >> just delay for the requested time. This should only divide when the delay is >>>=2ms, so the division won't cost much in comparison. >> >> The normal case should have no division in it. >> >> I can copy ARM's error handling if you think that's better, but it seemed >> more >> complicated than just computing the correct answer. > > I think the intention originally was to avoid overflowing the 32-bit > argument in > > void __delay(unsigned long cycles) > > If you need to delay for more than 4 billion clocksource cycles, > your code is still broken. Maybe I'm crazy, but I thought the goal was to avoid overflowing on the multiply. Specifically, the code looks like udelay(long input) { long a = input * MUL_VAL; long b = a >> SHIFT_VAL; __delay(b); } so the place there's extra overflow is at computing a, not b (the input to __delay). When I modified the ARM code I went and recalculated the point at which the multiply would overflow and it matched the value from the ARM code, which is 2000us. While I can buy the argument that 2000us is still too long, the real reason I wrote the code this way is because I thought it was easier than having an error case. If you think the error is better then I'll do it that way.
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote: > On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote: >> On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: >>> On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: > On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt > wrote: >>> > Also, it would be good to replace the multiply+div64 > with a single multiplication here, see how x86 and arm do it > (for the tsc/__timer_delay case). Makes sense. I think this should do it https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 but I'm finding this hard to test as this only works for 2ms sleeps. It seems at least in the right ballpark >>> >>> + if (usecs > MAX_UDELAY_US) { >>> + __delay((u64)usecs * riscv_timebase / 100ULL); >>> + return; >>> + } >>> >>> You still do the 64-bit division here. What I meant is to completely >>> avoid the division and use a multiply+shift. >> >> The goal here was to avoid the error case that ARM has on overflow and >> instead >> just delay for the requested time. This should only divide when the delay is >>>=2ms, so the division won't cost much in comparison. >> >> The normal case should have no division in it. >> >> I can copy ARM's error handling if you think that's better, but it seemed >> more >> complicated than just computing the correct answer. > > I think the intention originally was to avoid overflowing the 32-bit > argument in > > void __delay(unsigned long cycles) > > If you need to delay for more than 4 billion clocksource cycles, > your code is still broken. Maybe I'm crazy, but I thought the goal was to avoid overflowing on the multiply. Specifically, the code looks like udelay(long input) { long a = input * MUL_VAL; long b = a >> SHIFT_VAL; __delay(b); } so the place there's extra overflow is at computing a, not b (the input to __delay). When I modified the ARM code I went and recalculated the point at which the multiply would overflow and it matched the value from the ARM code, which is 2000us. While I can buy the argument that 2000us is still too long, the real reason I wrote the code this way is because I thought it was easier than having an error case. If you think the error is better then I'll do it that way.
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbeltwrote: > On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: >> On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: >>> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: >> Also, it would be good to replace the multiply+div64 with a single multiplication here, see how x86 and arm do it (for the tsc/__timer_delay case). >>> >>> Makes sense. I think this should do it >>> >>> >>> https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 >>> >>> but I'm finding this hard to test as this only works for 2ms sleeps. It >>> seems >>> at least in the right ballpark >> >> + if (usecs > MAX_UDELAY_US) { >> + __delay((u64)usecs * riscv_timebase / 100ULL); >> + return; >> + } >> >> You still do the 64-bit division here. What I meant is to completely >> avoid the division and use a multiply+shift. > > The goal here was to avoid the error case that ARM has on overflow and instead > just delay for the requested time. This should only divide when the delay is >>=2ms, so the division won't cost much in comparison. > > The normal case should have no division in it. > > I can copy ARM's error handling if you think that's better, but it seemed more > complicated than just computing the correct answer. I think the intention originally was to avoid overflowing the 32-bit argument in void __delay(unsigned long cycles) If you need to delay for more than 4 billion clocksource cycles, your code is still broken. >> Also, you don't need to base anything on HZ, as you do not rely >> on the delay calibration but always use a timer. > > That makes sense, I just based this blindly off the ARM version. I'll see if > that lets me avoid unnecessary overflow for ndelay. If it doesn't then I'd > prefer to just keep exactly the same constraints ARM has to avoid unexpected > behavior. Right, I should have been more specific here, as ARM has two implementations (loop and timer) and it gets much easier if you know you can rely on the timer to be available. Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt wrote: > On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: >> On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: >>> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: >> Also, it would be good to replace the multiply+div64 with a single multiplication here, see how x86 and arm do it (for the tsc/__timer_delay case). >>> >>> Makes sense. I think this should do it >>> >>> >>> https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 >>> >>> but I'm finding this hard to test as this only works for 2ms sleeps. It >>> seems >>> at least in the right ballpark >> >> + if (usecs > MAX_UDELAY_US) { >> + __delay((u64)usecs * riscv_timebase / 100ULL); >> + return; >> + } >> >> You still do the 64-bit division here. What I meant is to completely >> avoid the division and use a multiply+shift. > > The goal here was to avoid the error case that ARM has on overflow and instead > just delay for the requested time. This should only divide when the delay is >>=2ms, so the division won't cost much in comparison. > > The normal case should have no division in it. > > I can copy ARM's error handling if you think that's better, but it seemed more > complicated than just computing the correct answer. I think the intention originally was to avoid overflowing the 32-bit argument in void __delay(unsigned long cycles) If you need to delay for more than 4 billion clocksource cycles, your code is still broken. >> Also, you don't need to base anything on HZ, as you do not rely >> on the delay calibration but always use a timer. > > That makes sense, I just based this blindly off the ARM version. I'll see if > that lets me avoid unnecessary overflow for ndelay. If it doesn't then I'd > prefer to just keep exactly the same constraints ARM has to avoid unexpected > behavior. Right, I should have been more specific here, as ARM has two implementations (loop and timer) and it gets much easier if you know you can rely on the timer to be available. Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: > On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbeltwrote: >> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: > >>> Also, it would be good to replace the multiply+div64 >>> with a single multiplication here, see how x86 and arm do it >>> (for the tsc/__timer_delay case). >> >> Makes sense. I think this should do it >> >> >> https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 >> >> but I'm finding this hard to test as this only works for 2ms sleeps. It >> seems >> at least in the right ballpark > > + if (usecs > MAX_UDELAY_US) { > + __delay((u64)usecs * riscv_timebase / 100ULL); > + return; > + } > > You still do the 64-bit division here. What I meant is to completely > avoid the division and use a multiply+shift. The goal here was to avoid the error case that ARM has on overflow and instead just delay for the requested time. This should only divide when the delay is >=2ms, so the division won't cost much in comparison. The normal case should have no division in it. I can copy ARM's error handling if you think that's better, but it seemed more complicated than just computing the correct answer. > Also, you don't need to base anything on HZ, as you do not rely > on the delay calibration but always use a timer. That makes sense, I just based this blindly off the ARM version. I'll see if that lets me avoid unnecessary overflow for ndelay. If it doesn't then I'd prefer to just keep exactly the same constraints ARM has to avoid unexpected behavior.
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote: > On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: >> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: > >>> Also, it would be good to replace the multiply+div64 >>> with a single multiplication here, see how x86 and arm do it >>> (for the tsc/__timer_delay case). >> >> Makes sense. I think this should do it >> >> >> https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 >> >> but I'm finding this hard to test as this only works for 2ms sleeps. It >> seems >> at least in the right ballpark > > + if (usecs > MAX_UDELAY_US) { > + __delay((u64)usecs * riscv_timebase / 100ULL); > + return; > + } > > You still do the 64-bit division here. What I meant is to completely > avoid the division and use a multiply+shift. The goal here was to avoid the error case that ARM has on overflow and instead just delay for the requested time. This should only divide when the delay is >=2ms, so the division won't cost much in comparison. The normal case should have no division in it. I can copy ARM's error handling if you think that's better, but it seemed more complicated than just computing the correct answer. > Also, you don't need to base anything on HZ, as you do not rely > on the delay calibration but always use a timer. That makes sense, I just based this blindly off the ARM version. I'll see if that lets me avoid unnecessary overflow for ndelay. If it doesn't then I'd prefer to just keep exactly the same constraints ARM has to avoid unexpected behavior.
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbeltwrote: > On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: >> Also, it would be good to replace the multiply+div64 >> with a single multiplication here, see how x86 and arm do it >> (for the tsc/__timer_delay case). > > Makes sense. I think this should do it > > > https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 > > but I'm finding this hard to test as this only works for 2ms sleeps. It seems > at least in the right ballpark + if (usecs > MAX_UDELAY_US) { + __delay((u64)usecs * riscv_timebase / 100ULL); + return; + } You still do the 64-bit division here. What I meant is to completely avoid the division and use a multiply+shift. Also, you don't need to base anything on HZ, as you do not rely on the delay calibration but always use a timer. Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt wrote: > On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: >> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: >> Also, it would be good to replace the multiply+div64 >> with a single multiplication here, see how x86 and arm do it >> (for the tsc/__timer_delay case). > > Makes sense. I think this should do it > > > https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 > > but I'm finding this hard to test as this only works for 2ms sleeps. It seems > at least in the right ballpark + if (usecs > MAX_UDELAY_US) { + __delay((u64)usecs * riscv_timebase / 100ULL); + return; + } You still do the 64-bit division here. What I meant is to completely avoid the division and use a multiply+shift. Also, you don't need to base anything on HZ, as you do not rely on the delay calibration but always use a timer. Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: > On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbeltwrote: >> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile >> new file mode 100644 >> index ..f644e582f4b8 >> --- /dev/null >> +++ b/arch/riscv/lib/Makefile > >> + >> +void __delay(unsigned long cycles) >> +{ >> + u64 t0 = get_cycles(); >> + >> + while ((unsigned long)(get_cycles() - t0) < cycles) >> + cpu_relax(); >> +} >> + >> +void udelay(unsigned long usecs) >> +{ >> + u64 ucycles = (u64)usecs * timebase; >> + do_div(ucycles, 100U); >> + __delay((unsigned long)ucycles); >> +} >> +EXPORT_SYMBOL(udelay); >> + >> +void ndelay(unsigned long nsecs) >> +{ >> + u64 ncycles = (u64)nsecs * timebase; >> + do_div(ncycles, 10U); >> + __delay((unsigned long)ncycles); >> +} > > I'd be slightly worried about a global 'timebase' identifier that > might conflict with a variable in some random driver. Makes sense. I've renamed it to riscv_timebase https://github.com/riscv/riscv-linux/commit/ed7d769e2c14e8809c3c125e0bba2978cb6fd37b > Also, it would be good to replace the multiply+div64 > with a single multiplication here, see how x86 and arm do it > (for the tsc/__timer_delay case). Makes sense. I think this should do it https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 but I'm finding this hard to test as this only works for 2ms sleeps. It seems at least in the right ballpark [0.048000] before 1000x usleep 1000 [1.06] before 1000x nsleep 100 [2.072000] done > Arnd Thanks for the feedback. I'll incorporate this along with all the other feedback into a v2.
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote: > On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: >> diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile >> new file mode 100644 >> index ..f644e582f4b8 >> --- /dev/null >> +++ b/arch/riscv/lib/Makefile > >> + >> +void __delay(unsigned long cycles) >> +{ >> + u64 t0 = get_cycles(); >> + >> + while ((unsigned long)(get_cycles() - t0) < cycles) >> + cpu_relax(); >> +} >> + >> +void udelay(unsigned long usecs) >> +{ >> + u64 ucycles = (u64)usecs * timebase; >> + do_div(ucycles, 100U); >> + __delay((unsigned long)ucycles); >> +} >> +EXPORT_SYMBOL(udelay); >> + >> +void ndelay(unsigned long nsecs) >> +{ >> + u64 ncycles = (u64)nsecs * timebase; >> + do_div(ncycles, 10U); >> + __delay((unsigned long)ncycles); >> +} > > I'd be slightly worried about a global 'timebase' identifier that > might conflict with a variable in some random driver. Makes sense. I've renamed it to riscv_timebase https://github.com/riscv/riscv-linux/commit/ed7d769e2c14e8809c3c125e0bba2978cb6fd37b > Also, it would be good to replace the multiply+div64 > with a single multiplication here, see how x86 and arm do it > (for the tsc/__timer_delay case). Makes sense. I think this should do it https://github.com/riscv/riscv-linux/commit/d397332f6ebff42f3ecb385e9cf3284fdeda6776 but I'm finding this hard to test as this only works for 2ms sleeps. It seems at least in the right ballpark [0.048000] before 1000x usleep 1000 [1.06] before 1000x nsleep 100 [2.072000] done > Arnd Thanks for the feedback. I'll incorporate this along with all the other feedback into a v2.
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, 23 May 2017 03:47:34 PDT (-0700), ge...@linux-m68k.org wrote: > Hi Palmer, > > On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbeltwrote: >> arch/riscv/lib/Makefile | 5 ++ >> arch/riscv/lib/ashldi3.c | 42 > > At least this one has already two identical copies in arch/score/lib/ashldi3.c > and arch/sh/lib/ashldi3.c. Probably these should be moved to lib/, and built > depending on a new config symbol that is selected on score, sh, and riscv. > > Didn't check the others. Thanks. It looks like there's actually a lot of these on many architectures. I have another patch set to correct all of these that you're To'd on, I'll just pick up the first patch in the set for v2 https://lkml.org/lkml/2017/5/23/1280
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, 23 May 2017 03:47:34 PDT (-0700), ge...@linux-m68k.org wrote: > Hi Palmer, > > On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: >> arch/riscv/lib/Makefile | 5 ++ >> arch/riscv/lib/ashldi3.c | 42 > > At least this one has already two identical copies in arch/score/lib/ashldi3.c > and arch/sh/lib/ashldi3.c. Probably these should be moved to lib/, and built > depending on a new config symbol that is selected on score, sh, and riscv. > > Didn't check the others. Thanks. It looks like there's actually a lot of these on many architectures. I have another patch set to correct all of these that you're To'd on, I'll just pick up the first patch in the set for v2 https://lkml.org/lkml/2017/5/23/1280
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbeltwrote: > diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile > new file mode 100644 > index ..f644e582f4b8 > --- /dev/null > +++ b/arch/riscv/lib/Makefile > + > +void __delay(unsigned long cycles) > +{ > + u64 t0 = get_cycles(); > + > + while ((unsigned long)(get_cycles() - t0) < cycles) > + cpu_relax(); > +} > + > +void udelay(unsigned long usecs) > +{ > + u64 ucycles = (u64)usecs * timebase; > + do_div(ucycles, 100U); > + __delay((unsigned long)ucycles); > +} > +EXPORT_SYMBOL(udelay); > + > +void ndelay(unsigned long nsecs) > +{ > + u64 ncycles = (u64)nsecs * timebase; > + do_div(ncycles, 10U); > + __delay((unsigned long)ncycles); > +} I'd be slightly worried about a global 'timebase' identifier that might conflict with a variable in some random driver. Also, it would be good to replace the multiply+div64 with a single multiplication here, see how x86 and arm do it (for the tsc/__timer_delay case). Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: > diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile > new file mode 100644 > index ..f644e582f4b8 > --- /dev/null > +++ b/arch/riscv/lib/Makefile > + > +void __delay(unsigned long cycles) > +{ > + u64 t0 = get_cycles(); > + > + while ((unsigned long)(get_cycles() - t0) < cycles) > + cpu_relax(); > +} > + > +void udelay(unsigned long usecs) > +{ > + u64 ucycles = (u64)usecs * timebase; > + do_div(ucycles, 100U); > + __delay((unsigned long)ucycles); > +} > +EXPORT_SYMBOL(udelay); > + > +void ndelay(unsigned long nsecs) > +{ > + u64 ncycles = (u64)nsecs * timebase; > + do_div(ncycles, 10U); > + __delay((unsigned long)ncycles); > +} I'd be slightly worried about a global 'timebase' identifier that might conflict with a variable in some random driver. Also, it would be good to replace the multiply+div64 with a single multiplication here, see how x86 and arm do it (for the tsc/__timer_delay case). Arnd
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
Hi Palmer, On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbeltwrote: > arch/riscv/lib/Makefile | 5 ++ > arch/riscv/lib/ashldi3.c | 42 At least this one has already two identical copies in arch/score/lib/ashldi3.c and arch/sh/lib/ashldi3.c. Probably these should be moved to lib/, and built depending on a new config symbol that is selected on score, sh, and riscv. Didn't check the others. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 5/7] RISC-V: arch/riscv/lib
Hi Palmer, On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt wrote: > arch/riscv/lib/Makefile | 5 ++ > arch/riscv/lib/ashldi3.c | 42 At least this one has already two identical copies in arch/score/lib/ashldi3.c and arch/sh/lib/ashldi3.c. Probably these should be moved to lib/, and built depending on a new config symbol that is selected on score, sh, and riscv. Didn't check the others. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
[PATCH 5/7] RISC-V: arch/riscv/lib
--- arch/riscv/lib/Makefile | 5 ++ arch/riscv/lib/ashldi3.c | 42 arch/riscv/lib/ashrdi3.c | 44 + arch/riscv/lib/delay.c | 42 arch/riscv/lib/libgcc.h | 46 + arch/riscv/lib/lshrdi3.c | 42 arch/riscv/lib/memcpy.S | 99 + arch/riscv/lib/memset.S | 119 arch/riscv/lib/uaccess.S | 125 +++ 9 files changed, 564 insertions(+) create mode 100644 arch/riscv/lib/Makefile create mode 100644 arch/riscv/lib/ashldi3.c create mode 100644 arch/riscv/lib/ashrdi3.c create mode 100644 arch/riscv/lib/delay.c create mode 100644 arch/riscv/lib/libgcc.h create mode 100644 arch/riscv/lib/lshrdi3.c create mode 100644 arch/riscv/lib/memcpy.S create mode 100644 arch/riscv/lib/memset.S create mode 100644 arch/riscv/lib/uaccess.S diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile new file mode 100644 index ..f644e582f4b8 --- /dev/null +++ b/arch/riscv/lib/Makefile @@ -0,0 +1,5 @@ +lib-y := delay.o memcpy.o memset.o uaccess.o + +ifeq ($(CONFIG_64BIT),) +lib-y += ashldi3.o ashrdi3.o lshrdi3.o +endif diff --git a/arch/riscv/lib/ashldi3.c b/arch/riscv/lib/ashldi3.c new file mode 100644 index ..9fb71e82ff16 --- /dev/null +++ b/arch/riscv/lib/ashldi3.c @@ -0,0 +1,42 @@ +/* + * Copyright (C) 2014 Darius Rad+ * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ + +#include + +#include "libgcc.h" + +long long __ashldi3(long long u, word_type b) +{ + DWunion uu, w; + word_type bm; + + if (b == 0) + return u; + + uu.ll = u; + bm = 32 - b; + + if (bm <= 0) { + w.s.low = 0; + w.s.high = (unsigned int) uu.s.low << -bm; + } else { + const unsigned int carries = (unsigned int) uu.s.low >> bm; + + w.s.low = (unsigned int) uu.s.low << b; + w.s.high = ((unsigned int) uu.s.high << b) | carries; + } + + return w.ll; +} +EXPORT_SYMBOL(__ashldi3); diff --git a/arch/riscv/lib/ashrdi3.c b/arch/riscv/lib/ashrdi3.c new file mode 100644 index ..8a92e7e8de33 --- /dev/null +++ b/arch/riscv/lib/ashrdi3.c @@ -0,0 +1,44 @@ +/* + * Copyright (C) 2014 Darius Rad + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ + +#include + +#include "libgcc.h" + +long long __ashrdi3(long long u, word_type b) +{ + DWunion uu, w; + word_type bm; + + if (b == 0) + return u; + + uu.ll = u; + bm = 32 - b; + + if (bm <= 0) { + /* w.s.high = 1..1 or 0..0 */ + w.s.high = + uu.s.high >> 31; + w.s.low = uu.s.high >> -bm; + } else { + const unsigned int carries = (unsigned int) uu.s.high << bm; + + w.s.high = uu.s.high >> b; + w.s.low = ((unsigned int) uu.s.low >> b) | carries; + } + + return w.ll; +} +EXPORT_SYMBOL(__ashrdi3); diff --git a/arch/riscv/lib/delay.c b/arch/riscv/lib/delay.c new file mode 100644 index ..3c7bf85a0b04 --- /dev/null +++ b/arch/riscv/lib/delay.c @@ -0,0 +1,42 @@ +/* + * Copyright (C) 2012 Regents of the University of California + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include + +void __delay(unsigned long cycles) +{ + u64 t0 = get_cycles(); + + while ((unsigned long)(get_cycles() - t0) < cycles) + cpu_relax(); +} +
[PATCH 5/7] RISC-V: arch/riscv/lib
--- arch/riscv/lib/Makefile | 5 ++ arch/riscv/lib/ashldi3.c | 42 arch/riscv/lib/ashrdi3.c | 44 + arch/riscv/lib/delay.c | 42 arch/riscv/lib/libgcc.h | 46 + arch/riscv/lib/lshrdi3.c | 42 arch/riscv/lib/memcpy.S | 99 + arch/riscv/lib/memset.S | 119 arch/riscv/lib/uaccess.S | 125 +++ 9 files changed, 564 insertions(+) create mode 100644 arch/riscv/lib/Makefile create mode 100644 arch/riscv/lib/ashldi3.c create mode 100644 arch/riscv/lib/ashrdi3.c create mode 100644 arch/riscv/lib/delay.c create mode 100644 arch/riscv/lib/libgcc.h create mode 100644 arch/riscv/lib/lshrdi3.c create mode 100644 arch/riscv/lib/memcpy.S create mode 100644 arch/riscv/lib/memset.S create mode 100644 arch/riscv/lib/uaccess.S diff --git a/arch/riscv/lib/Makefile b/arch/riscv/lib/Makefile new file mode 100644 index ..f644e582f4b8 --- /dev/null +++ b/arch/riscv/lib/Makefile @@ -0,0 +1,5 @@ +lib-y := delay.o memcpy.o memset.o uaccess.o + +ifeq ($(CONFIG_64BIT),) +lib-y += ashldi3.o ashrdi3.o lshrdi3.o +endif diff --git a/arch/riscv/lib/ashldi3.c b/arch/riscv/lib/ashldi3.c new file mode 100644 index ..9fb71e82ff16 --- /dev/null +++ b/arch/riscv/lib/ashldi3.c @@ -0,0 +1,42 @@ +/* + * Copyright (C) 2014 Darius Rad + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ + +#include + +#include "libgcc.h" + +long long __ashldi3(long long u, word_type b) +{ + DWunion uu, w; + word_type bm; + + if (b == 0) + return u; + + uu.ll = u; + bm = 32 - b; + + if (bm <= 0) { + w.s.low = 0; + w.s.high = (unsigned int) uu.s.low << -bm; + } else { + const unsigned int carries = (unsigned int) uu.s.low >> bm; + + w.s.low = (unsigned int) uu.s.low << b; + w.s.high = ((unsigned int) uu.s.high << b) | carries; + } + + return w.ll; +} +EXPORT_SYMBOL(__ashldi3); diff --git a/arch/riscv/lib/ashrdi3.c b/arch/riscv/lib/ashrdi3.c new file mode 100644 index ..8a92e7e8de33 --- /dev/null +++ b/arch/riscv/lib/ashrdi3.c @@ -0,0 +1,44 @@ +/* + * Copyright (C) 2014 Darius Rad + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ + +#include + +#include "libgcc.h" + +long long __ashrdi3(long long u, word_type b) +{ + DWunion uu, w; + word_type bm; + + if (b == 0) + return u; + + uu.ll = u; + bm = 32 - b; + + if (bm <= 0) { + /* w.s.high = 1..1 or 0..0 */ + w.s.high = + uu.s.high >> 31; + w.s.low = uu.s.high >> -bm; + } else { + const unsigned int carries = (unsigned int) uu.s.high << bm; + + w.s.high = uu.s.high >> b; + w.s.low = ((unsigned int) uu.s.low >> b) | carries; + } + + return w.ll; +} +EXPORT_SYMBOL(__ashrdi3); diff --git a/arch/riscv/lib/delay.c b/arch/riscv/lib/delay.c new file mode 100644 index ..3c7bf85a0b04 --- /dev/null +++ b/arch/riscv/lib/delay.c @@ -0,0 +1,42 @@ +/* + * Copyright (C) 2012 Regents of the University of California + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation, version 2. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or + * NON INFRINGEMENT. See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include + +void __delay(unsigned long cycles) +{ + u64 t0 = get_cycles(); + + while ((unsigned long)(get_cycles() - t0) < cycles) + cpu_relax(); +} + +void udelay(unsigned long usecs) +{ +