Re: [Chapel-developers] Floating Point Thoughts

Brad Chamberlain Mon, 23 Oct 2017 04:41:27 -0700

Hi Damian --


I'm not going to be the best correspondent this week, but some of the things 
you're running into relate to the "Chapel does not currently do much 
compile-time optimization of floating point 'param' values" comment I made 
earlier which, in retrospect, should've really read "much compile-time 
computation on floating point param values."


Specifically, we historically haven't supported compile-time computation on 
floating point values because we were nervous (maybe wisely, maybe out of 
paranoia, though most recent conversations have suggested the latter) that the 
compiler's floating point semantics might differ from the execution-time node's 
semantics and lead to confusion.  This comes up in the following GitHub issue 
and is something I think we're very open to reconsidering as the language 
continues to evolve:

https://github.com/chapel-lang/chapel/issues/5922


The weak support for 32-bit param reals is, I believe, more of an 
implementation issue than anything intentional.  This issue relates to that:


https://github.com/chapel-lang/chapel/issues/5122


The one other note I'll throw in here is that we've tried very hard to avoid 
the need for suffixes on literal values because it seems somewhat funny and 
somewhat difficult to extend; but I think there are questions about whether the 
current approach (cast real literal to the type you want) is sufficient and/or 
whether suffix-based support would be more convenient/less typing.


I'll leave it at that for now, but others may chime in with more detail or 
thoughts,

-Brad








________________________________
From: Damian McGuckin <[email protected]>
Sent: Monday, October 23, 2017 3:38:24 AM
To: David Keaton
Cc: [email protected]
Subject: Re: [Chapel-developers] Floating Point Thoughts


Hi,

I probably jumped in too far, too fast, without a life jacket.

Consider the program below, which defines some parameters of the IEEE 754
binary32 and binary64 floating point formats corresponding to real(32)
and real(64) numbers.

Two groups of functions are defined, one for real(32) and a hopefully
orthogonal set for real(64). There is a third group which works for either
size of real(?) number.

I have defined the first two groups as 'param' because I know the values.
Sorry, thought I knew the values.

The trivial program runs fine with the type R as real(64) on line 83.

Changing R to be real(32) sends it seriously off the deep end. So I
stripped param from the first batch and it works.

I notice that

a)      Changing the type definition R to be real(32) appears to imply
         that X:real(32) is a run-time conversion. And ditto for x:int(32).
         How do I define a 32-bit value which is a compile-time constant?
         The equivalent of C's

                 ((float) 1.2345)

         would be nice. If that is not possible, Chapel needs

                 param t = 1.2345f;

         whereby one can assert that

                 t.type == real(32)

         This will mean that the real(32) procs can then have
         definitions identical in form to the real(64) procs.

         That also applies to binary floating point constants.

         Actually my preference is that type conversion of a compile-time
         constant should result in another compile-time constant.

         Once this issue is resolved, every proc in the first group
         should then be able to be defined as a param. Am I correct?

b)      When do real numbers become citizens of full standing? While

                 param _1p52 = 1 << 52;

         is a compile time expression, trying to do

                 param bad = 1.0 / 0x1.0p52;

         fails. By my reading of section 8.4.1 of the specification,
         parameter expressions can be applications of the binary operators
         +, -, *, /, **, ==, !=, <=, >=, <, and > on operands that are real,
         imaginary or complex parameter expressions.

         I think my usage complies, but obviously it does not. What have I
         done wrong above, or have I just misread things? Do I need some
         option to enable this feature?

         Once this issue is resolved, every proc in the third group
         should then be able to be defined as a param. Am I correct?

For now, I have some routines in C that I would like to use in place of
fpReal??ToRaw and fpRawToReal?? in my code. They do type punning, which is
not possible in Chapel itself. For now, I can sympathize with that, if
only because I cannot think of an elegant, generic way to do it at the
moment.

If I compile them into a C '.o' file, how do I link them? I do not want to
use LLVM for the moment if possible, as none of the machines to which I
have easy access has a sufficiently recent version of CMake installed.
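
A minimal sketch of what such C routines might look like, using the names
from the extern declarations in the program below. It assumes that C's float
and double are IEEE 754 binary32 and binary64 on the target and that they
line up with Chapel's real(32) and real(64). The bytes are copied with memcpy
rather than punned through pointer casts, which sidesteps C's strict-aliasing
rules. This is an illustration only, not code from any Chapel library.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float is IEEE 754 binary32 and double is binary64. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Return the raw bit pattern of a binary32 value. */
uint32_t fpReal32ToRaw(float x)
{
    uint32_t raw;
    memcpy(&raw, &x, sizeof raw);
    return raw;
}

/* Return the raw bit pattern of a binary64 value. */
uint64_t fpReal64ToRaw(double x)
{
    uint64_t raw;
    memcpy(&raw, &x, sizeof raw);
    return raw;
}

/* Reinterpret a 32-bit pattern as a binary32 value. */
float fpRawToReal32(uint32_t raw)
{
    float x;
    memcpy(&x, &raw, sizeof x);
    return x;
}

/* Reinterpret a 64-bit pattern as a binary64 value. */
double fpRawToReal64(uint64_t raw)
{
    double x;
    memcpy(&x, &raw, sizeof x);
    return x;
}
```

The Chapel C-interoperability documentation describes passing C headers and
object files to chpl alongside the .chpl sources; the exact invocation for a
given release is best checked there.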

// This program is broken
//
// Note : VDS = Visible Digits in the Significand


// IEEE754 binary32 Model

proc fpOneBit(type T) param where T == real(32) return 1:uint(32);
proc fpBias(type T) param where T == real(32) return 0x7f:int(32);
proc fpEinfB(type T) param where T == real(32) return 0xff:uint(32);
proc fpEmax(type T) param where T == real(32) return +127:int(32);
proc fpEmin(type T) param where T == real(32) return -126:int(32);
proc fpVDS(type T) param where T == real(32) return 23:uint(32);
proc fpEpsISO(type T) param where T == real(32) return 0x1.0p-23:real(32);
proc fpVmax(type T) param where T == real(32) return 0x1.0p+127:real(32);
proc fpVmin(type T) param where T == real(32) return 0x1.0p-126:real(32);
proc fpDekkerSplit(type T) param where T == real(32) return 0x1.0p12:real(32);

// IEEE754 binary64 Model

proc fpOneBit(type T) param where T == real(64) return 1:uint(64);
proc fpBias(type T) param where T == real(64) return 0x3ff:int(64);
proc fpEinfB(type T) param where T == real(64) return 0x7ff:uint(64);
proc fpEmax(type T) param where T == real(64) return +1023:int(64);
proc fpEmin(type T) param where T == real(64) return -1022:int(64);
proc fpVDS(type T) param where T == real(64) return 52:uint(64);
proc fpEpsISO(type T) param where T == real(64) return 0x1.0p-52:real(64);
proc fpVmax(type T) param where T == real(64) return 0x1.0p+1023:real(64);
proc fpVmin(type T) param where T == real(64) return 0x1.0p-1022:real(64);
proc fpDekkerSplit(type T) param where T == real(64) return 0x1.0p27:real(64);

// IEEE754 binary? Model

proc fpNegMsk(type T) return fpOneBit(T) << (numBits(T) - 1);
proc fpInfRaw(type T) return fpEinfB(T) << fpVDS(T);
proc fpHuge(type T) return fpVmax(T) * (2.0:T - fpEpsISO(T));
proc fpTiny(type T) return fpVmin(T) * fpEpsISO(T);
proc fpDekker(type T) return fpDekkerSplit(T) + 1.0:T;

// These next 4 routines need to be removed and replaced with
//
// extern proc fpReal64ToRaw(x : real(64)) : uint(64);
// extern proc fpReal32ToRaw(x : real(32)) : uint(32);
// extern proc fpRawToReal64(x : uint(64)) : real(64);
// extern proc fpRawToReal32(x : uint(32)) : real(32);
//
// These routines will be written in C - how do I link them??

proc fpReal64ToRaw(x : real(64)) : uint(64)
{
         return 1234:uint(64);
}

proc fpReal32ToRaw(x : real(32)) : uint(32)
{
         return 1234:uint(32);
}

proc fpRawToReal64(x : uint(64)) : real(64)
{
         return 789.0:real(64);
}

proc fpRawToReal32(x : uint(32)) : real(32)
{
         return 789.0:real(32);
}

proc fpRawToReal(x : uint(32)) : real(32) return fpRawToReal32(x);
proc fpRawToReal(x : uint(64)) : real(64) return fpRawToReal64(x);
proc fpRealToRaw(x : real(32)) : uint(32) return fpReal32ToRaw(x);
proc fpRealToRaw(x : real(64)) : uint(64) return fpReal64ToRaw(x);

module T
{
         proc main()
         {
                 param _1p52 = 1 << 52;
                 var bad = 1.0 / 0x1.0p52;
                 // param bad = 1.0 / _1p52;
                 type R = real(64);
                 type U = uint(numBits(R));
                 var x : R = fpRawToReal(10:U);
                 var t = 2.0:R;

                 t -= fpEpsISO(t.type);
                 t -= 2.0:R;

                 writeln("Epsilon(64) ", fpEpsISO(real(64)), " matches ", bad);
                 writeln("p = ", fpVDS(x.type), " for real(?) where ? = ", numBits(R));
                 writeln("Largest! Normal Float ", fpHuge(x.type));
                 writeln("Smallest Normal Float ", fpVmin(x.type));
                 writeln("Smallest Actual Float ", fpTiny(x.type));
                 writeln("Sign Bit Mask is ", fpNegMsk(x.type));
                 writeln("Inf. Raw Bits is ", fpInfRaw(x.type));
                 writeln("Dummy call to check : ", fpRealToRaw(t));
                 writeln("Dummy call to check : ", fpRawToReal(_1p52));
         }
}
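
The model constants in the program above can be cross-checked against the
values a C implementation publishes in <float.h>. The sketch below assumes a
C99 (or later) compiler whose float and double are IEEE 754 binary32 and
binary64. The function names fp32ModelMatches and fp64ModelMatches are
hypothetical and exist only for this illustration.

```c
#include <float.h>

/* Returns 1 if the binary32 model constants agree with <float.h>. */
int fp32ModelMatches(void)
{
    const float eps  = 0x1.0p-23f;           /* fpEpsISO(real(32)) */
    const float vmax = 0x1.0p+127f;          /* fpVmax(real(32))   */
    const float vmin = 0x1.0p-126f;          /* fpVmin(real(32))   */
    const float huge = vmax * (2.0f - eps);  /* fpHuge(real(32))   */

    return eps == FLT_EPSILON
        && vmin == FLT_MIN
        && huge == FLT_MAX
        && FLT_MANT_DIG - 1 == 23            /* fpVDS(real(32))    */
        && FLT_MAX_EXP - 1 == 127            /* fpEmax(real(32))   */
        && FLT_MIN_EXP - 1 == -126;          /* fpEmin(real(32))   */
}

/* Returns 1 if the binary64 model constants agree with <float.h>. */
int fp64ModelMatches(void)
{
    const double eps  = 0x1.0p-52;           /* fpEpsISO(real(64)) */
    const double vmax = 0x1.0p+1023;         /* fpVmax(real(64))   */
    const double vmin = 0x1.0p-1022;         /* fpVmin(real(64))   */
    const double huge = vmax * (2.0 - eps);  /* fpHuge(real(64))   */

    return eps == DBL_EPSILON
        && vmin == DBL_MIN
        && huge == DBL_MAX
        && DBL_MANT_DIG - 1 == 52            /* fpVDS(real(64))    */
        && DBL_MAX_EXP - 1 == 1023           /* fpEmax(real(64))   */
        && DBL_MIN_EXP - 1 == -1022;         /* fpEmin(real(64))   */
}
```

Note that <float.h> counts the implicit leading bit in the significand and
states exponent limits for a significand in [0.5, 1), hence the "- 1"
adjustments against the values used in the Chapel program.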

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Chapel-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/chapel-developers
