Re: [Chapel-developers] Floating Point Thoughts

Damian McGuckin Mon, 23 Oct 2017 03:38:55 -0700

Hi,


I probably jumped in too far, too fast, without a life jacket.

Consider the program below which defines some parameters of the IEEE 754
binary32 and binary64 floating point numbers which correspond to real(32)
and real(64) numbers.

There are 2 groups of functions defined, one for real(32) and a hopefully
orthogonal set for real(64). There is a third group which works for either
size of real(?) number.


I have defined the first two groups as 'param' because I know the values.
Sorry, thought I knew the values.

The trivial program runs fine with the type R as real(64) on line 83.

Changing R to be real(32) sends it seriously off the deep end. So I
stripped param from the first batch and it works.

I notice that

a)      Changing the type definition R to be real(32) appears to imply
        that X:real(32) is a run-time conversion. And ditto for x:int(32).
        How do I define a 32-bit real which is a compile-time constant,
        i.e. the equivalent of C's

                ((float) 1.2345)

        which is a run-time constant would be nice. If not, Chapel needs

                param t = 1.2345f;

        whereby one can assert that

                t.type == real(32)

        This will mean that the real(32) procs can then have an
        identical definition to the real(64) procs.

        That also applies to binary floating point constants.

        Actually my preference is that type version of a compile-time
        constant should result in another compile-time constant.

        Once this issue is resolved, every proc in the first group
        should then be able to be defined as a param. Am I correct?

b)      When do real numbers become citizens of full standing? While

                param _1p52 = 1 << 52;

        is a compile time expression, trying to do

                param bad = 1.0 / 0x1.0p52;

        fails. By my reading of 8.4.1 of the specification, parameter
        expressions can be applications of the binary operators +, -, *,
        /, **, ==, !=, <=, >=, <, and > on operands that are real,
        imaginary or complex parameter expressions.

        I think my use complies but obviously it does not. What have I done wrong
        above or have I just misread things? Do I need some option to
        enable this feature?

        Once this issue is resolved, every proc in the third group
        should then be able to be defined as a param. Am I correct?

For now, I have some routines in C that I would like to use in place of
fpReal??ToRaw and fpRawToReal?? in my code. They do type punning which is
not possible in Chapel itself. For now, I can sympathize with that if
only because I cannot think of an elegant, generic way to do it for the
moment.
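
As an illustration only, the C side of such routines might look like the sketch below. It is not the code referred to above. It assumes that real(32) and real(64) correspond to C's float and double, that uint(32) and uint(64) correspond to uint32_t and uint64_t, and it uses memcpy for the punning because that is well defined in standard C.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float is IEEE 754 binary32 and double is binary64. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");
_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Return the raw bit pattern of a binary64 value. */
uint64_t fpReal64ToRaw(double x)
{
    uint64_t raw;
    memcpy(&raw, &x, sizeof raw);
    return raw;
}

/* Return the raw bit pattern of a binary32 value. */
uint32_t fpReal32ToRaw(float x)
{
    uint32_t raw;
    memcpy(&raw, &x, sizeof raw);
    return raw;
}

/* Reinterpret a 64-bit pattern as a binary64 value. */
double fpRawToReal64(uint64_t raw)
{
    double x;
    memcpy(&x, &raw, sizeof x);
    return x;
}

/* Reinterpret a 32-bit pattern as a binary32 value. */
float fpRawToReal32(uint32_t raw)
{
    float x;
    memcpy(&x, &raw, sizeof x);
    return x;
}
```

With these, 1.0 as a binary64 maps to the pattern 0x3ff0000000000000 and 1.0 as a binary32 maps to 0x3f800000, which is a quick way to check that the byte copy behaves as intended.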

If I compile them into a C '.o' file, how do I link them? I do not want to
use LLVM for the moment if possible as none of the machines to which I
have easy access have a sufficiently high revision of CMake installed.


// This program is broken
//
// Note : VDS = Visible Digits in the Significand


// IEEE754 binary32 Model

proc fpOneBit(type T) param where T == real(32) return 1:uint(32);
proc fpBias(type T) param where T == real(32) return 0x7f:int(32);
proc fpEinfB(type T) param where T == real(32) return 0xff:uint(32);
proc fpEmax(type T) param where T == real(32) return +127:int(32);
proc fpEmin(type T) param where T == real(32) return -126:int(32);
proc fpVDS(type T) param where T == real(32) return 23:uint(32);
proc fpEpsISO(type T) param where T == real(32) return 0x1.0p-23:real(32);
proc fpVmax(type T) param where T == real(32) return 0x1.0p+127:real(32);
proc fpVmin(type T) param where T == real(32) return 0x1.0p-126:real(32);
proc fpDekkerSplit(type T) param where T == real(32) return 0x1.0p12:real(32);

// IEEE754 binary64 Model

proc fpOneBit(type T) param where T == real(64) return 1:uint(64);
proc fpBias(type T) param where T == real(64) return 0x3ff:int(64);
proc fpEinfB(type T) param where T == real(64) return 0x7ff:uint(64);
proc fpEmax(type T) param where T == real(64) return +1023:int(64);
proc fpEmin(type T) param where T == real(64) return -1022:int(64);
proc fpVDS(type T) param where T == real(64) return 52:uint(64);
proc fpEpsISO(type T) param where T == real(64) return 0x1.0p-52:real(64);
proc fpVmax(type T) param where T == real(64) return 0x1.0p+1023:real(64);
proc fpVmin(type T) param where T == real(64) return 0x1.0p-1022:real(64);
proc fpDekkerSplit(type T) param where T == real(64) return 0x1.0p27:real(64);

// IEEE754 binary? Model

proc fpNegMsk(type T) return fpOneBit(T) << (numBits(T) - 1);
proc fpInfRaw(type T) return fpEinfB(T) << fpVDS(T);
proc fpHuge(type T) return fpVmax(T) * (2.0:T - fpEpsISO(T));
proc fpTiny(type T) return fpVmin(T) * fpEpsISO(T);
proc fpDekker(type T) return fpDekkerSplit(T) + 1.0:T;

// These next 4 routines need to be flicked and replaced with
//
// extern proc fpReal64ToRaw(x : real(64)) : uint(64);
// extern proc fpReal32ToRaw(x : real(32)) : uint(32);
// extern proc fpRawToReal64(x : uint(64)) : real(64);
// extern proc fpRawToReal32(x : uint(32)) : real(32);
//
// These routines will be written in C - how do I link them??

proc fpReal64ToRaw(x : real(64)) : uint(64)
{
        return 1234:uint(64);
}

proc fpReal32ToRaw(x : real(32)) : uint(32)
{
        return 1234:uint(32);
}

proc fpRawToReal64(x : uint(64)) : real(64)
{
        return 789.0:real(64);
}

proc fpRawToReal32(x : uint(32)) : real(32)
{
        return 789.0:real(32);
}

proc fpRawToReal(x : uint(32)) : real(32) return fpRawToReal32(x);
proc fpRawToReal(x : uint(64)) : real(64) return fpRawToReal64(x);
proc fpRealToRaw(x : real(32)) : uint(32) return fpReal32ToRaw(x);
proc fpRealToRaw(x : real(64)) : uint(64) return fpReal64ToRaw(x);

module T
{
        proc main()
        {
                param _1p52 = 1 << 52;
                var bad = 1.0 / 0x1.0p52;
                // param bad = 1.0 / _1p52;
                type R = real(64);
                type U = uint(numBits(R));
                var x : R = fpRawToReal(10:U);
                var t = 2.0:R;

                t -= fpEpsISO(t.type);
                t -= 2.0:R;

                writeln("Epsilon(64) ", fpEpsISO(real(64)), " matches ", bad);
                writeln("p = ", fpVDS(x.type), " for real(?) where ? = ", numBits(R));
                writeln("Largest! Normal Float ", fpHuge(x.type));
                writeln("Smallest Normal Float ", fpVmin(x.type));
                writeln("Smallest Actual Float ", fpTiny(x.type));
                writeln("Sign Bit Mask is ", fpNegMsk(x.type));
                writeln("Inf. Raw Bits is ", fpInfRaw(x.type));
                writeln("Dummy call to check : ", fpRealToRaw(t));
                writeln("Dummy call to check : ", fpRawToReal(_1p52));
        }
}
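
The constants in the two models above can be cross-checked against C's <float.h>. The sketch below is only such a check, under the assumption that float and double are IEEE 754 binary32 and binary64. The C function names simply mirror the Chapel ones with a 32 or 64 suffix, and fpModelMatchesFloatH is a hypothetical helper introduced for this illustration.

```c
#include <float.h>
#include <stdint.h>

/* binary32 model: mirrors fpEpsISO, fpVmax and fpVmin for real(32) */
static const float eps32  = 0x1.0p-23f;
static const float vmax32 = 0x1.0p+127f;
static const float vmin32 = 0x1.0p-126f;

/* binary64 model: mirrors fpEpsISO, fpVmax and fpVmin for real(64) */
static const double eps64  = 0x1.0p-52;
static const double vmax64 = 0x1.0p+1023;
static const double vmin64 = 0x1.0p-1022;

/* Largest finite value: Vmax * (2 - eps) */
float  fpHuge32(void) { return vmax32 * (2.0f - eps32); }
double fpHuge64(void) { return vmax64 * (2.0 - eps64); }

/* Smallest subnormal value: Vmin * eps */
float  fpTiny32(void) { return vmin32 * eps32; }
double fpTiny64(void) { return vmin64 * eps64; }

/* Raw bits of +infinity: all-ones biased exponent shifted past the fraction */
uint32_t fpInfRaw32(void) { return (uint32_t)0xff << 23; }
uint64_t fpInfRaw64(void) { return (uint64_t)0x7ff << 52; }

/* Returns 1 when the model agrees with the limits published in <float.h>. */
int fpModelMatchesFloatH(void)
{
    return eps32 == FLT_EPSILON && vmin32 == FLT_MIN && fpHuge32() == FLT_MAX
        && eps64 == DBL_EPSILON && vmin64 == DBL_MIN && fpHuge64() == DBL_MAX;
}
```

Every product here is a multiplication by a power of two, so the results are exact and the comparisons with == are safe.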

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

