I apologize if I'm missing something basic, but why are floats being accumulated in the first place? Can't arange and linspace operations with floats be done internally along the lines of `start + np.arange(num_steps) * step_size`? I.e., always increment an integer index (rather than accumulate floats) to limit the error.
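To make that concrete, here is a rough sketch of the comparison I have in mind (the helper names are made up, and the exact digits will depend on dtype details):

```python
import numpy as np

def frange_accumulated(start, step, num, dtype=np.float32):
    # Repeatedly add the step in the target dtype; rounding error piles up
    # a little with every addition.
    out = np.empty(num, dtype=dtype)
    val = dtype(start)
    for i in range(num):
        out[i] = val
        val = dtype(val + dtype(step))
    return out

def frange_indexed(start, step, num, dtype=np.float32):
    # Scale an exact integer index instead: each point suffers only one
    # multiply and one add worth of rounding, done here in float64.
    return (start + np.arange(num) * step).astype(dtype)

a = frange_accumulated(-115.0, 0.01, 7002)
b = frange_indexed(-115.0, 0.01, 7002)
print(a[-1], b[-1])  # the accumulated version drifts much further from -44.99
```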
On Fri, Feb 9, 2018 at 3:43 PM, Benjamin Root <ben.v.r...@gmail.com> wrote:

> On Fri, Feb 9, 2018 at 12:19 PM, Chris Barker <chris.bar...@noaa.gov> wrote:

>> On Wed, Feb 7, 2018 at 12:09 AM, Ralf Gommers <ralf.gomm...@gmail.com> wrote:

>>>> It is partly a plea for some development of numerically accurate functions for computing lat/lon grids from a combination of inputs: bounds, counts, and resolutions.

>> Can you be more specific about what problems you've run into -- I work with lat-lon grids all the time, and have never had a problem.

>> float32 degrees gives you about 1 meter accuracy or better, so I can see how losing a few digits might be an issue, though I would argue that you maybe shouldn't use float32 if you are worried about anything close to 1m accuracy... -- or shift to a relative coordinate system of some sort.

> The issue isn't so much the accuracy of the coordinates themselves. I am only worried about 1km resolution (which is approximately 0.01 degrees at mid-latitudes). My concern is with consistent *construction* of a coordinate grid with even spacing. As it stands right now, if I provide a corner coordinate, a resolution, and the number of pixels, the result is not terrible (indeed, this is the approach used by gdal/rasterio). If I have start/end coordinates and the number of pixels, the result is not bad, either (use linspace). But, if I have start/end coordinates and a resolution, then determining the number of pixels from that is actually tricky to get right in the general case, especially with float32 and large grids, and especially if the bounding box specified isn't exactly divisible by the resolution.

>>>> I have been playing around with the decimal package a bit lately,

>> sigh. decimal is so often looked at as a solution to a problem it isn't designed for. lat-lon is natively sexagesimal -- maybe we need that dtype :-)

>> what you get from decimal is variable precision -- maybe a binary variable precision lib is a better answer -- that would be a good thing to have easy access to in numpy, but in this case, if you want better accuracy in a computation that will end up in float32, just use float64.

> I am not concerned about computing distances or anything like that, I am trying to properly construct my grid. I need consistent results regardless of which way the grid is specified (start/end/count, start/res/count, start/end/res). I have found that loading up the grid specs (from a config file or the command line) using the Decimal class allows me to exactly and consistently represent the grid specification, and gets me most of the way there. But the problems with arange() are frustrating, and I have to have extra logic to go around that and over to linspace() instead.

>>>> and I discovered the concept of "fused multiply-add" operations for improved accuracy. I have come to realize that fma operations could be used to greatly improve the accuracy of linspace() and arange().

>> arange() is problematic for non-integer use anyway, by its very definition (getting the "end point" correct requires the right step, even without FP error).

>> and would it really help with linspace? it's computing a delta with one division in fp, then multiplying it by an integer (represented in fp -- why? why not keep that an integer till the multiply?).
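(Interjecting with an illustration: the Decimal-based spec handling Ben describes above can be as simple as the following sketch -- the variable names and the divisibility check are mine, not from his code.)

```python
from decimal import Decimal
import numpy as np

# Grid spec strings as they might come from a config file or the command line.
start, stop, res = Decimal("-115.0"), Decimal("-44.99"), Decimal("0.01")

ngaps = (stop - start) / res                       # exactly 7001 for these inputs
assert ngaps == ngaps.to_integral_value(), "bounds not divisible by the resolution"

# Only now drop into binary floating point, with the count already pinned down.
grid = np.linspace(float(start), float(stop), int(ngaps) + 1, dtype=np.float32)
```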
> Sorry, that was a leftover from a previous draft of my email after I discovered that linspace's accuracy was on par with fma(). And while arange() has inherent problems, it can still be made better than it is now. In fact, I haven't investigated this, but I did recently discover some unit tests of mine started to fail after a numpy upgrade, and traced it back to a reduction in the accuracy of a usage of arange() with float32s. So, something got worse at some point, which means we could still get accuracy back if we can figure out what changed.

>>>> In particular, I have been needing improved results for computing latitude/longitude grids, which tend to be done in float32's to save memory (at least, this is true in data I come across).

>>> If you care about saving memory *and* accuracy, wouldn't it make more sense to do your computations in float64, and convert to float32 at the end?

>> that does seem to be the easy option :-)

> Kinda missing the point, isn't it? Isn't that like saying "convert all your data to float64s prior to calling np.mean()"? That's ridiculous. Instead, we made np.mean() upcast the inner-loop operation, and even allow an option to specify the dtype that should be used for the aggregator.

>>>> Now, to the crux of my problem. It is next to impossible to generate a non-trivial numpy array of coordinates, even in double precision, without hitting significant numerical errors.

>> I'm confused, the example you posted doesn't have significant errors...

> Hmm, "errors" was the wrong word. "Differences between methods" might be more along the lines of what I was thinking. Remember, I am looking for consistency.

>>>> Which has led me down the path of using the decimal package (which doesn't play very nicely with numpy because of the lack of casting rules for it). Consider the following:
>>>> ```
>>>> $ cat test_fma.py
>>>> from __future__ import print_function
>>>> import numpy as np
>>>> res = np.float32(0.01)
>>>> cnt = 7001
>>>> x0 = np.float32(-115.0)
>>>> x1 = res * cnt + x0
>>>> print("res * cnt + x0 = %.16f" % x1)
>>>> x = np.arange(-115.0, -44.99 + (res / 2), 0.01, dtype='float32')
>>>> print("len(arange()): %d arange()[-1]: %16f" % (len(x), x[-1]))
>>>> x = np.linspace(-115.0, -44.99, cnt, dtype='float32')
>>>> print("linspace()[-1]: %.16f" % x[-1])
>>>>
>>>> $ python test_fma.py
>>>> res * cnt + x0 = -44.9900015648454428
>>>> len(arange()): 7002 arange()[-1]: -44.975044
>>>> linspace()[-1]: -44.9900016784667969
>>>> ```
>>>> arange just produces silly results (puts out an extra element... adding half of the resolution is typically mentioned as a solution on mailing lists to get around arange()'s limitations -- I personally don't do this).

>> The real solution is "don't do that" -- arange is not the right tool for the job.

> Well, it isn't the right tool because as far as I am concerned, it is useless for anything but integers. Why not fix it to be more suitable for floating point?

>> Then there is this:
>>
>> res * cnt + x0 = -44.9900015648454428
>> linspace()[-1]: -44.9900016784667969
>>
>> that's as good as you are ever going to get with 32 bit floats...

> Consistency is the key thing. I am fine with one of those values, so long as that value is what happens no matter which way I specify my grid.
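(Aside: the "extra" element from that arange() call above is predictable -- arange's documented length is ceil((stop - start) / step), and the half-resolution fudge pushes the ratio to roughly 7001.5, which rounds up. A quick check, using the same inputs as the script:)

```python
import math
import numpy as np

res = np.float32(0.01)
start, stop = -115.0, -44.99 + float(res) / 2   # the closed-interval fudge from the script
print(math.ceil((stop - start) / 0.01))          # 7002 -- one more than cnt above
```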
>> Though I just noticed something about your numbers -- there should be a nice even base ten delta if you have 7001 gaps -- but linspace produces N points, not N gaps -- so maybe you want:
>>
>> In [17]: l = np.linspace(-115.0, -44.99, 7002)
>>
>> In [18]: l[:5]
>> Out[18]: array([-115.  , -114.99, -114.98, -114.97, -114.96])
>>
>> In [19]: l[-5:]
>> Out[19]: array([-45.03, -45.02, -45.01, -45.  , -44.99])
>>
>> or, in float32 -- not as pretty:
>>
>> In [20]: l = np.linspace(-115.0, -44.99, 7002, dtype=np.float32)
>>
>> In [21]: l[:5]
>> Out[21]: array([-115.        , -114.98999786, -114.98000336, -114.97000122, -114.95999908], dtype=float32)
>>
>> In [22]: l[-5:]
>> Out[22]: array([-45.02999878, -45.02000046, -45.00999832, -45.        , -44.99000168], dtype=float32)
>>
>> but still as good as you get with float32, and exactly the same result as computing in float64 and converting:
>>
>> In [25]: l = np.linspace(-115.0, -44.99, 7002).astype(np.float32)
>>
>> In [26]: l[:5]
>> Out[26]: array([-115.        , -114.98999786, -114.98000336, -114.97000122, -114.95999908], dtype=float32)
>>
>> In [27]: l[-5:]
>> Out[27]: array([-45.02999878, -45.02000046, -45.00999832, -45.        , -44.99000168], dtype=float32)

> Argh! I got myself mixed up between specifying pixel corners versus pixel centers. rasterio has been messing me up on this.

>>>> So, does it make any sense to improve arange by utilizing fma() under the hood?

>> no -- this is simply not the right use-case for arange() anyway.

> arange() has accuracy problems, so why not fix it?
>
> >>> l4 = np.arange(-115, -44.99, 0.01, dtype=np.float32)
> >>> np.median(np.diff(l4))
> 0.0099945068
> >>> np.float32(0.01)
> 0.0099999998
>
> There is something significantly wrong here if arange(), which takes a resolution parameter, can't seem to produce a sequence with the proper delta.

>>>> Also, any plans for making fma() available as a ufunc?

>> could be nice -- especially if used internally.

>>>> Notice that most of my examples required knowing the number of grid points ahead of time. But what if I didn't know that? What if I just have the bounds and the resolution? Then arange() is the natural fit, but as I showed, its accuracy is lacking, and you have to do some sort of hack to do a closed interval.

>> no -- it's not -- if you have the bounds and the resolution, you have an over-specified problem. That is:
>>
>> x_min + (n * delta_x) == x_max
>>
>> If there is ANY error in either delta_x or x_max (or x_min), then you'll get a mismatch, which is why arange is not the answer (you can make the algorithm a bit more accurate, I suppose, but there is still fp-limited precision -- if you can't exactly represent either delta_x or x_max, then you CAN'T use the arange() definition and expect it to work consistently).
>>
>> The "right" way to do it is to compute N with round((x_max - x_min) / delta), and then use linspace:
>>
>> linspace(x_min, x_max, N+1)
>>
>> (note that it's too bad you need to do N+1 -- if I had to do it over again, I'd use N as the number of "gaps" rather than the number of points -- that's more commonly what people want, if they care at all)
>>
>> This way, you get a grid with the endpoints as exact as they can be, and the deltas as close to each other as they can be as well.
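(In code, that recipe is just the following sketch -- the function name is mine:)

```python
import numpy as np

def grid_from_bounds_and_res(x_min, x_max, delta):
    # Derive the integer number of gaps first, then let linspace pin both
    # endpoints exactly and spread the rounding over the interior points.
    n_gaps = int(round((x_max - x_min) / delta))
    return np.linspace(x_min, x_max, n_gaps + 1)

x = grid_from_bounds_and_res(-115.0, -44.99, 0.01)
print(len(x), x[0], x[-1])   # 7002 -115.0 -44.99
```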
>> maybe you can do a better algorithm in linspace to save an ULP, but it's hard to imagine when that would matter.

> Yes, it is overspecified. My problem is that different tools require different specs (ahem... rasterio/gdal), and I have grid specs coming from other sources. And I need to produce data onto the same grid so that tools like xarray won't yell at me when I am trying to do an operation between gridded data that should have the same coordinates, but are off slightly because they were computed differently for whatever reason.
>
> I guess I am crying out for some sort of tool that will help the community stop making the same mistakes. A one-stop shop that'll allow us to specify a grid in a few different ways and still produce the right thing, and even do the inverse... provide a coordinate array and get grid specs in whatever form we want. Maybe even have options for dealing with pixel corners vs. pixel centers, too? There are additional fun problems such as padding out coordinate arrays, which np.pad doesn't really do a great job with.
>
> Cheers!
> Ben Root
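For what it's worth, a very rough sketch of what such a "one-stop shop" might look like for the three spec flavors mentioned above (everything here -- the name, the signature, and the pixel-corner assumption -- is hypothetical):

```python
import numpy as np

def make_grid(start, stop=None, count=None, res=None, dtype=np.float32):
    # Hypothetical helper: accept start/stop/count, start/res/count, or
    # start/stop/res (pixel-corner convention), reduce them all to
    # (start, stop, count), and always build the array the same way.
    if count is None:                                  # start/stop/res
        count = int(round((stop - start) / res)) + 1
    elif stop is None:                                 # start/res/count
        stop = start + res * (count - 1)
    # start/stop/count needs no adjustment.
    return np.linspace(start, stop, count).astype(dtype)  # compute in float64, downcast last

a = make_grid(-115.0, stop=-44.99, count=7002)
b = make_grid(-115.0, stop=-44.99, res=0.01)
c = make_grid(-115.0, res=0.01, count=7002)
print(np.array_equal(a, b), np.array_equal(a, c))   # all three specs should agree
```

The inverse operation (coordinate array back to a spec), corner/center conversion, and padding would have to be layered on top of something like this, and that is where it gets less trivial.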