Hi,

If we introduced such a type, I would personally prefer restricting its range to regular numbers. I would leave -0, ±inf and the various NaNs to the real float and double types. NULL will always be a possibility, of course, which already provides some flexibility.

Br,
Zoltan

On Tue, Nov 20, 2018 at 9:19 AM Roman Karlstetter <[email protected]> wrote:

> Hi,
>
> thanks for your response.
> I already thought about using half-precision. I think that it might be a
> good alternative for use cases where the values span a very wide range.
> However, when we deal with things like temperature sensor measurements,
> we "waste" precision on high absolute values (which occur only very
> rarely or never at all) and lose precision for small values (which occur
> frequently).
> In addition to that, half precision is centered at zero (like float and
> double), and that might not be the case for all types of measurement
> values. But I think it makes sense to add support for half precision to
> parquet anyway.
>
> One possible mapping from the encoded representation to the actual value
> is, e.g., to linearly map from given min and max measurement values to a
> range of integers with a given bit-width.
> This makes it easy to trade precision for storage space by using more or
> fewer bits.
>
> Concerning the definition for special values: there are for sure things
> that need special treatment, like handling NaNs, or handling values that
> fall outside the representable range.
> Possible alternatives are:
> - clipping too large/small values to max/min values. That would include
>   +-inf.
> - NaNs: use one of the encoded values, e.g., 0 or all-bits-1 for NaN
> - denormal/subnormal or +-zero values: these could just be rounded to
>   the closest value that is representable with the chosen encoding.
>
> Now that I think about it again, the name “QuantizedFloat” is probably
> also not ideal, because an IEEE float or double is of course also
> quantized. It’s just that what I have in mind is more regularly
> quantized in the supported interval.
>
> Any further opinions on that?
>
> Roman
>
>
> From: Ryan Blue
> Sent: Friday, 16 November 2018 18:47
> To: Parquet Dev
> Subject: Re: Proposal for new LogicalType: QuantizedFloat
>
> I like this idea because we don't really have any good encoding for
> floating point values other than dictionary encoding. The most effective
> recommendation I have for our users is to know when to use float instead
> of double, which is along the same lines.
>
> I think the next thing to do is to make sure we have a solid definition
> for quantized float. Is it just dropping bits from the significand? What
> about limiting the exponent? How does it work for denormal values?
>
> It may make sense to add support for half-precision (16-bit) floats
> instead. Have you considered that option?
>
> rb
>
> On Thu, Nov 8, 2018 at 1:39 PM Roman Karlstetter <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I want to propose a new LogicalType for parquet-format.
> >
> > First, I want to provide some motivation for that type.
> > In a lot of cases for sensor measurement data, the value read from the
> > sensor (ADC) is provided in an integer format, in many cases with a
> > precision of 8 to 16 bits (and almost never 32 bits).
> > However, the raw value is (almost) always converted in some way to a
> > physical unit which is then further processed by applications.
> > A simple example might be a temperature sensor that has a measurement
> > range of -55°C to +125°C and a precision of 0.0625°C (-> requires 12
> > bits).
> >
> > Applications want to process such data with (single precision)
> > floating point logic.
> > Currently, for that reason, we would store such sensor measurement
> > data as well as analysis results (statistics, ...) as floating point
> > values in the parquet format.
> > However, that is of course not optimal, as we're blowing up the 12
> > bits from the sensor to 32 bits of floating point data. Moreover, the
> > floating point representation cannot be compressed/encoded as easily
> > as an integer representation, especially with the currently supported
> > encodings for floating point values.
> > The DECIMAL logical type cannot represent all such cases, as it is
> > centered around 0 and does not support precisions like in the example
> > above.
> >
> > Now to my actual request:
> > I suggest introducing a new LogicalType QuantizedFloat (name to be
> > discussed), which makes it possible to represent such sensor data
> > efficiently in the parquet format in integer representation, but which
> > is transformed to floating point values when read in the application.
> > That would require some kind of specification for the mapping of
> > stored values to floating point representation, in the simplest case a
> > linear mapping to a complete range of bits (for the example above:
> > min: -128°C, max: 127.9375°C mapped to a signed 12-bit integer; the
> > same bits might also be interpreted as Kelvin or even Fahrenheit, and
> > only the min/max range would have to be changed).
> > The uses for such a type would be manifold: it would be capable of
> > storing floating point data which is known to cover only a certain
> > absolute range with a limited number of bits. This is of course a
> > lossy representation of values, but in many scientific or engineering
> > applications, this is acceptable, especially when saving storage
> > space.
> >
> > What is the process of adding something like that and what needs to be
> > implemented?
> >
> > Kind Regards,
> > Roman
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
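
For reference, the linear min/max mapping discussed in this thread, together with clipping of out-of-range values and an optional reserved NaN code, could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed specification: the function name make_quantizer, the reserve_nan flag, the use of unsigned codes starting at 0, and the choice of the all-bits-1 code for NaN are all hypothetical and do not exist in parquet-format.

```python
import math


def make_quantizer(min_value, max_value, bit_width, reserve_nan=False):
    """Return (encode, decode) for a linear mapping of
    [min_value, max_value] onto unsigned integer codes.

    Assumption: codes are unsigned and start at 0. If reserve_nan is
    set, the all-bits-1 code is set aside to represent NaN.
    """
    top_code = (1 << bit_width) - 1
    nan_code = top_code if reserve_nan else None
    max_code = top_code - 1 if reserve_nan else top_code
    step = (max_value - min_value) / max_code  # value distance between codes

    def encode(x):
        if math.isnan(x):
            if nan_code is None:
                raise ValueError("NaN is not representable without reserve_nan")
            return nan_code
        # Clip out-of-range values (including +-inf) to the interval.
        x = min(max(x, min_value), max_value)
        # Round to the nearest code; -0.0 and subnormals land on a regular code.
        return min(max_code, round((x - min_value) / step))

    def decode(code):
        if nan_code is not None and code == nan_code:
            return math.nan
        return min_value + code * step

    return encode, decode


# The example from the thread: -128 °C .. 127.9375 °C in 12 bits
# gives a step of exactly 0.0625 °C.
encode, decode = make_quantizer(-128.0, 127.9375, 12)
code = encode(21.5)
value = decode(code)
```

With reserve_nan=True one code is lost to NaN, so the step for the same interval and bit width becomes slightly larger than 0.0625 °C. Whether that trade-off is preferable to relying on NULL alone is exactly the open question raised above.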
