Hi,

If we introduced such a type, personally I would prefer restricting its
range to regular numbers. I would leave -0, ±inf and the various NaNs to
the real float and double types. NULL will always be a possibility of
course, which already provides some flexibility.
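
To make this concrete, here is a minimal Python sketch of such a restricted
type, assuming the linear min/max mapping described in the quoted message
below. The names quantize and dequantize, the unsigned code layout, the
clipping of out-of-range values and the choice to store NaN and ±inf as NULL
(None) are all illustrative assumptions, not part of any Parquet
specification.

```python
import math
from typing import Optional


def quantize(value: Optional[float], lo: float, hi: float, bits: int) -> Optional[int]:
    """Map a regular float in [lo, hi] to an unsigned integer code of `bits` bits."""
    # Assumption: NaN and +-inf are not representable and are stored as NULL.
    if value is None or not math.isfinite(value):
        return None
    max_code = (1 << bits) - 1
    # Assumption: finite values outside [lo, hi] are clipped to the range.
    clipped = min(max(value, lo), hi)
    # -0.0 and 0.0 compare equal, so both end up with the same code.
    return round((clipped - lo) / (hi - lo) * max_code)


def dequantize(code: Optional[int], lo: float, hi: float, bits: int) -> Optional[float]:
    """Map an integer code back to the float it represents; NULL stays NULL."""
    if code is None:
        return None
    max_code = (1 << bits) - 1
    return lo + code * (hi - lo) / max_code


# The temperature example from the thread: 12 bits over [-128, 127.9375] degrees C.
LO, HI, BITS = -128.0, 127.9375, 12
code = quantize(25.0, LO, HI, BITS)
restored = dequantize(code, LO, HI, BITS)
```

With these example parameters the step between adjacent codes is
(127.9375 + 128) / 4095 = 0.0625, which matches the sensor precision
mentioned in the original proposal.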

Br,

Zoltan

On Tue, Nov 20, 2018 at 9:19 AM Roman Karlstetter <
[email protected]> wrote:

> Hi,
>
> thanks for your response.
> I already thought about using half-precision. I think that it might be a
> good alternative for use-cases where the values span a very wide range.
> However, when we deal with things like temperature sensor measurements, we
> "waste" precision for high absolute values (that occur only very rarely or
> never at all) and lose precision for small values (which occur frequently).
> In addition to that, half precision is centered at zero (like float and
> double), and that might not be the case for all types of measurement
> values. But I think it makes sense to add support for half precision to
> parquet anyway.
>
> One possible mapping from the encoded representation to the actual value
> is, e.g., to linearly map from given min and max measurement values to a
> range of integers with a given bit-width.
> This makes it easy to trade precision for storage space by using more or
> fewer bits.
>
> Concerning the definition for special values: there are for sure things
> that need special treatment, like handling NaNs, or handling values that
> fall outside the representable range.
> Possible alternatives are:
>  - clipping too large/small values to max/min values. That would include
> +-inf.
>  - NaNs: use one of the encoded values, e.g., 0 or all-bits-1 for NaN
>  - denormal/subnormal or +-zero values: these could just be rounded to the
> closest value that is representable with the chosen encoding.
>
> Now that I think about it again, the name “QuantizedFloat” is probably
> also not ideal, because an IEEE float or double is of course also quantized.
> It’s just that what I have in mind is more regularly quantized in the
> supported interval.
>
> Any further opinions on that?
>
> Roman
>
>
> From: Ryan Blue
> Sent: Friday, November 16, 2018 18:47
> To: Parquet Dev
> Subject: Re: Proposal for new LogicalType: QuantizedFloat
>
> I like this idea because we don't really have any good encoding for
> floating point values other than dictionary encoding. The most effective
> recommendation I have for our users is to know when to use float instead of
> double, which is along the same lines.
>
> I think the next thing to do is to make sure we have a solid definition for
> quantized float. Is it just dropping bits from the significand? What about
> limiting the exponent? How does it work for denormal values?
>
> It may make sense to add support for half-precision (16-bit) floats
> instead. Have you considered that option?
>
> rb
>
> On Thu, Nov 8, 2018 at 1:39 PM Roman Karlstetter <
> [email protected]> wrote:
>
> > Hi everyone,
> >
> > I want to propose a new LogicalType for parquet-format.
> >
> > First, I want to provide some motivation for that type.
> > In a lot of cases for sensor measurement data, the value read from the
> > sensor (ADC) is provided in an integer format, in many cases with a
> > precision of 8 to 16 bit (and almost never 32 bit).
> > However, the raw value is (almost) always converted in some way to a
> > physical unit which is then further processed by applications.
> > A simple example might be a temperature sensor that has a measurement
> > range of -55°C to +125°C and has a precision of 0.0625°C (-> requires 12
> > bit).
> >
> > Applications want to process such data with (single precision) floating
> > point logic.
> > Currently, for that reason, we would store such sensor measurement data as
> > well as analysis results (statistics, ...) as floating point values in the
> > parquet format.
> > However, that is of course not optimal, as we're blowing up the 12 bit from
> > the sensor to 32 bit of floating point data. Moreover, the floating point
> > representation cannot be compressed/encoded as easily as an integer
> > representation, especially with the currently supported encodings for
> > floating point values.
> > The DECIMAL logical type cannot represent all such cases, as it is centered
> > around 0 and does not support precisions like in the example above.
> >
> > Now to my actual request:
> > I suggest introducing a new LogicalType QuantizedFloat (name to be
> > discussed), which makes it possible to represent such sensor data
> > efficiently in the parquet format in integer representation, but which is
> > transformed to floating point values when read in the application.
> > That would require some kind of specification for the mapping of stored
> > values to floating point representation, in the simplest case a linear
> > mapping to a complete range of bits (for the example above: min:-128°C,
> > max:127.9375°C mapped to signed 12 bit integer - the same bits might also
> > be interpreted as Kelvin or even Fahrenheit, and only the min/max range
> > would have to be changed).
> > The uses for such a type would be manifold: it would be capable of storing
> > floating point data which is known to cover only a certain absolute range
> > with a limited number of bits. This is of course a lossy representation of
> > values, but in many scientific or engineering applications, this is
> > acceptable, especially when saving storage space.
> >
> > What is the process of adding something like that and what needs to be
> > implemented?
> >
> > Kind Regards,
> > Roman
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
