Hi,

Thanks for your response.
I had already thought about using half precision. I think that it might be a good 
alternative for use-cases where the values span a very wide range. However, 
when we deal with things like temperature sensor measurements, we "waste" 
precision for high absolute values (that occur only very rarely or never at 
all) and lose precision for small values (which occur frequently). In addition 
to that, half precision is centered at zero (like float and double), and that 
might not be the case for all types of measurement values. But I think it makes 
sense to add support for half precision to parquet anyway.

One possible mapping from the encoded representation to the actual value is, 
e.g., to linearly map from given min and max measurement values to a range of 
integers with a given bit-width.
This makes it easy to trade precision for storage space by using more or fewer 
bits.
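
To make the linear mapping concrete, here is a minimal sketch in Python. It is only an illustration, not a proposed specification: the function name `make_linear_codec` is made up, and it uses unsigned integer codes starting at 0 (my original example used a signed 12-bit integer, which is just a different offset).

```python
def make_linear_codec(min_value, max_value, bit_width):
    """Return (encode, decode, step) for a linear min/max mapping.

    Hypothetical helper for illustration; codes are unsigned integers
    in the range 0 .. 2**bit_width - 1 (an assumption of this sketch).
    """
    levels = (1 << bit_width) - 1          # highest integer code
    step = (max_value - min_value) / levels

    def encode(value):
        # Nearest integer code; no range check in this sketch.
        return round((value - min_value) / step)

    def decode(code):
        return min_value + code * step

    return encode, decode, step


# The example range from the original proposal, stored in 12 bits.
encode, decode, step = make_linear_codec(-128.0, 127.9375, 12)
print(step)          # 0.0625
print(encode(25.0))  # 2448
print(decode(2448))  # 25.0
```

With the same min/max and only 8 bits, the step grows to roughly 1.0°C, so the bit width directly controls the precision/size trade-off.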

Concerning the definition for special values: there are for sure things that 
need special treatment, like handling NaNs, or handling values that fall 
outside the representable range.
Possible alternatives are:
 - Out-of-range values: clip too large/small values to the max/min values. That 
would include ±inf.
 - NaNs: reserve one of the encoded values, e.g., 0 or all-bits-1, for NaN.
 - Denormal/subnormal or ±zero values: these could just be rounded to the 
closest value that is representable with the chosen encoding.
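
A rough sketch of how these alternatives could fit together, again in Python and purely illustrative. The function names are hypothetical, and reserving the all-bits-1 code for NaN is just one of the options listed above, not a decision.

```python
import math


def encode_with_specials(value, min_value, max_value, bit_width):
    """Linear min/max encoding with clipping and a reserved NaN code.

    Assumption of this sketch: the all-bits-1 code means NaN, so the
    ordinary values use codes 0 .. 2**bit_width - 2.
    """
    nan_code = (1 << bit_width) - 1
    max_code = nan_code - 1
    if math.isnan(value):
        return nan_code
    # Clipping also maps +inf to max_value and -inf to min_value.
    clipped = min(max(value, min_value), max_value)
    # Subnormals and -0.0 need no special case: they round to the
    # nearest representable code like any other value.
    return round((clipped - min_value) * max_code / (max_value - min_value))


def decode_with_specials(code, min_value, max_value, bit_width):
    nan_code = (1 << bit_width) - 1
    if code == nan_code:
        return math.nan
    return min_value + code * (max_value - min_value) / (nan_code - 1)
```

For the sensor range -55°C to +125°C with 12 bits, `encode_with_specials(math.inf, -55.0, 125.0, 12)` yields the highest ordinary code 4094, and NaN yields 4095. Note that reserving a code for NaN slightly enlarges the step size, because one fewer code is available for ordinary values.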
 
Now that I think about it again, the name “QuantizedFloat” is probably also not 
ideal, because an IEEE float or double is of course also quantized. It’s just 
that what I have in mind is quantized more regularly within the supported interval.

Any further opinions on that?

Roman


From: Ryan Blue
Sent: Friday, November 16, 2018 18:47
To: Parquet Dev
Subject: Re: Proposal for new LogicalType: QuantizedFloat

I like this idea because we don't really have any good encoding for
floating point values other than dictionary encoding. The most effective
recommendation I have for our users is to know when to use float instead of
double, which is along the same lines.

I think the next thing to do is to make sure we have a solid definition for
quantized float. Is it just dropping bits from the significand? What about
limiting the exponent? How does it work for denormal values?

It may make sense to add support for half-precision (16-bit) floats
instead. Have you considered that option?

rb

On Thu, Nov 8, 2018 at 1:39 PM Roman Karlstetter <
[email protected]> wrote:

> Hi everyone,
>
> I want to propose a new LogicalType for parquet-format.
>
> First, I want to provide some motivation for that type.
> In a lot of cases for sensor measurement data, the value read from the
> sensor (ADC) is provided in an integer format, in many cases with a
> precision of 8 to 16 bit (and almost never 32 bit).
> However, the raw value is (almost) always converted in some way to a
> physical unit which is then further processed by applications.
> A simple example might be a temperature sensor that has a measurement
> range of -55°C to +125°C and has a precision of 0.0625°C (-> requires 12
> bit).
>
> Applications want to process such data with (single precision) floating
> point logic.
> Currently, for that reason, we would store such sensor measurement data as
> well as analysis results (statistics, ...) as floating point values in the
> parquet format.
> However, that is of course not optimal, as we're blowing up the 12 bit from
> the sensor to 32 bit of floating point data. Moreover, the floating point
> representation cannot be compressed/encoded as easily as an integer
> representation, especially with the currently supported encodings for
> floating point values.
> The DECIMAL logical type cannot represent all such cases, as it is centered
> around 0 and does not support precisions like in the example above.
>
> Now to my actual request:
> I suggest introducing a new LogicalType QuantizedFloat (name to be
> discussed), which makes it possible to represent such sensor data
> efficiently in the parquet format in integer representation, but which is
> transformed to floating point values when read in the application.
> That would require some kind of specification for the mapping of stored
> values to floating point representation, in the simplest case a linear
> mapping to a complete range of bits (for the example above: min:-128°C,
> max:127.9375°C mapped to signed 12 bit integer - the same bits might also
> be interpreted as Kelvin or even Fahrenheit, and only the min/max range
> would have to be changed).
> The uses for such a type would be manifold: it would be capable of storing
> floating point data which is known to cover only a certain absolute range
> with a limited number of bits. This is of course a lossy representation of
> values, but in many scientific or engineering applications, this is
> acceptable, especially when saving storage space.
>
> What is the process for adding something like that and what needs to be
> implemented?
>
> Kind Regards,
> Roman
>


-- 
Ryan Blue
Software Engineer
Netflix
