Hi everyone,

I want to propose a new LogicalType for parquet-format.

First, I want to provide some motivation for that type.
For sensor measurement data, the value read from the sensor (ADC) is
usually provided as an integer, in many cases with a precision of 8 to
16 bits (and almost never 32 bits).
However, the raw value is (almost) always converted in some way to a
physical unit which is then further processed by applications.
A simple example might be a temperature sensor that has a measurement
range of -55°C to +125°C and a precision of 0.0625°C (-> requires 12
bits).
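
As a quick check of the 12 bit figure, here is a small Python sketch of the
arithmetic. The variable names are only illustrative, and it assumes that
every step of 0.0625°C within the range is a distinct reading:

```python
import math

# Values taken from the temperature sensor example above.
t_min, t_max, resolution = -55.0, 125.0, 0.0625

# Number of distinct readings the sensor can report (both ends included).
levels = round((t_max - t_min) / resolution) + 1

# Bits needed to give each reading its own integer code.
bits = math.ceil(math.log2(levels))

print(levels, bits)
```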

Applications want to process such data with (single precision) floating
point logic.
Currently, for that reason, we would store such sensor measurement data as
well as analysis results (statistics, ...) as floating point values in the
parquet format.
However, that is of course not optimal, as we're blowing up the 12 bits from
the sensor to 32 bits of floating point data. Moreover, the floating point
representation cannot be compressed/encoded as easily as an integer
representation, especially with the currently supported encodings for
floating point values.
The DECIMAL logical type cannot represent all such cases, as it is centered
around 0 and does not support precisions like in the example above.

Now to my actual request:
I suggest introducing a new LogicalType QuantizedFloat (name to be
discussed), which makes it possible to represent such sensor data
efficiently in the parquet format in integer representation, but which is
transformed to floating point values when read in the application.
That would require some kind of specification for the mapping of stored
values to floating point representation, in the simplest case a linear
mapping onto the complete range of the stored bits (for the example above:
min: -128°C, max: 127.9375°C mapped to a signed 12-bit integer - the same
bits might also be interpreted as Kelvin or even Fahrenheit, and only the
min/max range would have to be changed).
The uses for such a type would be manifold: it would be capable of storing
floating point data which is known to cover only a certain absolute range
with a limited number of bits. This is of course a lossy representation of
values, but in many scientific or engineering applications, this is
acceptable, especially when it saves storage space.
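
To make the linear mapping described above more concrete, here is a minimal
Python sketch. The class LinearQuantization and its fields (min_value,
max_value, bit_width) are hypothetical names for illustration only; nothing
like this exists in parquet-format today, and the actual parameters and
rounding rules would be up for discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearQuantization:
    """Hypothetical mapping parameters; not part of parquet-format."""
    min_value: float  # physical value of the smallest stored integer
    max_value: float  # physical value of the largest stored integer
    bit_width: int    # width of the signed stored integer in bits

    @property
    def step(self) -> float:
        # Physical size of one integer step.
        return (self.max_value - self.min_value) / (2 ** self.bit_width - 1)

    def encode(self, value: float) -> int:
        # Clamp to the range, then round to the nearest step (the lossy part).
        clamped = min(max(value, self.min_value), self.max_value)
        offset = round((clamped - self.min_value) / self.step)
        # Shift so that min_value maps to the most negative signed integer.
        return offset - 2 ** (self.bit_width - 1)

    def decode(self, stored: int) -> float:
        offset = stored + 2 ** (self.bit_width - 1)
        return self.min_value + offset * self.step


# The example from above: -128°C .. 127.9375°C in a signed 12-bit integer.
celsius = LinearQuantization(min_value=-128.0, max_value=127.9375, bit_width=12)
stored = celsius.encode(25.03)   # quantized to the nearest 0.0625°C step
print(stored, celsius.decode(stored))

# Same stored bits interpreted as Kelvin: only min/max change.
kelvin = LinearQuantization(min_value=145.15, max_value=401.0875, bit_width=12)
print(kelvin.decode(stored))     # roughly 298.15 K
```

In this sketch the reader only needs min, max and the bit width to turn the
stored integers back into floats, so a change of unit is purely a metadata
change and the stored column stays untouched.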

What is the process for adding something like that, and what needs to be
implemented?

Kind Regards,
Roman
