Hi everyone, I want to propose a new LogicalType for parquet-format.
First, I want to provide some motivation for that type. In many cases, sensor measurement data is read from the sensor (ADC) in an integer format, often with a precision of 8 to 16 bits (and almost never 32 bits). However, the raw value is (almost) always converted in some way to a physical unit, which is then further processed by applications. A simple example is a temperature sensor that has a measurement range of -55°C to +125°C and a precision of 0.0625°C (180°C / 0.0625°C = 2880 steps, which requires 12 bits). Applications want to process such data with (single precision) floating point logic.

For that reason, we currently store such sensor measurement data, as well as analysis results (statistics, ...), as floating point values in the parquet format. However, that is of course not optimal, as we are blowing up the 12 bits from the sensor to 32 bits of floating point data. Moreover, the floating point representation cannot be compressed/encoded as easily as an integer representation, especially with the currently supported encodings for floating point values. The DECIMAL logical type cannot represent all such cases, as it is centered around 0 and does not support precisions like the one in the example above.

Now to my actual request: I suggest introducing a new LogicalType QuantizedFloat (name to be discussed), which makes it possible to represent such sensor data efficiently in the parquet format in integer representation, but which is transformed to floating point values when read in the application. That would require some kind of specification for the mapping of stored values to the floating point representation, in the simplest case a linear mapping to a complete range of bits (for the example above: min: -128°C, max: 127.9375°C mapped to a signed 12 bit integer; the same bits could also be interpreted as Kelvin or even Fahrenheit, and only the min/max range would have to be changed).
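
To make the linear mapping concrete, here is a minimal sketch in Python. It only illustrates the idea and is not a proposed specification: the names QuantizedFloatSpec, min_value, max_value and bit_width are hypothetical and not part of parquet-format, and a signed two's-complement stored integer is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantizedFloatSpec:
    """Hypothetical mapping parameters (illustration only, not parquet-format API)."""

    min_value: float  # physical value represented by the smallest stored integer
    max_value: float  # physical value represented by the largest stored integer
    bit_width: int    # width of the signed stored integer in bits

    @property
    def stored_min(self) -> int:
        # Smallest signed integer for this width, e.g. -2048 for 12 bits.
        return -(1 << (self.bit_width - 1))

    @property
    def stored_max(self) -> int:
        # Largest signed integer for this width, e.g. 2047 for 12 bits.
        return (1 << (self.bit_width - 1)) - 1

    @property
    def step(self) -> float:
        # Physical distance between two adjacent stored integers.
        return (self.max_value - self.min_value) / (self.stored_max - self.stored_min)

    def encode(self, value: float) -> int:
        # Round to the nearest step, then clamp to the representable range (lossy).
        raw = round((value - self.min_value) / self.step) + self.stored_min
        return max(self.stored_min, min(self.stored_max, raw))

    def decode(self, stored: int) -> float:
        # Linear mapping from the stored integer back to the physical value.
        return self.min_value + (stored - self.stored_min) * self.step


# The temperature example: signed 12 bit integers covering -128°C .. 127.9375°C.
celsius = QuantizedFloatSpec(min_value=-128.0, max_value=127.9375, bit_width=12)

stored = celsius.encode(25.0)      # integer that would be written to the file
restored = celsius.decode(stored)  # float that the application would read back
```

With these parameters the step size works out to 0.0625°C, matching the sensor. Reinterpreting the same stored integers as Kelvin or Fahrenheit would only mean constructing the spec with a different min_value and max_value.
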
The uses for such a type would be manifold: it would be capable of storing floating point data that is known to cover only a certain absolute range with a limited number of bits. This is of course a lossy representation of values, but in many scientific or engineering applications this is acceptable, especially when it saves storage space.

What is the process for adding something like that, and what needs to be implemented?

Kind Regards,
Roman
