The intent was for binary to store the minimum number of bytes for each unscaled value, so the first reading is the right one: the encoded length is value dependent. Use fixed if you want to store all values with the same number of bytes, because that avoids writing a length for each byte array. Binary works well for the case you described, where you have a large precision but enough small values to offset the cost of storing the length.
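
To make the difference concrete, here is a rough Python sketch of the two length rules. It is not taken from any Parquet implementation. The function names are made up for illustration, and it assumes the unscaled value is written as big-endian two's complement, as the decimal section of the spec describes.

```python
def encode_min_bytes(unscaled: int) -> bytes:
    """Hypothetical helper: big-endian two's complement in the fewest bytes.

    This is the per-value rule for binary, so the length depends on the value.
    """
    # Magnitude bits plus one sign bit. For negatives, ~x gives the right
    # magnitude because -2**(8n-1) still fits in n bytes.
    magnitude = unscaled if unscaled >= 0 else ~unscaled
    n_bits = magnitude.bit_length() + 1
    length = max(1, (n_bits + 7) // 8)
    return unscaled.to_bytes(length, byteorder="big", signed=True)


def fixed_len_for_precision(precision: int) -> int:
    """Hypothetical helper: bytes needed for the largest value of a precision.

    This is the per-column rule you would use to size a fixed-length array.
    """
    largest = 10 ** precision - 1
    return len(encode_min_bytes(largest))


if __name__ == "__main__":
    for value in (8, 550, -1, -129):
        print(value, encode_min_bytes(value).hex())
    print("precision 38 needs", fixed_len_for_precision(38), "bytes")
```

With binary, '8' takes one byte and '550' takes two, each with a length written alongside it. With fixed, every value in a precision 38 column takes 16 bytes and no length is written.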
rb

On Wed, Nov 16, 2016 at 2:41 PM, Henry Robinson <[email protected]> wrote:

> Hi -
>
> I'm adding binary encoding support for decimal to Impala, and have one
> question about some wording in the spec:
>
> "binary: precision is not limited, but is required. The minimum number of
> bytes to store the unscaled value should be used"
>
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>
> When the spec says 'the minimum number of bytes', which of the following
> does that mean:
>
> * the minimum number of bytes to store a particular unscaled value must be
> used (so for '8' it's one byte, for '550' it's two bytes and so on), and
> the encoded length is value dependent.
>
> or
>
> * the minimum number of bytes for the given precision must be used (so all
> values in a given column should have the same byte length).
>
> If it's the latter, the implementation is much easier because
> FIXED_LEN_BYTE_ARRAY becomes a special case of BINARY, but the former
> offers more opportunity for compact representations on a high precision
> column that in practice has low precision values.
>
> Thanks,
> Henry

--
Ryan Blue
Software Engineer
Netflix
