From what I have seen so far, I think compression operates on the physical types. Does that mean that if you take a low-cardinality data set, say the numbers 1-1000, and combine two values so they can be placed into a single int32, the data that actually gets compressed is drawn from a much larger space of potentially unique values (1000 * 1000)? Is this the case?

Are there architectural reasons why smaller data types are packed together and stored in larger ones, or was this done to reduce the burden of implementing compression algorithms for more data types? Would adding more compressible physical types even be useful?
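To make the question concrete, here is a minimal sketch of the case where each int16 value is widened into its own int32 slot, which is how I read the std::copy description in the quoted message below. This is an assumption on my part, not something I have verified against the parquet-cpp writer, and the function names are hypothetical. Under that assumption nothing is packed together, so the number of distinct values seen by the encoder stays the same. It also narrows back with static_cast rather than reinterpret_cast.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <set>
#include <vector>

// Hypothetical helper: widen each int16_t into its own int32_t slot.
// std::copy converts element by element; no two values share a slot.
std::vector<int32_t> WidenToInt32(const std::vector<int16_t>& in) {
  std::vector<int32_t> out(in.size());
  std::copy(in.begin(), in.end(), out.begin());
  return out;
}

// Hypothetical helper: narrow back on read with a value conversion.
// The assert documents the precondition that every value fits in int16_t.
std::vector<int16_t> NarrowToInt16(const std::vector<int32_t>& in) {
  std::vector<int16_t> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(), [](int32_t v) {
    assert(v >= std::numeric_limits<int16_t>::min() &&
           v <= std::numeric_limits<int16_t>::max());
    return static_cast<int16_t>(v);
  });
  return out;
}

// Count the distinct values in a column, i.e. its cardinality.
template <typename T>
std::size_t CountDistinct(const std::vector<T>& values) {
  return std::set<T>(values.begin(), values.end()).size();
}
```

If the stored int32 column is produced this way, CountDistinct gives the same answer before and after widening; the 1000 * 1000 space would only appear if two values really were combined into one int32.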
Felipe

On Wed, Jul 26, 2017 at 1:31 PM, Wes McKinney <[email protected]> wrote:

> We are using std::copy to cast the values on the write side (from
> int16_t to int32_t for storage in Parquet)
>
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L336
>
> and then casting back on read
>
> On Wed, Jul 26, 2017 at 2:28 PM, Felipe Aramburu <[email protected]>
> wrote:
> > How does this work when you are trying to move from a representation
> > like int32 to int16? reinterpret_cast can exhibit undefined behavior
> > if you are trying to cast between types that have different sizes.
> > Should I just get the first 4 bytes and handle it manually or is
> > there a more concise way to do that?
> >
> > On Wed, Jul 26, 2017 at 12:20 PM, Felipe Aramburu <[email protected]>
> > wrote:
> >
> >> perfect, that's what I was hoping for :)
> >>
> >> On Wed, Jul 26, 2017 at 11:33 AM, Wes McKinney <[email protected]>
> >> wrote:
> >>
> >>> hi Felipe,
> >>>
> >>> In C++ it is the equivalent of
> >>>
> >>> uint64_t val = ...;
> >>> int64_t encoded_val = *reinterpret_cast<int64_t*>(&val);
> >>>
> >>> So no alteration of the bit pattern
> >>>
> >>> - Wes
> >>>
> >>> On Wed, Jul 26, 2017 at 12:18 PM, Felipe Aramburu <[email protected]>
> >>> wrote:
> >>> > https://github.com/Parquet/parquet-format/blob/master/src/thrift/parquet.thrift
> >>> >
> >>> > This file doesn't really specify how to interpret an unsigned
> >>> > type stored in a signed type.
> >>> >
> >>> > So if I make a UINT64 as my logical type but it's being stored as
> >>> > an int64, are you shifting the value or are you storing the BYTE
> >>> > representation of the UINT64 inside of an int64, or is it
> >>> > something else?
> >>> >
> >>> > I can't seem to find the code that actually converts from the
> >>> > physical types to the logical types which would also help explain
> >>> > how this happens.
> >>> >
> >>> > Felipe
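
For reference, the bit-preserving UINT64 storage described in the quoted thread can be sketched as a round trip. This is only a sketch: the function names are hypothetical, and it uses std::memcpy in place of the reinterpret_cast from the quoted snippet (my substitution, chosen because it copies the object representation without any pointer cast). It is not taken from the parquet-cpp source.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: store a uint64_t in an int64_t slot without
// altering the bit pattern.
int64_t EncodeUInt64(uint64_t val) {
  static_assert(sizeof(int64_t) == sizeof(uint64_t),
                "signed and unsigned 64-bit types must have the same size");
  int64_t encoded;
  std::memcpy(&encoded, &val, sizeof(encoded));
  return encoded;
}

// Hypothetical helper: recover the original uint64_t from the stored
// int64_t, again by copying the bytes unchanged.
uint64_t DecodeUInt64(int64_t encoded) {
  uint64_t val;
  std::memcpy(&val, &encoded, sizeof(val));
  return val;
}
```

Values above INT64_MAX show up as negative numbers in the stored int64, but decoding returns the original unsigned value because no bits change in either direction.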
