If you have a low-cardinality column, its distinct values will very likely be written out in a dictionary page, and the dictionary indices in the rest of the column chunk will be RLE-encoded and bit-packed (and then compressed). So there isn't much concern about wasted storage unless you disable dictionary encoding (though I'm not sure why you would). I believe all of the writers have dictionary encoding enabled by default for all column types.
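
As a rough illustration of why the wider physical type costs little, here is a minimal sketch of dictionary encoding. The names DictEncoded, DictionaryEncode, and BitWidth are hypothetical and are not parquet-cpp APIs; the real writers also apply the RLE/bit-packing step and general-purpose compression, which this sketch leaves out.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only, not the parquet-cpp implementation.
struct DictEncoded {
  std::vector<int32_t> dictionary;  // distinct values, in first-seen order
  std::vector<uint32_t> indices;    // one dictionary index per input value
};

// Replace each value with its index into a dictionary of distinct values.
DictEncoded DictionaryEncode(const std::vector<int32_t>& values) {
  DictEncoded out;
  std::unordered_map<int32_t, uint32_t> seen;
  for (int32_t v : values) {
    auto it = seen.find(v);
    if (it == seen.end()) {
      // First occurrence: assign the next free index and record the value.
      it = seen.emplace(v, static_cast<uint32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}

// Number of bits needed to bit-pack indices in the range [0, dict_size).
int BitWidth(std::size_t dict_size) {
  int bits = 0;
  while ((std::size_t{1} << bits) < dict_size) ++bits;
  return bits;
}
```

With 1000 distinct values, BitWidth(1000) is 10, so each index needs 10 bits before RLE and compression, regardless of whether the dictionary entries themselves are stored as int32.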
On Wed, Jul 26, 2017 at 2:48 PM, Felipe Aramburu wrote:
> Compression happens on physical types, I think, given what I have seen
> so far. Does that mean that you can take a low-cardinality data set,
> let's say the numbers 1-1000, but because you are combining two values
> together to be placed into an int32, the cardinality of the data that
> actually gets compressed is drawn from a much larger space of
> (1000 * 1000) potentially unique values? Is this the case? Are there
> architectural reasons that smaller data types are packed together and
> stored in larger ones, or was this done to reduce the burden of
> implementing compression algorithms on more data types? Would adding
> more compressible physical_types even be useful?
>
> Felipe
>
> On Wed, Jul 26, 2017 at 1:31 PM, Wes McKinney wrote:
>
>> We are using std::copy to cast the values on the write side (from
>> int16_t to int32_t for storage in Parquet)
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L336
>>
>> and then casting back on read
>>
>> On Wed, Jul 26, 2017 at 2:28 PM, Felipe Aramburu wrote:
>> > How does this work when you are trying to move from a representation
>> > like int32 to int16? reinterpret_cast can exhibit undefined behavior
>> > if you are trying to cast between types that have different sizes.
>> > Should I just get the first 4 bytes and handle it manually or is
>> > there a more concise way to do that?
>> >
>> > On Wed, Jul 26, 2017 at 12:20 PM, Felipe Aramburu wrote:
>> >
>> >> perfect, that's what I was hoping for :)
>> >>
>> >> On Wed, Jul 26, 2017 at 11:33 AM, Wes McKinney wrote:
>> >>
>> >>> hi Felipe,
>> >>>
>> >>> In C++ it is the equivalent of
>> >>>
>> >>> uint64_t val = ...;
>> >>> int64_t encoded_val = *reinterpret_cast<int64_t*>(&val);
>> >>>
>> >>> So no alteration of the bit pattern
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Wed, Jul 26, 2017 at 12:18 PM, Felipe Aramburu wrote:
>> >>> > https://github.com/Parquet/parquet-format/blob/master/src/thrift/parquet.thrift
>> >>> >
>> >>> > This file doesn't really specify how to interpret an unsigned
>> >>> > type stored in a signed type.
>> >>> >
>> >>> > So if I make a UINT64 my logical type but it's being stored as
>> >>> > an int64, are you shifting the value, or are you storing the
>> >>> > BYTE representation of the UINT64 inside of an int64, or is it
>> >>> > something else?
>> >>> >
>> >>> > I can't seem to find the code that actually converts from the
>> >>> > physical types to the logical types, which would also help
>> >>> > explain how this happens.
>> >>> >
>> >>> > Felipe
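
For reference, the two conversions discussed in the quoted thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in writer.cc: the helper names are hypothetical, std::memcpy is swapped in for the reinterpret_cast shown above as another way to keep the bit pattern unchanged, and the narrowing step assumes every stored value originally came from int16_t data.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Widen int16 values to int32 for storage. std::copy converts each element
// by value, so no bytes are reinterpreted and differing sizes are not an issue.
std::vector<int32_t> WidenToInt32(const std::vector<int16_t>& in) {
  std::vector<int32_t> out(in.size());
  std::copy(in.begin(), in.end(), out.begin());
  return out;
}

// Narrow back on read. Assumption: every value fits in int16_t because it
// was written from int16_t data in the first place.
std::vector<int16_t> NarrowToInt16(const std::vector<int32_t>& in) {
  std::vector<int16_t> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(),
                 [](int32_t v) { return static_cast<int16_t>(v); });
  return out;
}

// Store a uint64 in an int64 slot without altering the bit pattern.
int64_t EncodeUInt64(uint64_t val) {
  int64_t encoded;
  std::memcpy(&encoded, &val, sizeof(encoded));
  return encoded;
}

// Recover the original uint64 from the stored int64.
uint64_t DecodeUInt64(int64_t encoded) {
  uint64_t val;
  std::memcpy(&val, &encoded, sizeof(val));
  return val;
}
```

The size-changing case goes through element-wise value conversion, while the same-size signed/unsigned case is a plain byte copy, so neither needs a cast between pointers to types of different sizes.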
