If a column has low cardinality (few distinct values), those values will
very likely be written out in a dictionary page, and the dictionary
indices in the rest of the column chunk will be RLE-encoded and
bit-packed (and then compressed). So wasted storage should not be much
of a concern unless you disable dictionary encoding (but I'm not sure
why you would). I believe all of the writers have dictionary encoding
enabled by default for all column types.
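
To make the point concrete, here is a rough sketch of the idea. It is not
parquet-cpp code: WidenToInt32, DictEncode and BitWidthFor are made-up names
for illustration, and the real RLE/bit-packing hybrid and the compression
step are left out. It only shows that each int16 value is widened into its
own int32 slot (nothing is packed together) and that the index bit width
depends on the number of distinct values, not on the physical type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical container for a dictionary-encoded column (illustration only).
struct DictEncoded {
  std::vector<int32_t> dictionary;  // distinct values, in order of first appearance
  std::vector<uint32_t> indices;    // one dictionary index per input value
  int bit_width = 0;                // bits needed per index if bit-packed
};

// Bits needed to represent every index in [0, num_entries - 1].
int BitWidthFor(std::size_t num_entries) {
  int width = 0;
  while (num_entries > 1 && (std::size_t{1} << width) < num_entries) ++width;
  return width;
}

// Widen int16 values to int32 with std::copy: one value per int32 slot.
std::vector<int32_t> WidenToInt32(const std::vector<int16_t>& narrow) {
  std::vector<int32_t> wide(narrow.size());
  std::copy(narrow.begin(), narrow.end(), wide.begin());
  return wide;
}

// Replace each value by its index into a dictionary of distinct values.
DictEncoded DictEncode(const std::vector<int32_t>& values) {
  DictEncoded out;
  std::map<int32_t, uint32_t> seen;  // value -> dictionary index
  for (int32_t v : values) {
    auto it = seen.find(v);
    if (it == seen.end()) {
      it = seen.emplace(v, static_cast<uint32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  out.bit_width = BitWidthFor(out.dictionary.size());
  assert(out.indices.size() == values.size());
  return out;
}
```

With an int16 column cycling through the numbers 1-1000, DictEncode on the
widened values gives a 1000-entry dictionary and a bit width of 10, so the
indices cost about 10 bits per value before RLE and compression, even though
the physical type is int32.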

On Wed, Jul 26, 2017 at 2:48 PM, Felipe Aramburu <[email protected]> wrote:
> Compression happens on physical types, I think, given what I have seen so
> far. Does that mean that if you take a low-cardinality data set, let's
> say the numbers 1-1000, but combine two values together to be placed into
> an int32, the cardinality of the data that actually gets compressed is
> drawn from a much larger space (1000 * 1000 potentially unique values)?
> Is this the case? Are there architectural
> reasons that smaller data types are packed together and stored in larger
> ones, or was this done to reduce the burden of implementing compression
> algorithms on more datatypes? Would adding more compressible physical_types
> even be useful?
>
> Felipe
>
>
>
> On Wed, Jul 26, 2017 at 1:31 PM, Wes McKinney <[email protected]> wrote:
>
>> We are using std::copy to cast the values on the write side (from
>> int16_t to int32_t for storage in Parquet)
>>
>> https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L336
>>
>> and then casting back on read
>>
>> On Wed, Jul 26, 2017 at 2:28 PM, Felipe Aramburu <[email protected]>
>> wrote:
>> > How does this work when you are trying to move from a representation like
>> > int32 to int16? reinterpret_cast can exhibit undefined behavior if you are
>> > trying to cast between types that have different sizes. Should I just get
>> > the first 4 bytes and handle it manually or is there a more concise way to
>> > do that?
>> >
>> > On Wed, Jul 26, 2017 at 12:20 PM, Felipe Aramburu <[email protected]>
>> > wrote:
>> >
>> >> Perfect, that's what I was hoping for :)
>> >>
>> >> On Wed, Jul 26, 2017 at 11:33 AM, Wes McKinney <[email protected]>
>> >> wrote:
>> >>
>> >>> hi Felipe,
>> >>>
>> >>> In C++ it is the equivalent of
>> >>>
>> >>> uint64_t val = ...;
>> >>> int64_t encoded_val = *reinterpret_cast<int64_t*>(&val);
>> >>>
>> >>> So no alteration of the bit pattern
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Wed, Jul 26, 2017 at 12:18 PM, Felipe Aramburu <[email protected]>
>> >>> wrote:
>> >>> > https://github.com/Parquet/parquet-format/blob/master/src/thrift/parquet.thrift
>> >>> >
>> >>> >
>> >>> > This file doesn't really specify how to interpret an unsigned type
>> >>> > stored in a signed type.
>> >>> >
>> >>> > So if I make UINT64 my logical type but it's being stored as an
>> >>> > int64, are you shifting the value, or are you storing the byte
>> >>> > representation of the UINT64 inside of an int64, or is it something
>> >>> > else?
>> >>> >
>> >>> > I can't seem to find the code that actually converts from the
>> >>> > physical types to the logical types, which would also help explain
>> >>> > how this happens.
>> >>> >
>> >>> > Felipe
>> >>>
>> >>
>> >>
>>
