Hi Ryan,

Thank you for taking the time.
I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn on the optional dictionary encoding it's used for almost everything. I even have some timestamp fields that are turned into a dictionary (I would have guessed delta integer). I have ~5M entries in my test file; the dictionary-based file ends up at 550 MB and the non-dictionary-based one at 790 MB (but is still faster). So I started wondering if I could affect these decisions to compare size, speed, etc. I understand the rationale behind automatic selection; it just seemed somewhat naive in the Drill scenario.

Another matter: can you point me to an example that shows how to deal with Avro having no timestamp fields when converting to Parquet?

Best regards,
-Stefán

On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:

> Hi Stefán,
>
> The schema converter will map Avro types to their Parquet equivalents, for
> which there aren't really any choices or options. The mapping is
> straightforward, like long to int64.
>
> For the individual column encodings, Parquet chooses those automatically
> based on the column type and data. For example, dictionary encoding is
> used if it gets better results than plain encoding, and integer columns
> always use the bit-packing/run-length-encoding hybrid. There aren't many
> choices you would make on a per-column basis here, either.
>
> There are two options you can control that affect encodings: the
> dictionary page size and whether to use the 2.0 encodings. The 2.0
> encodings are delta binary and delta integer, which can't be read by some
> older readers or by Apache Impala. They get great compression on certain
> types, though. You can also control the maximum dictionary size, which
> could help if you have columns that should be dictionary-encoded but are
> falling back to plain encoding because the dictionary gets too big.
> Both of those options are exposed by the builder when you create a writer:
>
>   AvroParquetWriter.builder(outputPath)
>       .withSchema(schema)
>       .withDataModel(ReflectData.get())
>       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>       .withDictionaryPageSize(2 * 1024 * 1024)
>       .build();
>
> The default dictionary page size is 1MB.
>
> rb
>
> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>
>> Hi,
>>
>> I'm using parquet-mr/parquet-avro to write parquet files.
>>
>> I want to control/override the encoding type for a column and I find no
>> documentation or examples regarding that.
>>
>> My schema (MessageType) is converted with AvroSchemaConverter and I
>> wonder how I can either set or hint columns to use a particular encoding
>> option. Is that possible?
>>
>> Regards,
>> -Stefán
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
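The builder snippet quoted above can be filled out into a complete writer sketch. Everything beyond the builder calls is an illustrative assumption, not from the thread: the schema, the record, and the output path are made up, and it assumes parquet-avro (1.8+) and its Hadoop dependencies are on the classpath. Since pre-1.8 Avro has no timestamp logical type, the sketch stores the timestamp as a plain long of epoch millis, which is one common workaround for the question raised above.

```java
// Sketch only: assumes parquet-avro and hadoop-client on the classpath.
// Schema, record contents, and output path are hypothetical.
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriterExample {
    public static void main(String[] args) throws Exception {
        // Avro (pre-1.8) has no timestamp type, so "ts" is a long of epoch millis.
        Schema schema = SchemaBuilder.record("Event").fields()
                .requiredLong("ts")
                .requiredString("name")
                .endRecord();

        Path outputPath = new Path("file:///tmp/events.parquet");

        // The two encoding-related options from the thread:
        // 2.0 writer version (enables delta encodings) and a 2 MB
        // dictionary page size (default is 1 MB).
        try (ParquetWriter<GenericRecord> writer =
                AvroParquetWriter.<GenericRecord>builder(outputPath)
                        .withSchema(schema)
                        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                        .withDictionaryPageSize(2 * 1024 * 1024)
                        .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("ts", System.currentTimeMillis());
            record.put("name", "example");
            writer.write(record);
        }
    }
}
```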
