Hi Stefán,
The schema converter will map Avro types to their Parquet equivalents,
and there really aren't any choices or options to make there. The
mapping is straightforward, like Avro long to Parquet int64.
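If you want to see exactly what the converter produces, you can run it
yourself and print the result. Here's a minimal sketch, where avroSchema
stands in for your own Avro Schema instance:

    import org.apache.avro.Schema;
    import org.apache.parquet.avro.AvroSchemaConverter;
    import org.apache.parquet.schema.MessageType;

    // avroSchema is assumed to be your own Avro Schema instance.
    MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);
    // An Avro long field prints as "required int64 <name>".
    System.out.println(parquetSchema);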
For the individual column encodings, Parquet chooses those automatically
based on the column type and data. For example, dictionary encoding is
used if it gets better results than plain encoding, and integer columns
always use the hybrid of run-length encoding and bit packing. There
aren't many choices you would make on a per-column basis here, either.
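If you're curious which encodings were actually chosen, you can read
them back from the file footer. Here's a sketch (the file path is made
up, and this needs to run in code that can throw IOException):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
    import org.apache.parquet.hadoop.metadata.ParquetMetadata;

    // Print the encodings that were chosen for each column chunk.
    ParquetMetadata footer = ParquetFileReader.readFooter(
        new Configuration(), new Path("/path/to/file.parquet"));
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData column : block.getColumns()) {
        System.out.println(column.getPath() + ": " + column.getEncodings());
      }
    }

The parquet-tools "meta" command prints the same information from the
command line.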
There are two options you can control that affect encodings: the
dictionary page size and whether to use the 2.0 encodings. The 2.0
encodings are delta binary packed and the delta byte array encodings,
which can't be read by some older readers or by Apache Impala. They get
great compression on certain types, though. The dictionary page size is
also the maximum dictionary size, so raising it could help if you have
columns that should be dictionary-encoded but are falling back to plain
encoding because the dictionary gets too big.
Both of those options are exposed by the builder when you create a writer:
import org.apache.avro.reflect.ReflectData;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;

// MyRecord is a placeholder for your own reflect-mapped class.
ParquetWriter<MyRecord> writer = AvroParquetWriter.<MyRecord>builder(outputPath)
    .withSchema(schema)
    .withDataModel(ReflectData.get())
    .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
    .withDictionaryPageSize(2 * 1024 * 1024)  // 2MB
    .build();
The default dictionary page size is 1MB.
rb
On 02/04/2016 01:35 PM, Stefán Baxter wrote:
Hi,
I'm using parquet-mr/parquet-avro to write parquet files.
I want to control/override the encoding type for a column and I find no
documentation or examples regarding that.
My schema (MessageType) is converted with AvroSchemaConverter and I wonder
how I can either set or hint columns to use a particular encoding option.
Is that possible?
Regards,
-Stefán
--
Ryan Blue
Software Engineer
Cloudera, Inc.