I'm not looking to turn it off, absolutely not, I'm looking to use it in the most effective way :)
Is there something I can do right now to force these fields to be timestamp
fields in Parquet?

Regards,
-Stefán

On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:

> Got it. You can also turn off dictionary encoding with an option on the
> builder.
>
> For timestamp, the support was just released in Avro 1.8.0 and there's a
> pending pull request for adding the same logical types API to parquet-avro:
> https://github.com/apache/parquet-mr/pull/318
>
> Once that's merged, you'll just have to add conversions to your data model
> like this:
>
>   GenericData model = new GenericData();
>   model.addLogicalTypeConversion(
>       new TimeConversions.TimestampConversion());
>
> Then pass that model into the builder.
>
> rb
>
>
> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>
>> Hi Ryan,
>>
>> Thank you for taking the time.
>>
>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn on the
>> optional dictionary encoding it's used for almost anything/everything. I
>> even have some timestamp fields that are turned into a dictionary. (I
>> would have guessed delta integer.)
>>
>> I have ~5M entries in my test file; the dictionary-based one ends up at
>> 550 MB and the non-dictionary-based one ends up at 790 MB (but is still
>> faster).
>>
>> So I started wondering if I could affect these decisions to compare size,
>> speed, etc. I understand the rationale behind automatic selection; it
>> just seemed somewhat naive in that Drill scenario.
>>
>> Another matter... can you point me to an example that shows me how to
>> deal with Avro having no timestamp fields and the conversion to Parquet?
>>
>> Best regards,
>> -Stefán
>>
>>
>>
>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>
>>> Hi Stefán,
>>>
>>> The schema converter will map Avro types to their Parquet equivalents,
>>> for which there really aren't choices or options. The mapping is
>>> straightforward, like long to int64.
>>>
>>> For the individual column encodings, Parquet chooses those automatically
>>> based on the column type and data. For example, dictionary encoding is
>>> used if it gets better results than plain encoding, and integer columns
>>> always use the bit-packing and run-length encoding hybrid. There aren't
>>> many choices you would make on a per-column basis here, either.
>>>
>>> There are two options you can control that affect encodings: the
>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>> encodings are delta binary and delta integer, which can't be read by
>>> some older readers or by Apache Impala. They get great compression on
>>> certain types, though. You can also control the maximum dictionary size,
>>> which could help if you have columns that should be dictionary-encoded
>>> but are falling back to plain encoding because the dictionary gets too
>>> big.
>>>
>>> Both of those options are exposed by the builder when you create a
>>> writer:
>>>
>>>   AvroParquetWriter.builder(outputPath)
>>>       .withSchema(schema)
>>>       .withDataModel(ReflectData.get())
>>>       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>       .withDictionaryPageSize(2*1024*1024)
>>>       .build();
>>>
>>> The default dictionary page size is 1MB.
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>
>>>> I want to control/override the encoding type for a column and I find
>>>> no documentation or examples regarding that.
>>>>
>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>> wonder how I can either set or hint columns to use a particular
>>>> encoding option. Is that possible?
>>>>
>>>> Regards,
>>>> -Stefán
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
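Pulling the pieces of the thread together, here is a rough, untested sketch of how one
might get a timestamp column written once the logical-types pull request above lands in
parquet-avro. The record schema, field names, and output path are made up for
illustration; the conversion class is the Joda-based Avro 1.8.0
TimeConversions.TimestampConversion that Ryan mentions, and the builder options are the
ones from his earlier reply.

  import java.io.IOException;

  import org.apache.avro.LogicalTypes;
  import org.apache.avro.Schema;
  import org.apache.avro.SchemaBuilder;
  import org.apache.avro.data.TimeConversions;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.column.ParquetProperties;
  import org.apache.parquet.hadoop.ParquetWriter;
  import org.joda.time.DateTime;

  public class TimestampWriteSketch {
    public static void main(String[] args) throws IOException {
      // Hypothetical schema: a long field annotated with the timestamp-millis logical type.
      Schema schema = SchemaBuilder.record("Event").fields()
          .requiredString("id")
          .name("occurred_at")
              .type(LogicalTypes.timestampMillis()
                  .addToSchema(Schema.create(Schema.Type.LONG)))
              .noDefault()
          .endRecord();

      // Register the Avro 1.8.0 conversion so the data model can handle DateTime values.
      GenericData model = new GenericData();
      model.addLogicalTypeConversion(new TimeConversions.TimestampConversion());

      // Pass the model into the builder, together with the encoding options from the thread.
      ParquetWriter<GenericRecord> writer =
          AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/events.parquet"))
              .withSchema(schema)
              .withDataModel(model)
              .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
              .withDictionaryPageSize(2 * 1024 * 1024)
              .build();

      GenericRecord record = new GenericData.Record(schema);
      record.put("id", "a1");
      record.put("occurred_at", DateTime.now());
      writer.write(record);
      writer.close();
    }
  }

Whether the DateTime value actually round-trips as an INT64 timestamp column depends on
the pending pull request being merged and on the reader (Drill in this case) recognizing
the annotation, so treat this as a sketch rather than something verified.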

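Ryan's first point, that dictionary encoding can be turned off with an option on the
builder, is also easy to sketch. The method name withDictionaryEncoding below is an
assumption about which builder option he means (it is a writer-level switch, not a
per-column one), and the schema and path are placeholders.

  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.avro.AvroParquetWriter;
  import org.apache.parquet.hadoop.ParquetWriter;

  public class NoDictionarySketch {
    // Open a writer with dictionary encoding disabled for the whole file, so columns
    // fall back to the plain (and RLE/bit-packing) encodings.
    public static ParquetWriter<GenericRecord> openWriter(Schema schema, Path path)
        throws IOException {
      return AvroParquetWriter.<GenericRecord>builder(path)
          .withSchema(schema)
          .withDictionaryEncoding(false)
          .build();
    }
  }

Writing the same data with this flag on and off would give the same kind of size and
speed comparison as the 550 MB vs. 790 MB numbers above, without trying to override
encodings per column.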