Thanks.

This is a timestamp field from a smaller sample using the new settings:
Feb 4, 2016 11:06:43 PM INFO:
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
[occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
7,058B comp}

Any reason that comes to mind why this is not an integer delta? (The time
between these entries is often just a few seconds.)

-Stefan


On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <[email protected]> wrote:

> You should be getting the underlying data back instead of Timestamp
> objects. You can pull in Avro 1.8.0 and use the conversions yourself rather
> than waiting for them to be included in the library.
>
> rb
>
>
> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>
>> I'm not looking to turn it off, absolutely not, I'm looking to use it in
>> the most effective way :)
>>
>> Is there something I can do right now to force these fields to be
>> timestamp fields in Parquet?
>>
>> Regards,
>>   -Stefan
>>
>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:
>>
>> Got it. You can also turn off dictionary encoding with an option on the
>>> builder.
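>>>
>>> For example, roughly (an untested sketch; check the builder methods in
>>> your parquet-mr version):
>>>
>>>    AvroParquetWriter.builder(outputPath)
>>>        .withSchema(schema)
>>>        .withDictionaryEncoding(false)  // disable dictionary encoding for all columns
>>>        .build();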
>>>
>>> For timestamp, the support was just released in Avro 1.8.0 and there's a
>>> pending pull request for adding the same logical types API to
>>> parquet-avro:
>>> https://github.com/apache/parquet-mr/pull/318
>>>
>>> Once that's merged, you'll just have to add conversions to your data
>>> model like this:
>>>
>>>    GenericData model = new GenericData();
>>>    model.addLogicalTypeConversion(
>>>        new TimeConversions.TimestampConversion());
>>>
>>> Then pass that model into the builder.
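>>>
>>> For example, something like this (a rough sketch; same builder as in my
>>> earlier message below):
>>>
>>>    AvroParquetWriter.builder(outputPath)
>>>        .withSchema(schema)
>>>        .withDataModel(model)  // the GenericData instance with the conversion registered
>>>        .build();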
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>
>>> Hi Ryan,
>>>>
>>>> Thank you for taking the time.
>>>>
>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn on
>>>> the optional dictionary encoding, it's used for almost everything. Even
>>>> some timestamp fields are turned into dictionaries (I would have guessed
>>>> delta integer).
>>>>
>>>> I have ~5M entries in my test file; the dictionary-based file ends up at
>>>> 550 MB and the non-dictionary-based one at 790 MB (still faster).
>>>>
>>>> So I started wondering if I could affect these decisions to compare
>>>> size, speed, etc. I understand the rationale behind automatic selection;
>>>> it just seemed somewhat naive in that Drill scenario.
>>>>
>>>> Another matter... can you point me to an example that shows how to deal
>>>> with Avro having no timestamp fields when converting to Parquet?
>>>>
>>>> Best regards,
>>>>    -Stefán
>>>>
>>>>
>>>>
>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>>>
>>>> Hi Stefán,
>>>>
>>>>>
>>>>> The schema converter will map Avro types to their Parquet equivalents,
>>>>> for which there really aren't any choices or options. The mapping is
>>>>> straightforward, like long to int64.
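>>>>>
>>>>> For example (a sketch; avroSchema here stands for your Avro Schema
>>>>> instance):
>>>>>
>>>>>     // Avro schema -> Parquet MessageType, e.g. Avro long -> Parquet int64
>>>>>     MessageType parquetSchema = new AvroSchemaConverter().convert(avroSchema);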
>>>>>
>>>>> For the individual column encodings, Parquet chooses those automatically
>>>>> based on the column type and data. For example, dictionary encoding is
>>>>> used if it gets better results than plain encoding, and integer columns
>>>>> always use the bit-packing and run-length encoding hybrid. There aren't
>>>>> many choices you would make on a per-column basis here, either.
>>>>>
>>>>> There are two options you can control that affect encodings: the
>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>> some older readers or by Apache Impala; they get great compression on
>>>>> certain types, though. You can also control the maximum dictionary size,
>>>>> which could help if you have columns that should be dictionary-encoded
>>>>> but are falling back to plain encoding because the dictionary gets too
>>>>> big.
>>>>>
>>>>> Both of those options are exposed by the builder when you create a
>>>>> writer:
>>>>>
>>>>>     AvroParquetWriter.builder(outputPath)
>>>>>         .withSchema(schema)
>>>>>         .withDataModel(ReflectData.get())
>>>>>         .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>         .withDictionaryPageSize(2*1024*1024)
>>>>>         .build();
>>>>>
>>>>> The default dictionary page size is 1MB.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>>
>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>
>>>>>> I want to control/override the encoding type for a column and I find no
>>>>>> documentation or examples regarding that.
>>>>>>
>>>>>> My schema (MessageType) is converted with AvroSchemaConverter, and I
>>>>>> wonder how I can either set, or hint at, a particular encoding option
>>>>>> for a column. Is that possible?
>>>>>>
>>>>>> Regards,
>>>>>>     -Stefán
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>>>
>>>>>
>>>>>
>>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
