Great, and yes, I'm using the settings you provided me with :)

    .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
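In full, that's essentially the builder from your earlier mail; with imports and a throwaway schema so it stands on its own, it looks roughly like this (the output path and the occurred_at record schema are just placeholders, so treat it as an untested sketch):

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriterSetup {
      public static void main(String[] args) throws IOException {
        // Placeholder output path and a minimal schema with the occurred_at long field.
        Path outputPath = new Path("/tmp/events.parquet");
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
            + "[{\"name\":\"occurred_at\",\"type\":\"long\"}]}");

        // 2.0 encodings turned on, 2MB dictionary pages (the default is 1MB).
        ParquetWriter<Object> writer = AvroParquetWriter.builder(outputPath)
            .withSchema(schema)
            .withDataModel(ReflectData.get())
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            .withDictionaryPageSize(2 * 1024 * 1024)
            .build();

        writer.close();
      }
    }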
On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <[email protected]> wrote:

> Delta int64 encoding isn't released yet. We have a PR that I'm on the hook
> for getting in. :)
>
> Also, it's one of the 2.0 format encodings, so you'll need that option
> turned on.
>
> rb
>
> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>
>> thnx.
>>
>> This is a time-stamp field from a smaller sample using the new settings:
>>
>> Feb 4, 2016 11:06:43 PM INFO:
>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>> 7,058B comp}
>>
>> Any reason that comes to mind why this is not an integer delta? (The time
>> between these entries is often a few seconds.)
>>
>> -Stefan
>>
>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <[email protected]> wrote:
>>
>>> You should be getting the underlying data back instead of Timestamp
>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>> rather than waiting for them to be included in the library.
>>>
>>> rb
>>>
>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>
>>>> I'm not looking to turn it off, absolutely not, I'm looking to use it in
>>>> the most effective way :)
>>>>
>>>> Is there something I can do right now to force these fields to be
>>>> timestamp fields in Parquet?
>>>>
>>>> Regards,
>>>> -Stefan
>>>>
>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:
>>>>
>>>>> Got it. You can also turn off dictionary encoding with an option on the
>>>>> builder.
>>>>>
>>>>> For timestamp, the support was just released in Avro 1.8.0 and there's a
>>>>> pending pull request for adding the same logical types API to
>>>>> parquet-avro: https://github.com/apache/parquet-mr/pull/318
>>>>>
>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>> model like this:
>>>>>
>>>>>   GenericData model = new GenericData();
>>>>>   model.addLogicalTypeConversion(
>>>>>       new TimeConversions.TimestampConversion());
>>>>>
>>>>> Then pass that model into the builder.
>>>>>
>>>>> rb
>>>>>
>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> Thank you for taking the time.
>>>>>>
>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn on
>>>>>> the optional dictionary encoding it's used for almost
>>>>>> anything/everything. I even have some time-stamp fields that are turned
>>>>>> into a dictionary. (I would have guessed delta integer.)
>>>>>>
>>>>>> I have ~5M entries in my test file; the dictionary-based file ends up at
>>>>>> 550MB and the non-dictionary-based one at 790MB (still faster).
>>>>>>
>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>> size, speed, etc. I understand the rationale behind automatic selection;
>>>>>> it just seemed somewhat naive in that Drill scenario.
>>>>>>
>>>>>> Another matter... can you point me to an example that shows how to deal
>>>>>> with Avro having no timestamp fields and the conversion to Parquet?
>>>>>>
>>>>>> Best regards,
>>>>>> -Stefán
>>>>>>
>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Stefán,
>>>>>>>
>>>>>>> The schema converter will map Avro types to their Parquet equivalents,
>>>>>>> for which there really aren't choices or options. The mapping is
>>>>>>> straightforward, like long to int64.
>>>>>>>
>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>> automatically based on the column type and data. For example,
>>>>>>> dictionary encoding is used if it gets better results than plain
>>>>>>> encoding, and integer columns always use the bit-packing and
>>>>>>> run-length-encoding hybrid. There aren't many choices you would make
>>>>>>> on a per-column basis here, either.
>>>>>>>
>>>>>>> There are two options you can control that affect encodings: the
>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>>>> some older readers or by Apache Impala. They get great compression on
>>>>>>> certain types, though. You can also control the maximum dictionary
>>>>>>> size, which could help if you have columns that should be
>>>>>>> dictionary-encoded but are falling back to plain encoding because the
>>>>>>> dictionary gets too big.
>>>>>>>
>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>> writer:
>>>>>>>
>>>>>>>   AvroParquetWriter.builder(outputPath)
>>>>>>>       .withSchema(schema)
>>>>>>>       .withDataModel(ReflectData.get())
>>>>>>>       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>       .withDictionaryPageSize(2*1024*1024)
>>>>>>>       .build();
>>>>>>>
>>>>>>> The default dictionary page size is 1MB.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>>>
>>>>>>>> I want to control/override the encoding type for a column and I find
>>>>>>>> no documentation or examples regarding that.
>>>>>>>>
>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>>>> wonder how I can either set or hint columns to use a particular
>>>>>>>> encoding option. Is that possible?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> -Stefán
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Cloudera, Inc.
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
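PS: for when the parquet-avro logical types pull request lands, my reading of "pass that model into the builder" is roughly the sketch below. TimeConversions.TimestampConversion is the Avro 1.8.0 class from Ryan's snippet; outputPath and schema are the same placeholders as above, and none of this is tested against the pending PR:

    import org.apache.avro.data.TimeConversions;
    import org.apache.avro.generic.GenericData;

    // Data model that converts timestamp-millis logical types (Avro 1.8.0).
    // Assumes occurred_at carries the timestamp-millis logical type in the Avro schema.
    GenericData model = new GenericData();
    model.addLogicalTypeConversion(new TimeConversions.TimestampConversion());

    // Hand the conversion-aware model to the writer instead of ReflectData.get().
    ParquetWriter<Object> writer = AvroParquetWriter.builder(outputPath)
        .withSchema(schema)
        .withDataModel(model)
        .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
        .build();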
