Hi Ryan,

Can you tell me when the int64 delta encoding will be available as part of a release, and whether Drill will need an updated Parquet version to read it?
Regards,
-Stefan

On Thu, Feb 4, 2016 at 11:25 PM, Stefán Baxter <[email protected]> wrote:

> great, and yes, I'm using the settings you provided me with :)
>
>     .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>
> On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <[email protected]> wrote:
>
>> Delta int64 encoding isn't released yet. We have a PR that I'm on the
>> hook for getting in. :)
>>
>> Also, it's one of the 2.0 format encodings, so you'll need that option
>> turned on.
>>
>> rb
>>
>> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>>
>>> thnx.
>>>
>>> This is a timestamp field from a smaller sample using the new settings:
>>>
>>> Feb 4, 2016 11:06:43 PM INFO:
>>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>>> 7,058B comp}
>>>
>>> Any reason that comes to mind why this is not integer delta? (The time
>>> between these entries is often only a few seconds.)
>>>
>>> -Stefan
>>>
>>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <[email protected]> wrote:
>>>
>>>> You should be getting the underlying data back instead of Timestamp
>>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>>> rather than waiting for them to be included in the library.
>>>>
>>>> rb
>>>>
>>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>>
>>>>> I'm not looking to turn it off, absolutely not; I'm looking to use it
>>>>> in the most effective way :)
>>>>>
>>>>> Is there something I can do right now to force these fields to be
>>>>> timestamp fields in Parquet?
>>>>>
>>>>> Regards,
>>>>> -Stefan
>>>>>
>>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <[email protected]> wrote:
>>>>>
>>>>>> Got it. You can also turn off dictionary encoding with an option on
>>>>>> the builder.
>>>>>>
>>>>>> For timestamp, the support was just released in Avro 1.8.0 and
>>>>>> there's a pending pull request for adding the same logical types API
>>>>>> to parquet-avro:
>>>>>> https://github.com/apache/parquet-mr/pull/318
>>>>>>
>>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>>> model like this:
>>>>>>
>>>>>>     GenericData model = new GenericData();
>>>>>>     model.addLogicalTypeConversion(
>>>>>>         new TimeConversions.TimestampConversion());
>>>>>>
>>>>>> Then pass that model into the builder.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>> Thank you for taking the time.
>>>>>>>
>>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>>>>> on the optional dictionary encoding it's used for almost
>>>>>>> anything/everything. I even have some timestamp fields that are
>>>>>>> turned into dictionaries. (I would have guessed delta integer.)
>>>>>>>
>>>>>>> I have ~5M entries in my test file; the dictionary-based file ends
>>>>>>> up 550 MB and the non-dictionary-based one ends up 790 MB (still
>>>>>>> faster).
>>>>>>>
>>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>>> size, speed, etc. I understand the rationale behind automatic
>>>>>>> selection; it just seemed somewhat naive in the Drill scenario.
>>>>>>>
>>>>>>> Another matter... can you point me to an example that shows how to
>>>>>>> deal with Avro having no timestamp fields when converting to
>>>>>>> Parquet?
>>>>>>>
>>>>>>> Best regards,
>>>>>>> -Stefán
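For reference, here is a minimal, self-contained sketch of how the conversion Ryan describes might be wired up end to end. It is an illustration under assumptions, not thread-confirmed code: it assumes Avro 1.8.0 on the classpath and that the pending parquet-avro pull request linked above has been merged. The record name Event and the output path are hypothetical; occurred_at simply mirrors the field from the log output above:

    import org.apache.avro.LogicalTypes;
    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.data.TimeConversions;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class TimestampConversionSketch {
      public static void main(String[] args) throws Exception {
        // A long field carrying the timestamp-millis logical type.
        Schema tsType = LogicalTypes.timestampMillis()
            .addToSchema(Schema.create(Schema.Type.LONG));
        Schema schema = SchemaBuilder.record("Event").fields()  // name is illustrative
            .name("occurred_at").type(tsType).noDefault()
            .endRecord();

        // Register the conversion on the data model, as in Ryan's snippet.
        GenericData model = new GenericData();
        model.addLogicalTypeConversion(
            new TimeConversions.TimestampConversion());

        // Pass that model into the writer builder.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(
                         new Path("/tmp/events.parquet"))  // placeholder path
                     .withSchema(schema)
                     .withDataModel(model)
                     .build()) {
          // writer.write(record) calls would now accept records whose
          // occurred_at values are Joda-Time DateTime objects, converting
          // them to the underlying int64 millis on write.
        }
      }
    }

In Avro 1.8.0 the conversions are Joda-Time based; without the conversion registered, the field would have to be populated with plain long values instead.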
>>>>>>>
>>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Stefán,
>>>>>>>>
>>>>>>>> The schema converter will map Avro types to their Parquet
>>>>>>>> equivalents, for which there really aren't choices or options. The
>>>>>>>> mapping is straightforward, like long to int64.
>>>>>>>>
>>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>>> automatically based on the column type and data. For example,
>>>>>>>> dictionary encoding is used if it gets better results than plain
>>>>>>>> encoding, and integer columns always use the bit-packing and
>>>>>>>> run-length encoding hybrid. There aren't many choices you would
>>>>>>>> make on a per-column basis here, either.
>>>>>>>>
>>>>>>>> There are two options you can control that affect encodings: the
>>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>>> encodings are delta binary and delta integer, which can't be read
>>>>>>>> by some older readers or by Apache Impala. They get great
>>>>>>>> compression on certain types, though. You can also control the
>>>>>>>> maximum dictionary size, which could help if you have columns that
>>>>>>>> should be dictionary-encoded but are falling back to plain encoding
>>>>>>>> because the dictionary gets too big.
>>>>>>>>
>>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>>> writer:
>>>>>>>>
>>>>>>>>     AvroParquetWriter.builder(outputPath)
>>>>>>>>         .withSchema(schema)
>>>>>>>>         .withDataModel(ReflectData.get())
>>>>>>>>         .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>>         .withDictionaryPageSize(2*1024*1024)
>>>>>>>>         .build();
>>>>>>>>
>>>>>>>> The default dictionary page size is 1MB.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm using parquet-mr/parquet-avro to write Parquet files.
>>>>>>>>>
>>>>>>>>> I want to control/override the encoding type for a column and I
>>>>>>>>> find no documentation or examples regarding that.
>>>>>>>>>
>>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and
>>>>>>>>> I wonder how I can either set or hint columns to use a particular
>>>>>>>>> encoding option. Is that possible?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> -Stefán
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Cloudera, Inc.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Cloudera, Inc.
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
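Pulling the options from this thread together, a complete writer setup might look like the sketch below. The path and schema parameters are placeholders, and withDictionaryEncoding(false) is the standard ParquetWriter builder option that should correspond to the dictionary toggle Ryan mentions; treat it as a hedged sketch rather than confirmed thread code:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriterOptionsSketch {
      // Opens a writer with the 2.0 encodings and a larger dictionary ceiling.
      public static ParquetWriter<GenericRecord> open(Path path, Schema schema)
          throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(path)
            .withSchema(schema)
            .withDataModel(GenericData.get())
            // Opt in to the 2.0 format encodings (the delta encodings).
            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
            // Double the 1MB default before columns fall back to plain.
            .withDictionaryPageSize(2 * 1024 * 1024)
            // Set to false to benchmark file size without dictionaries.
            .withDictionaryEncoding(true)
            .build();
      }
    }

Keep Ryan's caveat in mind: files written with the PARQUET_2_0 encodings can't be read by Apache Impala or some older readers, so this is only safe when you control the reading side.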
