Thanks, Cheng. This is helpful.
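To make the "required vs. optional" part of my question concrete for the
archives: below is a minimal, untested sketch of the kind of nested layout
I mean, written in Parquet's own schema language via parquet-mr's
MessageTypeParser. The schema is the Document example from the Dremel
paper that the presentations keep reusing; the class name is mine.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DocumentSchemaSketch {
    public static void main(String[] args) {
        // The Document schema from the Dremel paper, in Parquet's schema
        // language. The repetition labels (required/optional/repeated)
        // express the nesting that definition and repetition levels encode.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message Document {\n"
          + "  required int64 DocId;\n"
          + "  optional group Links {\n"
          + "    repeated int64 Backward;\n"
          + "    repeated int64 Forward;\n"
          + "  }\n"
          + "  repeated group Name {\n"
          + "    repeated group Language {\n"
          + "      required binary Code (UTF8);\n"
          + "      optional binary Country (UTF8);\n"
          + "    }\n"
          + "    optional binary Url (UTF8);\n"
          + "  }\n"
          + "}");
        System.out.println(schema);
    }
}

As I understand it, those repetition labels are what drive the
definition/repetition levels produced during record shredding, so this
schema seems to be where the layout decisions actually get made.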
On Tuesday, September 8, 2015, Cheng Lian <[email protected]> wrote:

> The materials I found the most useful are those in parquet-format:
>
> - https://github.com/apache/parquet-format/blob/master/README.md
> - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> - https://github.com/apache/parquet-format/blob/master/Encodings.md
>
> Cheng
>
> On 9/8/15 11:43 PM, Edmon Begoli wrote:
>
>> and really - I am asking about the material so I can differentiate
>> between what is supported in Dremel vs. what is new or different in
>> Parquet. Lots of presentations I've seen talk about the Dremel
>> approach in the Parquet context, and they use the same document
>> example.
>>
>> If Parquet and Dremel (as published in VLDB 2010) are in sync when it
>> comes to the compression and representation, then that is sufficient.
>>
>> If they are not, I would like to know where I can find any material
>> that outlines the differences (presentations, readmes, source code,
>> etc.).
>>
>> I ask this because I am thinking of proposing a research effort for
>> representing research data formats, and I would like to understand
>> the state-of-the-art vs. the modifications that would have to be
>> performed as part of the research.
>>
>> I hope this makes sense.
>>
>> Thank you,
>> Edmon
>>
>> On Tue, Sep 8, 2015 at 11:08 AM, Edmon Begoli <[email protected]> wrote:
>>
>>> Understood.
>>>
>>> I would not be defining new types, but new standard nested
>>> structures, so for that I probably don't need to modify Parquet at
>>> all.
>>>
>>> For doing actual layout conversions and defining required vs.
>>> optional fields, etc., would you suggest Avro or Thrift as a good
>>> medium for this?
>>>
>>> Something like:
>>> https://github.com/adobe-research/spark-parquet-thrift-example
>>>
>>> On Tue, Sep 8, 2015 at 10:59 AM, Cheng Lian <[email protected]>
>>> wrote:
>>>
>>>> Parquet only provides a limited set of types as building blocks.
>>>> Although we can add more original types (also called converted
>>>> types in some contexts) to represent more application-level data
>>>> types, it is not open to extension for end users.
>>>>
>>>> Basically, you need to map your own application data types to and
>>>> from Parquet types and do the conversion at the application level.
>>>> One example is the user-defined types in Spark SQL: we first map
>>>> UDTs to basic Spark SQL data types, then convert Spark SQL data
>>>> types to Parquet types via a standard schema converter.
>>>>
>>>> Cheng
>>>>
>>>> On 9/7/15 10:26 PM, Edmon Begoli wrote:
>>>>
>>>>> Is there, or what is, the best learning resource that would help
>>>>> me understand how to canonically map currently unsupported,
>>>>> nested structured data formats into Parquet?
>>>>>
>>>>> Ideally, I would like to have access to something showing the
>>>>> process step by step, or giving enough background to explain how
>>>>> to do it.
>>>>>
>>>>> If no such thing exists, maybe you can point me to some basic
>>>>> examples that I could follow to learn the process.
>>>>>
>>>>> I will work to contribute back any tutorials and documentation I
>>>>> produce for my own and my team's use (as well as any produced
>>>>> code).
>>>>>
>>>>> Thank you,
>>>>> Edmon
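P.S. On my Avro-vs-Thrift question above, in case it helps anyone
searching the archives later: here is a rough, untested sketch of the
application-level mapping Cheng describes, using parquet-mr's
parquet-avro module. The record, field names, and output path are all
made up. Avro's way of marking a field optional (a union with null)
should come out as an optional Parquet field, which is exactly the
layout-conversion step I was asking about.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroToParquetSketch {
    public static void main(String[] args) throws Exception {
        // A made-up nested record: a person with an optional address.
        // The ["null", ...] union is Avro's "optional"; parquet-avro
        // maps it to an optional Parquet group.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
          + " {\"name\":\"id\",\"type\":\"long\"},"
          + " {\"name\":\"name\",\"type\":\"string\"},"
          + " {\"name\":\"address\",\"type\":[\"null\","
          + "  {\"type\":\"record\",\"name\":\"Address\",\"fields\":["
          + "   {\"name\":\"city\",\"type\":\"string\"},"
          + "   {\"name\":\"zip\",\"type\":\"string\"}]}],"
          + "  \"default\":null}"
          + "]}");

        GenericRecord person = new GenericData.Record(schema);
        person.put("id", 1L);
        person.put("name", "Ada");
        person.put("address", null); // optional field left empty

        // Write the record out as Parquet; the Avro schema is converted
        // to a Parquet schema by parquet-avro's schema converter.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("people.parquet"))
                     .withSchema(schema)
                     .build()) {
            writer.write(person);
        }
    }
}

If Thrift is the better fit for a given team, my understanding is that
the adobe-research example linked above shows the equivalent flow with
parquet-thrift: the schema definition language changes, but the
required/optional mapping works the same way.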
