Thanks, Cheng. This is helpful.
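To make the "required vs. optional" part of my question concrete for the
archives: below is a minimal, untested sketch of the kind of nested layout
I mean, written in Parquet's own schema language via parquet-mr's
MessageTypeParser. The schema is the Document example from the Dremel
paper that the presentations keep reusing; the class name is mine.

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DocumentSchemaSketch {
    public static void main(String[] args) {
        // The Document schema from the Dremel paper, in Parquet's schema
        // language. The repetition labels (required/optional/repeated)
        // express the nesting that definition and repetition levels encode.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message Document {\n"
          + "  required int64 DocId;\n"
          + "  optional group Links {\n"
          + "    repeated int64 Backward;\n"
          + "    repeated int64 Forward;\n"
          + "  }\n"
          + "  repeated group Name {\n"
          + "    repeated group Language {\n"
          + "      required binary Code (UTF8);\n"
          + "      optional binary Country (UTF8);\n"
          + "    }\n"
          + "    optional binary Url (UTF8);\n"
          + "  }\n"
          + "}");
        System.out.println(schema);
    }
}

As I understand it, those repetition labels are what drive the
definition/repetition levels produced during record shredding, so this
schema seems to be where the layout decisions actually get made.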
On Tuesday, September 8, 2015, Cheng Lian <[email protected]> wrote:

> The materials I found the most useful are those in parquet-format:
>
> - https://github.com/apache/parquet-format/blob/master/README.md
> - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> - https://github.com/apache/parquet-format/blob/master/Encodings.md
>
> Cheng
>
> On 9/8/15 11:43 PM, Edmon Begoli wrote:
>
>> and really - I am asking about the material so I can differentiate
>> between what is supported in Dremel vs. what is new or different in
>> Parquet. Lots of presentations I've seen talk about the Dremel
>> approach in the Parquet context, and they use the same document
>> example.
>>
>> If Parquet and Dremel (as published in VLDB 2010) are in sync when it
>> comes to the compression and representation, then that is sufficient.
>>
>> If they are not, I would like to know where I can find any material
>> that outlines the differences (presentations, readmes, source code,
>> etc.).
>>
>> I ask this because I am thinking of proposing a research effort for
>> representing research data formats, and I would like to understand
>> the state-of-the-art vs. the modifications that would have to be
>> performed as part of the research.
>>
>> I hope this makes sense.
>>
>> Thank you,
>> Edmon
>>
>> On Tue, Sep 8, 2015 at 11:08 AM, Edmon Begoli <[email protected]> wrote:
>>
>>> Understood.
>>>
>>> I would not be defining new types, but new standard nested
>>> structures, so for that I probably don't need to modify Parquet at
>>> all.
>>>
>>> For doing actual layout conversions and defining required vs.
>>> optional fields, etc., would you suggest Avro or Thrift as a good
>>> medium for this?
>>>
>>> Something like:
>>> https://github.com/adobe-research/spark-parquet-thrift-example
>>>
>>> On Tue, Sep 8, 2015 at 10:59 AM, Cheng Lian <[email protected]>
>>> wrote:
>>>
>>>> Parquet only provides a limited set of types as building blocks.
>>>> Although we can add more original types (also called converted
>>>> types in some contexts) to represent more application-level data
>>>> types, it is not open to extension for end users.
>>>>
>>>> Basically, you need to map your own application data types to and
>>>> from Parquet types and do the conversion at the application level.
>>>> One example is the user-defined types in Spark SQL: we first map
>>>> UDTs to basic Spark SQL data types, then convert Spark SQL data
>>>> types to Parquet types via a standard schema converter.
>>>>
>>>> Cheng
>>>>
>>>> On 9/7/15 10:26 PM, Edmon Begoli wrote:
>>>>
>>>>> Is there, or what is, the best learning resource that would help
>>>>> me understand how to canonically map currently unsupported,
>>>>> nested structured data formats into Parquet?
>>>>>
>>>>> Ideally, I would like to have access to something showing the
>>>>> process step by step, or giving enough background to explain how
>>>>> to do it.
>>>>>
>>>>> If no such thing exists, maybe you can point me to some basic
>>>>> examples that I could follow to learn the process.
>>>>>
>>>>> I will work to contribute back any tutorials and documentation I
>>>>> produce for my own and my team's use (as well as any produced
>>>>> code).
>>>>>
>>>>> Thank you,
>>>>> Edmon
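P.S. On my Avro-vs-Thrift question above, in case it helps anyone
searching the archives later: here is a rough, untested sketch of the
application-level mapping Cheng describes, using parquet-mr's
parquet-avro module. The record, field names, and output path are all
made up. Avro's way of marking a field optional (a union with null)
should come out as an optional Parquet field, which is exactly the
layout-conversion step I was asking about.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroToParquetSketch {
    public static void main(String[] args) throws Exception {
        // A made-up nested record: a person with an optional address.
        // The ["null", ...] union is Avro's "optional"; parquet-avro
        // maps it to an optional Parquet group.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Person\",\"fields\":["
          + " {\"name\":\"id\",\"type\":\"long\"},"
          + " {\"name\":\"name\",\"type\":\"string\"},"
          + " {\"name\":\"address\",\"type\":[\"null\","
          + "  {\"type\":\"record\",\"name\":\"Address\",\"fields\":["
          + "   {\"name\":\"city\",\"type\":\"string\"},"
          + "   {\"name\":\"zip\",\"type\":\"string\"}]}],"
          + "  \"default\":null}"
          + "]}");

        GenericRecord person = new GenericData.Record(schema);
        person.put("id", 1L);
        person.put("name", "Ada");
        person.put("address", null); // optional field left empty

        // Write the record out as Parquet; the Avro schema is converted
        // to a Parquet schema by parquet-avro's schema converter.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("people.parquet"))
                     .withSchema(schema)
                     .build()) {
            writer.write(person);
        }
    }
}

If Thrift is the better fit for a given team, my understanding is that
the adobe-research example linked above shows the equivalent flow with
parquet-thrift: the schema definition language changes, but the
required/optional mapping works the same way.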
