And really, I am asking about the material so I can differentiate between what is supported in Dremel vs. what is new or different in Parquet. Lots of presentations I've seen talk about the Dremel approach in the Parquet context, and they use the same Document example.
If Parquet and Dremel (as published in 2010 VLDB) are in sync when it comes to the compression and representation, then that is sufficient. If they are not, I would like to know where I can find any material that outlines the differences (presentations, readmes, source code, etc.). I ask this because I am thinking of proposing a research effort for representing research data formats, and I would like to understand the state of the art vs. the modifications that would have to be performed as part of the research. I hope this makes sense.

Thank you,
Edmon

On Tue, Sep 8, 2015 at 11:08 AM, Edmon Begoli <[email protected]> wrote:

> Understood.
>
> I would not be defining new types, but new standard nested structures, so
> for that I probably don't need to modify Parquet at all.
>
> For doing the actual layout conversions and the definition of required vs.
> optional fields, etc., would you suggest Avro or Thrift as a good medium
> to do this?
>
> Something like:
> https://github.com/adobe-research/spark-parquet-thrift-example
>
> On Tue, Sep 8, 2015 at 10:59 AM, Cheng Lian <[email protected]> wrote:
>
>> Parquet only provides a limited set of types as building blocks. Although
>> we can add more original types (also called converted types in some
>> contexts) to represent more application-level data types, it's not open
>> to extension for end users.
>>
>> Basically, you need to map your own application data types to and from
>> Parquet types and do the conversion at the application level. One example
>> is user-defined types (UDTs) in Spark SQL: we first map UDTs to basic
>> Spark SQL data types, then convert Spark SQL data types to Parquet types
>> via a standard schema converter.
>>
>> Cheng
>>
>> On 9/7/15 10:26 PM, Edmon Begoli wrote:
>>
>>> Is there, or what is, the best learning resource that would help me
>>> understand how to canonically map currently unsupported, nested,
>>> structured data formats into Parquet?
>>>
>>> Ideally, I would like access to something showing the process step by
>>> step, or giving enough background to explain how to do it.
>>>
>>> If no such thing exists, maybe you can point me to some basic examples
>>> that I could follow to learn the process.
>>>
>>> I will work to contribute back any tutorials and documentation I produce
>>> for my own and my team's use (as well as any code I produce).
>>>
>>> Thank you,
>>> Edmon
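
For concreteness, here is the Document example referenced above (from the 2010 VLDB Dremel paper) written in Parquet's own schema syntax and parsed with parquet-mr. This is a minimal sketch, assuming Scala 2.13 and parquet-column on the classpath; MessageTypeParser and ColumnDescriptor are actual parquet-mr APIs, while the object name and print format are made up for the example. It prints the Dremel-style repetition and definition levels that Parquet keeps per leaf column:

    // Minimal sketch: the Dremel paper's Document schema in Parquet's
    // schema syntax. Strings become annotated binary columns in Parquet.
    import org.apache.parquet.schema.MessageTypeParser
    import scala.jdk.CollectionConverters._

    object DremelDocumentExample {
      val schemaText: String =
        """message Document {
          |  required int64 DocId;
          |  optional group Links {
          |    repeated int64 Backward;
          |    repeated int64 Forward;
          |  }
          |  repeated group Name {
          |    repeated group Language {
          |      required binary Code (UTF8);
          |      optional binary Country (UTF8);
          |    }
          |    optional binary Url (UTF8);
          |  }
          |}""".stripMargin

      def main(args: Array[String]): Unit = {
        val schema = MessageTypeParser.parseMessageType(schemaText)
        // Parquet keeps Dremel-style repetition (r) and definition (d)
        // levels per leaf column; printing them shows the striping structure.
        for (col <- schema.getColumns.asScala) {
          println(f"${col.getPath.mkString(".")}%-25s " +
            f"r=${col.getMaxRepetitionLevel} d=${col.getMaxDefinitionLevel}")
        }
      }
    }

The printed levels (e.g., r=2, d=2 for Name.Language.Code) can be checked against the tables in the Dremel paper, which is one quick way to confirm that the two representations agree at the schema level.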
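
And to illustrate the application-level mapping Cheng describes (application types to Spark SQL types, then Spark SQL types to Parquet via the standard schema converter), here is a minimal, hypothetical sketch. The case classes, names, and output path are invented for the example, and it uses the current SparkSession API rather than the SQLContext API of 2015:

    // Minimal sketch: a nested application-level structure expressed with
    // Spark SQL's built-in types and written to Parquet. The case classes
    // are hypothetical stand-ins for a research data format.
    import org.apache.spark.sql.SparkSession

    case class Measurement(name: String, value: Double)
    case class Record(id: Long, tags: Seq[String],
                      measurements: Seq[Measurement])

    object NestedToParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("nested-to-parquet-sketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        val records = Seq(
          Record(1L, Seq("clinical", "phase-1"),
                 Seq(Measurement("temp", 21.5))),
          Record(2L, Seq.empty, Seq.empty)
        )

        // Spark infers a nested StructType (structs, arrays) from the case
        // classes -- the "basic Spark SQL data types" step ...
        val ds = records.toDS()
        ds.printSchema()

        // ... and Spark's standard schema converter maps that StructType
        // to a Parquet schema (groups, repeated fields) on write.
        ds.write.parquet("/tmp/records.parquet")

        spark.stop()
      }
    }

A UDT would add one more hop in front of this (the UDT's sqlType stands in for the case-class schema), but the Spark-SQL-to-Parquet conversion step is the same.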
