The materials I found most useful are those in the parquet-format repository:
- https://github.com/apache/parquet-format/blob/master/README.md
- https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
- https://github.com/apache/parquet-format/blob/master/Encodings.md
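For example, the canonical three-level LIST structure that LogicalTypes.md describes can be written out and parsed with parquet-mr's schema parser. The sketch below is only illustrative: it assumes parquet-mr 1.7+ (where the classes live under org.apache.parquet), and the message and field names are made up.

import org.apache.parquet.schema.MessageTypeParser

// Made-up schema showing the three-level LIST structure from LogicalTypes.md:
// an optional list of UTF8 strings alongside a required int64 field.
val schema = MessageTypeParser.parseMessageType(
  """message example {
    |  required int64 id;
    |  optional group tags (LIST) {
    |    repeated group list {
    |      optional binary element (UTF8);
    |    }
    |  }
    |}""".stripMargin)

println(schema)  // parquet-mr prints the parsed message type back in the same notation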
Cheng
On 9/8/15 11:43 PM, Edmon Begoli wrote:
And really, I am asking about the material so I can differentiate between
what is supported in Dremel vs. what is new or different in Parquet.
Lots of presentations I've seen talk about the Dremel approach in the Parquet
context, and they use the same document example.
If Parquet and Dremel (as published in the 2010 VLDB paper) are in sync when it
comes to compression and representation, then that is sufficient.
If they are not, I would like to know where I can find any material that
outlines the differences (presentations, READMEs, source code, etc.).
I ask this because I am thinking of proposing a research effort for
representing research data formats, and I would like to understand the
state of the art vs. the modifications that would have to be performed as
part of the research.
I hope this makes sense.
Thank you,
Edmon
On Tue, Sep 8, 2015 at 11:08 AM, Edmon Begoli <[email protected]> wrote:
Understood.
I would not be defining new types, but new standard nested structures, so
for that I probably don't need to modify Parquet at all.
For doing the actual layout conversions and defining required vs.
optional fields, etc., would you suggest Avro or Thrift as a good medium
for this?
Something like:
https://github.com/adobe-research/spark-parquet-thrift-example
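To make the question concrete, here is the kind of layout I mean, written directly as a Spark SQL schema rather than via Thrift or Avro. This is just a sketch: the field names and data are made up, and I'm assuming that the nullable flag on each StructField is what ends up as optional vs. required in the resulting Parquet schema.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

// Hypothetical nested layout: a required id plus an optional nested group.
val visitSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("visit", StructType(Seq(
    StructField("site", StringType, nullable = false),
    StructField("score", DoubleType, nullable = true)
  )), nullable = true)
))

def writeVisits(sqlContext: SQLContext, path: String): Unit = {
  val rows = sqlContext.sparkContext.parallelize(Seq(
    Row(1L, Row("site-a", 0.87)),
    Row(2L, null)  // optional nested group left empty
  ))
  sqlContext.createDataFrame(rows, visitSchema).write.parquet(path)
}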
On Tue, Sep 8, 2015 at 10:59 AM, Cheng Lian <[email protected]> wrote:
Parquet only provides a limited set of types as building blocks. Although
we can add more original types (also called converted types in some
contexts) to represent more application-level data types, the type system is
not open to extension by end users.
Basically, you need to map your own application data types to and from
Parquet types and do the conversion at the application level. One example is
the user-defined types (UDTs) in Spark SQL: we first map UDTs to basic
Spark SQL data types, then convert the Spark SQL data types to Parquet types
via a standard schema converter.
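As a rough illustration of that application-level mapping (the general pattern, not the actual Spark SQL UDT machinery), the sketch below flattens a made-up application type into basic Spark SQL types, writes it as Parquet, and rebuilds it on read. The Encounter type, field names, and helper functions are all hypothetical.

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

// A made-up application-level type that Parquet knows nothing about.
case class Encounter(patientId: Long, codes: Seq[String])

// Map the application type onto basic Spark SQL types; the standard
// schema converter then turns these into Parquet types.
val encounterSchema = StructType(Seq(
  StructField("patient_id", LongType, nullable = false),
  StructField("codes", ArrayType(StringType, containsNull = false), nullable = true)
))

// The conversion happens at the application level, in both directions.
def toRow(e: Encounter): Row = Row(e.patientId, e.codes)
def fromRow(r: Row): Encounter = Encounter(r.getLong(0), r.getSeq[String](1))

def writeEncounters(sqlContext: SQLContext, data: Seq[Encounter], path: String): Unit = {
  val rows = sqlContext.sparkContext.parallelize(data.map(toRow))
  sqlContext.createDataFrame(rows, encounterSchema).write.parquet(path)
}

def readEncounters(sqlContext: SQLContext, path: String): Seq[Encounter] =
  sqlContext.read.parquet(path).collect().map(fromRow).toSeq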
Cheng
On 9/7/15 10:26 PM, Edmon Begoli wrote:
Is there a learning resource, and if so which is the best one, that would help
me understand how to canonically map currently unsupported, nested
structured data formats into Parquet?
Ideally, I would like access to something that shows the process step by step,
or that gives enough background to explain how to do it.
If no such thing exists, maybe you can point me to some basic examples
that I could follow to learn the process.
I will work to contribute back any tutorials and documentation I produce
for my own and my team's use (as well as any code I produce).
Thank you,
Edmon