Hello,

We have a use case where we wish to store a JavaRDD<Map<String,String>> into Parquet. This JavaRDD<Map<String,String>> is produced by a map-reduce job. The problem is that the keys in Map<String,String> are not known beforehand, so it is not possible to define the schema for the data upfront. I know that we can define the schema in the program, but that is expensive, since the following steps need to be taken:
a) The JavaRDD<Map<String,String>> must be mapped to a JavaPairRDD<StructType,Map<String,String>> and persisted.
b) For each unique StructType generated in step a), the corresponding Map<String,String> records must be read back (from persistent storage) and written out in Parquet format.

Is it possible for Parquet to keep a mutable schema and update it based on each data record, so that when it is time to write the metadata to storage it writes the final, updated schema? Basically, I want Parquet to infer the schema from each data record rather than being provided with one upfront. A rough sketch of how I currently picture steps a) and b) is included below.

Thank you.
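For reference, here is a rough sketch of steps a) and b) as described above. It is only an illustration of the idea, not tested code: it assumes a Spark 2.x-style SparkSession and the Java API, treats every value as a nullable string column, and the class, method, and path names (PerSchemaParquetWriter, write, outputRoot) are placeholders made up for this mail.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Placeholder class name for this sketch.
public class PerSchemaParquetWriter {

    // Build a StructType from the keys of a single record: one nullable string
    // column per key. Keys are sorted so records with the same key set map to
    // equal schemas regardless of map iteration order.
    private static StructType schemaFor(Map<String, String> record) {
        List<StructField> fields = new ArrayList<>();
        for (String key : new TreeSet<>(record.keySet())) {
            fields.add(DataTypes.createStructField(key, DataTypes.StringType, true));
        }
        return DataTypes.createStructType(fields);
    }

    public static void write(SparkSession spark,
                             JavaRDD<Map<String, String>> rdd,
                             String outputRoot) {
        // Step a): pair every record with the schema derived from its own keys
        // and persist, because the pair RDD is re-scanned once per distinct schema.
        JavaPairRDD<StructType, Map<String, String>> bySchema = rdd
                .mapToPair(m -> new Tuple2<>(schemaFor(m), m))
                .persist(StorageLevel.MEMORY_AND_DISK());

        // Step b): for each distinct schema, pull out the matching records and
        // write them to their own Parquet directory.
        List<StructType> schemas = bySchema.keys().distinct().collect();
        int part = 0;
        for (StructType schema : schemas) {
            final StructType current = schema;
            JavaRDD<Row> rows = bySchema
                    .filter(t -> t._1().equals(current))
                    .map(t -> {
                        String[] fieldNames = current.fieldNames();
                        Object[] values = new Object[fieldNames.length];
                        for (int i = 0; i < fieldNames.length; i++) {
                            values[i] = t._2().get(fieldNames[i]); // null if absent
                        }
                        return RowFactory.create(values);
                    });
            spark.createDataFrame(rows, current)
                 .write()
                 .parquet(outputRoot + "/schema_" + part++);
        }
    }
}

As the sketch shows, the pair RDD has to be persisted and re-scanned once per distinct StructType, and a separate Parquet output is produced for each one, which is why this approach gets expensive when the key sets vary widely across records.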
