Hello,

We have a use case where we wish to store a JavaRDD<Map<String,String>> into Parquet. This JavaRDD<Map<String,String>> is produced by a map-reduce job. The problem is that the keys in Map<String,String> are not known beforehand, so it is not possible to define the schema for the data upfront. I know that we can define the schema in the program, but that is expensive, since the following steps need to be taken:
a) The JavaRDD<Map<String,String>> must be mapped to a JavaPairRDD<StructType,Map<String,String>> and persisted.
b) For each unique StructType generated in step a), the corresponding Map<String,String> records must be read back (from persistent storage) and written out in Parquet format.

Is it possible for Parquet to keep a mutable schema and update it based on each data record, so that when it is time to write the metadata to storage it writes the final, updated schema? Basically, I want Parquet to infer the schema from each data record rather than being provided with one upfront. A rough sketch of how I currently picture steps a) and b) is included below.

Thank you.
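For reference, here is a rough sketch of steps a) and b) as described above. It is only an illustration of the idea, not tested code: it assumes a Spark 2.x-style SparkSession and the Java API, treats every value as a nullable string column, and the class, method, and path names (PerSchemaParquetWriter, write, outputRoot) are placeholders made up for this mail.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Placeholder class name for this sketch.
public class PerSchemaParquetWriter {

    // Build a StructType from the keys of a single record: one nullable string
    // column per key. Keys are sorted so records with the same key set map to
    // equal schemas regardless of map iteration order.
    private static StructType schemaFor(Map<String, String> record) {
        List<StructField> fields = new ArrayList<>();
        for (String key : new TreeSet<>(record.keySet())) {
            fields.add(DataTypes.createStructField(key, DataTypes.StringType, true));
        }
        return DataTypes.createStructType(fields);
    }

    public static void write(SparkSession spark,
                             JavaRDD<Map<String, String>> rdd,
                             String outputRoot) {
        // Step a): pair every record with the schema derived from its own keys
        // and persist, because the pair RDD is re-scanned once per distinct schema.
        JavaPairRDD<StructType, Map<String, String>> bySchema = rdd
                .mapToPair(m -> new Tuple2<>(schemaFor(m), m))
                .persist(StorageLevel.MEMORY_AND_DISK());

        // Step b): for each distinct schema, pull out the matching records and
        // write them to their own Parquet directory.
        List<StructType> schemas = bySchema.keys().distinct().collect();
        int part = 0;
        for (StructType schema : schemas) {
            final StructType current = schema;
            JavaRDD<Row> rows = bySchema
                    .filter(t -> t._1().equals(current))
                    .map(t -> {
                        String[] fieldNames = current.fieldNames();
                        Object[] values = new Object[fieldNames.length];
                        for (int i = 0; i < fieldNames.length; i++) {
                            values[i] = t._2().get(fieldNames[i]); // null if absent
                        }
                        return RowFactory.create(values);
                    });
            spark.createDataFrame(rows, current)
                 .write()
                 .parquet(outputRoot + "/schema_" + part++);
        }
    }
}

As the sketch shows, the pair RDD has to be persisted and re-scanned once per distinct StructType, and a separate Parquet output is produced for each one, which is why this approach gets expensive when the key sets vary widely across records.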
