You can store arbitrary key/value pairs alongside the schema in the footer:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L565
struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Parquet schema for this file. This schema contains metadata for all the columns.
   * The schema is represented as a tree with a single root. The nodes of the tree
   * are flattened to a list by doing a depth-first traversal.
   * The column metadata contains the path in the schema for that column which can be
   * used to map columns to nodes in the schema.
   * The first element is the root **/
  2: required list<SchemaElement> schema;

  /** Number of rows in this file **/
  3: required i64 num_rows

  /** Row groups in this file **/
  4: required list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file. This should be in the format
   * <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by
}
You could make the key something like
"{some unique name prefix specific to you}.PII.columns" and the value a
comma-separated list of the PII column paths, e.g. a.b.c,d.e.f
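
As a rough illustration of that convention, here is a minimal Python sketch that builds and reads back such an entry. It only models the key/value pair itself as a plain dict; the "com.example" prefix and the helper names are made up for the example, and attaching the pair to a real file would go through whatever key/value metadata hook your Parquet writer exposes.

```python
# Hypothetical prefix: replace with something unique to your organization.
KEY_PREFIX = "com.example"
PII_KEY = KEY_PREFIX + ".PII.columns"


def encode_pii_columns(paths):
    """Join dotted column paths into one comma-separated metadata value."""
    for path in paths:
        if "," in path:
            # A comma inside a path would make the value ambiguous to split.
            raise ValueError("column path may not contain a comma: %r" % path)
    return ",".join(paths)


def decode_pii_columns(key_value_metadata):
    """Return the list of PII column paths, or [] if the key is absent."""
    value = key_value_metadata.get(PII_KEY, "")
    return [path for path in value.split(",") if path]


def is_pii(key_value_metadata, column_path):
    """Check whether a dotted column path is tagged as PII."""
    return column_path in decode_pii_columns(key_value_metadata)


# Stand-in for the footer's key_value_metadata list, modeled as a dict.
metadata = {PII_KEY: encode_pii_columns(["a.b.c", "d.e.f"])}
```

A reader would then call is_pii(metadata, "a.b.c") for each column path it sees in the schema. Keeping everything in one file-level entry means the column tags live apart from the schema elements, so the list has to be kept in sync if columns are renamed.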
> On Jun 30, 2016, at 10:44 AM, Mohammad Islam <[email protected]>
> wrote:
>
> Hi All,
> What is the best way of tagging any field schema with metadata? Does Parquet
> support it? I think Avro has "doc" attribute. Also Hive schema has "comments".
> I need to tag each field whether it is PII or not. I think someone may want
> to add description of a field as well.
> Regards,Mohammad
>