I don't think it's a good idea to use the metadata summary files or to merge footer information from multiple Parquet files. As you note, it isn't possible to merge user-defined metadata, merging schemas introduces significant problems (usually without any need to merge them at all), and keeping the summary files up to date is a manual step that is often overlooked. The right thing to do is to reconcile the expected metadata with each file's metadata individually, when that file is read. That distributes the work and avoids bottlenecks like the one addressed in PARQUET-139 [1].
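For example, here is a minimal sketch of that per-file reconciliation in pyarrow-flavored Python (the helper name and the null-filling policy are illustrative assumptions, not an existing API):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_with_expected_schema(path, expected_schema):
        # Read only this file's footer; no cross-file merge is needed.
        file_schema = pq.read_schema(path)

        # Project onto the expected columns this file actually has.
        present = [f.name for f in expected_schema
                   if f.name in file_schema.names]
        table = pq.read_table(path, columns=present)

        # Columns the file lacks come back as all-null, so every file
        # yields the same schema to the caller.
        for field in expected_schema:
            if field.name not in table.column_names:
                table = table.append_column(
                    field, pa.nulls(len(table), type=field.type))

        return table.select(expected_schema.names).cast(expected_schema)

Because each file is reconciled independently against the expected schema, appending files with evolved-but-compatible schemas never requires rewriting or merging any shared metadata.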
The only reason to merge file metadata is to infer an overall schema from a set of data files, which is not usually necessary because the schema for a table is tracked by the metastore. If you're not storing the table schema, then it's much better either to use no schema and return null when columns are missing, or to require an expected schema from the reader (e.g., from Avro specific or Thrift classes).

rb

[1]: https://issues.apache.org/jira/browse/PARQUET-139

On Thu, Jun 16, 2016 at 2:11 PM, Cheng Lian <[email protected]> wrote:

> One problem of Parquet user-defined key/value metadata is that, when
> merging footers of multiple Parquet files to generate the summary files,
> if two Parquet files have key/value entries with the same key but
> different values, Parquet doesn't know how to merge them, and simply
> throws an exception and gives up writing the summary file. If you're
> appending new data into an existing directory with old summary files,
> you may end up with stale summary files, since the old ones are not
> properly overwritten.
>
> This can be a problem in the case of schema evolution. For example, Spark
> SQL writes JSON-ized schema strings to Parquet files as key/value
> metadata. When appending new Parquet files into an existing directory
> containing existing files with different but compatible schemata,
> summary files can't be properly generated.
>
> But in practice this isn't a big problem, since Parquet summary files are
> not that important nowadays.
>
> Cheng
>
> On 6/16/16 1:19 PM, Wes McKinney wrote:
>
>> To add one bit of context, we're looking at the handling of integers
>> other than INT32 and INT64 from the perspective of Apache Arrow. It
>> seems that in Parquet 1 files, you may not be able to recover the
>> original integer types from the file alone. The question is, should we
>> put this metadata in the Parquet file? See
>>
>> https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67
>>
>> If it may cause problems, we can leave the physical storage type as is
>> and leave users to explicitly cast on deserialization to another
>> integer type.
>>
>> Thanks,
>> Wes
>>
>> On Thu, Jun 16, 2016 at 12:57 PM, Uwe Korn <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I'm currently looking at the differences between Parquet 1 and Parquet
>>> 2 to implement these versions as a switch in parquet-cpp. The only list
>>> I could find is the rather undetailed changelog [1]. Is there maybe
>>> some better list, or do I need to go through the referenced changeset
>>> entries myself to find the actual differences? (If the latter is the
>>> case, I'd also make a PR afterwards that augments the documentation
>>> with some "(since version 2.0)" markings.) But I'm hoping a bit that
>>> there is some blog post or so out there that could make my life easier.
>>>
>>> Thanks,
>>>
>>> Uwe
>>>
>>> [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md

--
Ryan Blue
Software Engineer
Netflix
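To make the integer-type point in Wes's message concrete: Parquet's only physical integer types are INT32, INT64, and INT96, so an original width like int8 survives in a file only as a type annotation in the footer. A rough pyarrow-style sketch (the file name is made up, and the printed values are indicative, not verbatim output):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # An int8 column is physically stored as INT32; the original width
    # is recoverable only from the footer's type annotation, if present.
    table = pa.table({"x": pa.array([1, 2, 3], type=pa.int8())})
    pq.write_table(table, "ints.parquet")

    meta = pq.ParquetFile("ints.parquet").metadata
    print(meta.row_group(0).column(0).physical_type)  # e.g. INT32
    print(meta.schema.column(0).logical_type)         # e.g. Int(bitWidth=8, ...)

A Parquet 1 file written without such an annotation offers nothing to recover the original width from, which is why the thread discusses where that metadata should live.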
