Good point about needing more documentation on the 2.0 spec. Right now, it's documented mostly in the code and in a table in the README [1]. But even that table is unclear because many of the additions we've made are forward-compatible and can be used in 1.0 files.
For example, the stats that were added are compatible with the 1.0 format because older readers will ignore the stats objects when reading. The "ConvertedType" annotations, like INT_8 or UINT_16, are similar. Thrift will ignore unknown enum values and the field is optional, so a UINT_16 looks like an un-annotated INT32 to older readers.

The addition or use of new logical type annotations is compatible with 1.0, and implementing read-side support should always be considered compatible with the format (though not necessarily with the API). The only features that aren't 1.0 compatible are those that cause a file to be unreadable by existing 1.0 readers, like the new delta page encodings and the addition of BROTLI to the compression enum [2].

rb

[1]: https://github.com/apache/parquet-mr#features
[2]: https://github.com/apache/parquet-format/pull/40

On Thu, Jun 16, 2016 at 1:19 PM, Wes McKinney <[email protected]> wrote:

> To add a one bit of context, we're looking at the handling of integers
> other than INT32 and INT64 from the perspective of Apache Arrow. It
> seems that in Parquet 1 files, you may not be able to recover the
> original integer types from the file alone. The question is, should we
> put this metadata in the Parquet file? See
>
> https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67
>
> If it may cause problems, we can leave the physical storage type as is
> and leave users to explicitly cast on deserialization to another
> integer type.
>
> Thanks,
> Wes
>
> On Thu, Jun 16, 2016 at 12:57 PM, Uwe Korn <[email protected]> wrote:
> > Hello,
> >
> > I'm currently looking at the differences between Parquet 1 and Parquet 2 to
> > implement these versions as a switch in parquet-cpp. The only list I could
> > find is the rather undetailed changelog [1]. Is there maybe some better list
> > or do I need to go through the referenced changesets entries myself to find
> > the actual differences? (If the latter is the case, I'd also make a PR
> > afterwards that augments the documentation with some "(since version 2.0)"
> > markings. But I'm hoping a bit that there is some blog post or so out there
> > that could make my life easier.
> >
> > Thanks,
> >
> > Uwe
> >
> > [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
> >
>

--
Ryan Blue
Software Engineer
Netflix
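
P.S. To make the compatibility distinction in my reply concrete, here is a rough Python sketch of how an older reader might behave. It is only a toy model, not the parquet-mr or parquet-cpp API: the names `resolve_type` and `check_codec` and the two "known" sets are hypothetical, and the set contents are an illustrative assumption, not an exact list of what 1.0 readers support. The point is that an optional annotation the reader doesn't recognize falls back to the physical type, while an unrecognized compression codec can't be skipped because it is needed to decode the pages.

```python
# Toy model of an older (1.0-era) reader. All names here are hypothetical;
# the sets below are assumed for illustration, not taken from any real reader.
OLD_READER_ANNOTATIONS = {"UTF8", "MAP", "LIST"}
OLD_READER_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO"}


def resolve_type(physical_type, converted_type=None,
                 known_annotations=OLD_READER_ANNOTATIONS):
    """Return the type the reader exposes for a column.

    The annotation is optional, so a missing or unrecognized value is
    dropped and the column is read as its plain physical type.
    """
    if converted_type in known_annotations:
        return converted_type
    return physical_type


def check_codec(codec, known_codecs=OLD_READER_CODECS):
    """Return the codec if the reader can decompress it, else fail.

    Unlike an annotation, the codec is required to read the data at all,
    so an unknown value makes the column unreadable.
    """
    if codec not in known_codecs:
        raise ValueError(f"unsupported compression codec: {codec}")
    return codec


if __name__ == "__main__":
    # A UINT_16 column looks like an un-annotated INT32 to the old reader.
    print(resolve_type("INT32", "UINT_16"))
    # A known annotation is still honored.
    print(resolve_type("BYTE_ARRAY", "UTF8"))
    # A BROTLI-compressed column cannot be read by the old reader.
    try:
        check_codec("BROTLI")
    except ValueError as err:
        print(err)
```

In this model, writing UINT_16 costs an older reader nothing (it just sees INT32 values), which is why I'd count it as 1.0 compatible; the BROTLI case is the kind of change that isn't.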
