Good point about needing more documentation on the 2.0 spec. Right now,
it's documented mostly in the code and in a table in the README [1]. But
even that table is unclear because many of the additions we've made are
forward-compatible and can be used in 1.0 files.

For example, the stats that were added are compatible with the 1.0 format
because older readers will ignore the stats objects when reading. The
"ConvertedType" annotations, like INT_8 or UINT_16, are similar. Thrift
will ignore unknown enum values, and the field is optional, so a UINT_16
column looks like an un-annotated INT32 to older readers. Adding or using
new logical type annotations is compatible with 1.0, and implementing
read-side support should always be considered compatible with the format
(though not necessarily with the API).
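
To make the annotation fallback concrete, here is a minimal sketch in
Python. It is not parquet-mr or parquet-cpp code: resolve_type and the two
"known annotation" sets are hypothetical names made up for illustration,
and the sets are deliberately tiny. It only models the behavior described
above, where an annotation the reader does not recognize is treated as if
it were absent.

```python
from typing import Optional, Set

# Hypothetical annotation sets, for illustration only; real readers know more.
KNOWN_BY_OLD_READER: Set[str] = {"UTF8", "DECIMAL"}
KNOWN_BY_NEW_READER: Set[str] = KNOWN_BY_OLD_READER | {"INT_8", "UINT_16"}


def resolve_type(physical: str, converted: Optional[str], known: Set[str]) -> str:
    """Return the type a reader would expose for a column.

    The annotation is optional, so a missing or unrecognized value simply
    falls back to the physical storage type.
    """
    if converted is None or converted not in known:
        return physical
    return converted


# The same column metadata, seen by an older and a newer reader.
old_view = resolve_type("INT32", "UINT_16", KNOWN_BY_OLD_READER)
new_view = resolve_type("INT32", "UINT_16", KNOWN_BY_NEW_READER)
print(old_view, new_view)
```

The older reader still reads every value; it just reports the column as
INT32 rather than as an unsigned 16-bit integer.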

The only features that aren't 1.0 compatible are those that cause a file to
be unreadable by existing 1.0 readers, like the new delta page encodings
and the addition of BROTLI to the compression enum [2].
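
By contrast, a rough sketch of why a new compression codec breaks older
readers: the reader cannot decompress the pages at all, so there is nothing
to fall back to. Again this is illustrative Python, not code from any
Parquet implementation; check_readable, UnsupportedCodecError, and the
codec set assumed for a 1.0-era reader are all assumptions of the sketch.

```python
from typing import Iterable, Set

# Assumed codec set for a 1.0-era reader; illustrative, not authoritative.
OLD_READER_CODECS: Set[str] = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO"}


class UnsupportedCodecError(Exception):
    """Raised when a column chunk uses a codec the reader cannot decode."""


def check_readable(column_codecs: Iterable[str], supported: Set[str]) -> bool:
    """Return True if every column chunk uses a supported codec.

    Unlike an optional annotation, the codec cannot be ignored: without it
    the page bytes cannot be decoded, so the reader has to fail.
    """
    for codec in column_codecs:
        if codec not in supported:
            raise UnsupportedCodecError(f"cannot read pages compressed with {codec}")
    return True


try:
    check_readable(["SNAPPY", "BROTLI"], OLD_READER_CODECS)
    readable = True
except UnsupportedCodecError:
    readable = False
print(readable)
```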

rb


[1]: https://github.com/apache/parquet-mr#features
[2]: https://github.com/apache/parquet-format/pull/40

On Thu, Jun 16, 2016 at 1:19 PM, Wes McKinney <[email protected]> wrote:

> To add one bit of context, we're looking at the handling of integers
> other than INT32 and INT64 from the perspective of Apache Arrow. It
> seems that in Parquet 1 files, you may not be able to recover the
> original integer types from the file alone. The question is, should we
> put this metadata in the Parquet file? See
>
> https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67
>
> If it may cause problems, we can leave the physical storage type as is
> and leave users to explicitly cast on deserialization to another
> integer type.
>
> Thanks,
> Wes
>
> On Thu, Jun 16, 2016 at 12:57 PM, Uwe Korn <[email protected]> wrote:
> > Hello,
> >
> > I'm currently looking at the differences between Parquet 1 and Parquet 2
> > to implement these versions as a switch in parquet-cpp. The only list I
> > could find is the rather undetailed changelog [1]. Is there maybe some
> > better list, or do I need to go through the referenced changeset entries
> > myself to find the actual differences? (If the latter is the case, I'd
> > also make a PR afterwards that augments the documentation with some
> > "(since version 2.0)" markings.) But I'm hoping a bit that there is some
> > blog post or so out there that could make my life easier.
> >
> > Thanks,
> >
> > Uwe
> >
> > [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
> >
>



-- 
Ryan Blue
Software Engineer
Netflix
