Re: Make the parquet schema consistent across all input formats

Nathan Howell Tue, 11 Aug 2015 08:03:42 -0700

On 8/11/15, 2:48 AM, "Damian Guy" <[email protected]> wrote:


>While it is possible for downstream users of parquet to handle the
>incompatibilities between the various different input formats it seems a
>shame that parquet doesn't produce a consistent schema across all formats.
>This would make working with parquet much simpler as the rules for
>converting to/from a list etc would always be the same. It would make for
>simpler, less error prone, code. Besides, i thought this was one of the
>reasons for using parquet...

After struggling with similar fixes for Pig and Hive last year, I tried a
different approach: make parquet-column handle all of the type coercions and
compatibility rules, instead of placing this burden on the parquet-<encoding>
libraries and the various applications. It already does a bit of this to handle
different repetition types (reading optional fields as a list, etc.) and
primitive widening (e.g. reading an int as a long). Similarly, a list of
non-nullable integers should be readable as a list of nullable integers,
so this should also be supported as a first-class concept in parquet-column.
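
To make the rules above concrete, here is a rough sketch of the kind of check I have in mind. It is a toy model in Python, not the parquet-column API: Field, can_read_as and the two lookup tables are hypothetical names invented for illustration, and allowing a required field to be read as a one-element list is my own assumption rather than something the patch necessarily does.

```python
from dataclasses import dataclass
from typing import Optional

# Toy model only: these names are hypothetical, not the parquet-column API.
REQUIRED, OPTIONAL, REPEATED = "required", "optional", "repeated"

# Repetitions a reader may request for a field written with a given repetition.
# Reading REQUIRED as REPEATED (a one-element list) is an assumption.
_REPETITION_OK = {
    REQUIRED: {REQUIRED, OPTIONAL, REPEATED},
    OPTIONAL: {OPTIONAL, REPEATED},
    REPEATED: {REPEATED},
}

# Primitive widening: an int32 column may be read as int64, never the reverse.
_PRIMITIVE_OK = {
    "int32": {"int32", "int64"},
    "int64": {"int64"},
}


@dataclass(frozen=True)
class Field:
    repetition: str
    primitive: Optional[str] = None    # set for leaf fields
    element: Optional["Field"] = None  # set for list fields


def can_read_as(written: Field, requested: Field) -> bool:
    """Return True if data written as `written` can be read as `requested`."""
    if requested.repetition not in _REPETITION_OK[written.repetition]:
        return False
    # Lists are compatible only with lists, and only if their elements are.
    if written.element is not None or requested.element is not None:
        if written.element is None or requested.element is None:
            return False
        return can_read_as(written.element, requested.element)
    # Unknown primitives are only compatible with themselves.
    allowed = _PRIMITIVE_OK.get(written.primitive, {written.primitive})
    return requested.primitive in allowed


# A list of non-nullable ints, read as a list of nullable ints.
written = Field(REQUIRED, element=Field(REQUIRED, "int32"))
requested = Field(REQUIRED, element=Field(OPTIONAL, "int32"))
print(can_read_as(written, requested))
```

With one function like this living next to the schema types, each parquet-<encoding> library would ask the same question instead of re-implementing its own conversion rules.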

The patch linked below is old and incomplete, but it demonstrates that
incorporating such a change into Pig added about 4 new lines of code and
deleted 100. The Hive fix was similar.

Entire patch: 
https://github.com/NathanHowell/incubator-parquet-mr/commit/3e74b52a7cdade3d38b5829e47fd55315bbb02f9

Pig modification: 
https://github.com/NathanHowell/incubator-parquet-mr/commit/3e74b52a7cdade3d38b5829e47fd55315bbb02f9#diff-357f986dd06a50315198f8fd08bb81b6R139


-n
