On 8/11/15, 2:48 AM, "Damian Guy" <[email protected]> wrote:
>While it is possible for downstream users of parquet to handle the
>incompatibilities between the various different input formats it seems a
>shame that parquet doesn't produce a consistent schema across all formats.
>This would make working with parquet much simpler as the rules for
>converting to/from a list etc would always be the same. It would make for
>simpler, less error prone, code. Besides, i thought this was one of the
>reasons for using parquet...

After struggling with similar fixes for Pig and Hive last year, I tried a
different approach: make parquet-column handle all of the type coercions and
compatibility rules, instead of placing this burden on the parquet-<encoding>
libraries and various applications. It already does a bit of this to handle
different repetition levels (reading optional fields as a list, etc.) and
primitive widening (e.g. reading an int as a long). Similarly, a list of
non-nullable integers should be readable as a list of nullable integers, so
this should also be supported as a first-class concept in parquet-column.

The patch linked below is old and incomplete, but it demonstrates that
incorporating such a change into Pig added about 4 new lines of code and
deleted 100. The Hive fix was similar.

Entire patch:
https://github.com/NathanHowell/incubator-parquet-mr/commit/3e74b52a7cdade3d38b5829e47fd55315bbb02f9

Pig modification:
https://github.com/NathanHowell/incubator-parquet-mr/commit/3e74b52a7cdade3d38b5829e47fd55315bbb02f9#diff-357f986dd06a50315198f8fd08bb81b6R139

-n
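
To make the compatibility rules mentioned above concrete, here is a minimal
Python sketch of the kind of check involved. It is not the parquet-column API
and not the code from the patch: the names Primitive, ListOf, WIDENING and
can_read_as are hypothetical, and the rule set is only an assumption covering
the three cases named in this mail (optional field read as a list, int read
as long, non-nullable list elements read as nullable).

```python
from dataclasses import dataclass, replace
from typing import Union

# Hypothetical type model for illustration only; not parquet-column classes.
@dataclass(frozen=True)
class Primitive:
    name: str        # e.g. "int32", "int64"
    nullable: bool

@dataclass(frozen=True)
class ListOf:
    element: Union["Primitive", "ListOf"]
    nullable: bool

# Assumed widening table: written type -> types it may be read as.
WIDENING = {"int32": {"int32", "int64"}}

def can_read_as(written, requested):
    """Return True if data written as `written` can be read as `requested`."""
    # A single (possibly null) value can be read as a list of 0 or 1 elements,
    # so the written value's own nullability no longer matters.
    if isinstance(written, Primitive) and isinstance(requested, ListOf):
        return can_read_as(replace(written, nullable=False), requested.element)
    # Nullable data cannot be read as non-nullable; the reverse is fine.
    if written.nullable and not requested.nullable:
        return False
    if isinstance(written, Primitive) and isinstance(requested, Primitive):
        return requested.name in WIDENING.get(written.name, {written.name})
    if isinstance(written, ListOf) and isinstance(requested, ListOf):
        return can_read_as(written.element, requested.element)
    return False

required_ints = ListOf(Primitive("int32", nullable=False), nullable=False)
nullable_ints = ListOf(Primitive("int32", nullable=True), nullable=False)
print(can_read_as(required_ints, nullable_ints))
```

With one function like this living in a single place, each integration (Pig,
Hive, and so on) would only need to ask whether a requested schema is
compatible, rather than reimplementing the rules itself.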
