I think the best solution, if it's supported by the tools you're using, is generally to do schema evolution by *not* rewriting the files and only updating the table metadata, and to rely on the engine querying the table to promote the int32 to int64 when the Parquet file has an int32 column but the Hive schema declares an int64.
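
As a rough sketch of what that read-side promotion amounts to - using PyArrow as a stand-in for the engine's Parquet reader, with a made-up file name and a single made-up "userid" column:

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical partition file whose physical Parquet schema still says int32.
table = pq.read_table("part-00000.parquet")

# The table-level (Hive) schema declares bigint, so the reader widens the
# column at read time instead of touching the file on disk.
declared = pa.schema([("userid", pa.int64())])
print(table.cast(declared).schema)  # userid: int64

The file itself stays byte-for-byte the same; only the reader's view of it changes.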
That support has been added in Impala and Hive, for example: https://issues.apache.org/jira/browse/HIVE-12080, https://issues.apache.org/jira/browse/IMPALA-6373. I'm not sure about other engines. Generally, Parquet is not designed to support modifying files in place - if you want to change a file's schema, you would regenerate the file.

On Tue, Jul 16, 2019 at 8:38 PM Ronnie Huang <[email protected]> wrote:
> Hi Parquet Devs,
>
> Our team is working on changing userid from int to bigint across our whole
> Hadoop system. It's easy for us to quickly refresh non-partitioned tables;
> however, many partitioned tables have huge partition files. We are trying
> to find a quick way to change the data type without refreshing partitions
> one by one. That's why I'm sending you this email.
>
> I took a look at your website https://github.com/apache/parquet-format to
> understand the Parquet format, but I'm still confused about the metadata,
> so I've listed the following questions:
>
> 1. If I want to change one column's type, I need to change it in the file
> metadata and the column (chunk) metadata - am I right, or am I missing
> anything?
> 2. If I change one column's type from int32 to int64 directly in the file
> metadata and column (chunk) metadata, can the compressed data still be
> read correctly? If not, what's the problem?
>
> Thank you so much for your time, and we would appreciate it if you could
> reply.
>
> Best Regards,
> Ronnie
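
P.S. on question 2 above: the physical type is recorded per column chunk and also determines how the data pages were encoded (int32 values are stored as 4-byte values), so flipping the declared type in the footer would not make the existing pages readable as int64. A small PyArrow sketch - the file names and data below are made up - showing where the physical type lives and what regenerating a file looks like:

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for an existing partition file written with int32.
pq.write_table(
    pa.table({"userid": pa.array([1, 2, 3], type=pa.int32())}),
    "old-partition.parquet",
)

# The physical type sits in each column chunk's metadata and matches how
# the pages were encoded, so it can't simply be edited after the fact.
meta = pq.ParquetFile("old-partition.parquet").metadata
print(meta.row_group(0).column(0).physical_type)  # INT32

# Regenerating the file is the way to actually change the stored type.
widened = pq.read_table("old-partition.parquet").cast(
    pa.schema([("userid", pa.int64())])
)
pq.write_table(widened, "new-partition.parquet")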
