Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing
column names are never reused with incompatible types), this is something
I'd love to support.  Perhaps you can describe in more detail what sorts of
changes you are making, and whether simple merging of the schemas would be
sufficient.
If so, we can open a JIRA, though I'm not sure when we'll have resources to
dedicate to this.

In the near term, I'd suggest writing converters for each version of the
schema that translate to some desired master schema.  You can then union
all of these together and avoid the cost of batch conversion.  It seems
like in most cases this should be pretty efficient, at least now that we
have good pushdown past union operators :)
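
For concreteness, here's a rough sketch of what that might look like with the
1.1 SchemaRDD API (untested; the paths, table names, and columns below are
just placeholders, and it assumes the spark-shell sc):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical paths for two on-disk schema versions.
val v1 = sqlContext.parquetFile("/data/events_v1")
val v2 = sqlContext.parquetFile("/data/events_v2")
v1.registerTempTable("events_v1")
v2.registerTempTable("events_v2")

// Per-version converters that project each layout into the master schema
// (here: rename user_name to name, and fill in a missing email column).
val v1AsMaster = sqlContext.sql(
  "SELECT id, user_name AS name, CAST(null AS STRING) AS email FROM events_v1")
val v2AsMaster = sqlContext.sql(
  "SELECT id, name, email FROM events_v2")

// Union the converted versions and query the result as a single table.
val events = v1AsMaster.unionAll(v2AsMaster)
events.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events WHERE name = 'foo'").collect()

Filters on the unioned table should still get pushed down into the individual
parquet scans, so each query only reads what it needs rather than paying for
a full conversion pass up front.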

Michael

On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Cody,
>
> I wasn't aware there were different versions of the parquet format.  What's
> the difference between "raw parquet" and the Hive-written parquet files?
>
> As for your migration question, the approaches I've often seen are
> convert-on-read and convert-all-at-once.  Apache Cassandra for example does
> both -- when upgrading between Cassandra versions that change the on-disk
> sstable format, it will do a convert-on-read as you access the sstables, or
> you can run the upgradesstables command to convert them all at once
> post-upgrade.
>
> Andrew
>
> On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
> > Wondering if anyone has thoughts on a path forward for parquet schema
> > migrations, especially for people (like us) that are using raw parquet
> > files rather than Hive.
> >
> > So far we've gotten away with reading old files, converting, and writing to
> > new directories, but that obviously becomes problematic above a certain
> > data size.
> >
>
