Hi Lloyd,
For both Parquet and Avro, a file's schema is set when you write it and
can't change. Avro supports re-opening and appending records to data
files, but Parquet doesn't because its metadata is stored in the file
footer. Appending to Parquet isn't really what the format is intended
for (I can provide more context if you're interested in why).
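To see why appending is awkward, here is a toy sketch of a Parquet-style file layout in Python. This is not real Parquet encoding (the actual footer is Thrift-encoded metadata); it only illustrates the relevant structural point: the metadata lives at the end of the file, followed by its length and a magic number, so readers seek to the tail first.

```python
# Toy illustration of a Parquet-style layout (NOT real Parquet encoding):
# data pages come first, then a footer holding the schema/metadata,
# then the footer length and a magic number at the very end.
import struct

MAGIC = b"PAR1"

def write_file(rows: list, schema: bytes) -> bytes:
    body = MAGIC + b"".join(rows)   # header magic, then row data
    footer = schema                 # metadata lives at the END
    return body + footer + struct.pack("<I", len(footer)) + MAGIC

def read_schema(buf: bytes) -> bytes:
    assert buf[-4:] == MAGIC        # readers check the tail first
    (flen,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - flen:-8]

f = write_file([b"row1", b"row2"], b"schema-v1")
# Appending raw rows after the footer would leave the tail pointing at
# the wrong bytes -- the footer would have to be rewritten, which is
# why append-in-place isn't something the format is designed for.
```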
Schema evolution was designed around the idea of writing multiple files
over time. As your schema changes, newer files have schemas that have
been updated but are still compatible with the existing data. That way,
files don't have to be changed or rewritten. This approach hasn't been a
problem in the past, so I'm curious about your use case. Why are you
trying to build your application using a single file instead of a
directory (or directory structure) of data files? Maybe if we understood
more about what you're trying to build, we could help.
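As a concrete illustration of the compatible-evolution idea, here is a small Python sketch (plain dicts, not the Avro library) mimicking the resolution rule that makes it work: a field added to the schema later must carry a default, so records in older files can still be read under the new schema.

```python
# Toy schema resolution (mimics Avro's rule for compatible evolution;
# this is an illustration, not the Avro API). The "email" field was
# added after the first files were written, with a default of null.
NEW_SCHEMA = {
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ]
}

def resolve(record: dict, schema: dict) -> dict:
    """Read a record written with an older schema under the new one."""
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]  # fill missing field
        else:
            raise ValueError("no value or default for " + field["name"])
    return out

old_file_record = {"id": 1, "name": "lloyd"}  # written before "email" existed
print(resolve(old_file_record, NEW_SCHEMA))
```

Because the old files resolve cleanly against the new schema, you never have to rewrite them; new files are simply written with the newer schema.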
Thanks,
rb
On 01/18/2016 10:42 PM, Lloyd Haris wrote:
Hi,
Apologies if this has been asked before, and I hope this is the right
mailing list for this question.
I've been trying to write a Parquet file using Avro, as per the Hadoop:
The Definitive Guide book, and it's working okay. I have written my application
in Java and the file is saved on HDFS.
What I really want to do is play with and learn how schema evolution works, and I
am evaluating whether we can do the following with Avro and Parquet.
I want to have a single Parquet file and first write a bunch of records to
it. Then when I receive more data, I hope to append those records to the
same file. First, I don't know if this is possible.
The second thing is that we know our schema will evolve. For example, we might
add new fields to the schema and I am wondering whether it's possible to
add new records with the new schema to the same file, which was originally
written with the old schema. What we basically want is to keep "the file" as a
database.
Can somebody please tell me if this is doable? If so, could you also give
me some code samples? I couldn't find any example code that appends new
records to an existing Parquet file using Avro, or any examples of how to
change the schema and write new records based on the new schema to that
file.
Thanks
Lloyd
--
Ryan Blue
Software Engineer
Cloudera, Inc.