Hi Lloyd,
For both Parquet and Avro, a file's schema is set when you write it and
can't change. Avro supports re-opening and appending records to data
files, but Parquet doesn't because its metadata is stored in the file
footer. Appending to Parquet isn't really what the format is intended
for (I can provide more context if you're interested in why).
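To see why appending is awkward, here is a toy sketch of a Parquet-style file layout in Python. This is not real Parquet encoding (the actual footer is Thrift-encoded metadata); it only illustrates the relevant structural point: the metadata lives at the end of the file, followed by its length and a magic number, so readers seek to the tail first.

```python
# Toy illustration of a Parquet-style layout (NOT real Parquet encoding):
# data pages come first, then a footer holding the schema/metadata,
# then the footer length and a magic number at the very end.
import struct

MAGIC = b"PAR1"

def write_file(rows: list, schema: bytes) -> bytes:
    body = MAGIC + b"".join(rows)   # header magic, then row data
    footer = schema                 # metadata lives at the END
    return body + footer + struct.pack("<I", len(footer)) + MAGIC

def read_schema(buf: bytes) -> bytes:
    assert buf[-4:] == MAGIC        # readers check the tail first
    (flen,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - flen:-8]

f = write_file([b"row1", b"row2"], b"schema-v1")
# Appending raw rows after the footer would leave the tail pointing at
# the wrong bytes -- the footer would have to be rewritten, which is
# why append-in-place isn't something the format is designed for.
```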
Schema evolution was designed around the idea of writing multiple files
over time. As your schema changes, newer files have schemas that have
been updated but are still compatible with the existing data. That way,
files don't have to be changed or rewritten. This approach hasn't been a
problem in the past, so I'm curious about your use case. Why are you
trying to build your application using a single file instead of a
directory (or directory structure) of data files? Maybe if we understood
more about what you're trying to build, we could help.
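As a concrete illustration of the compatible-evolution idea, here is a small Python sketch (plain dicts, not the Avro library) mimicking the resolution rule that makes it work: a field added to the schema later must carry a default, so records in older files can still be read under the new schema.

```python
# Toy schema resolution (mimics Avro's rule for compatible evolution;
# this is an illustration, not the Avro API). The "email" field was
# added after the first files were written, with a default of null.
NEW_SCHEMA = {
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ]
}

def resolve(record: dict, schema: dict) -> dict:
    """Read a record written with an older schema under the new one."""
    out = {}
    for field in schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]  # fill missing field
        else:
            raise ValueError("no value or default for " + field["name"])
    return out

old_file_record = {"id": 1, "name": "lloyd"}  # written before "email" existed
print(resolve(old_file_record, NEW_SCHEMA))
```

Because the old files resolve cleanly against the new schema, you never have to rewrite them; new files are simply written with the newer schema.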
Thanks,
rb
On 01/18/2016 10:42 PM, Lloyd Haris wrote:
Hi,
Apologies if this has been asked before, and I hope this is the right
mailing list for this question.
I've been trying to write a Parquet file using Avro, as per the Hadoop:
The Definitive Guide book, and it's working okay. I have written my application
in Java and the file is saved on HDFS.
What I really want to do is play with and learn how schema evolution works, and I
am evaluating whether we can do the following with Avro and Parquet.
I want to have a single Parquet file and first write a bunch of records to
it. Then when I receive more data, I hope to append those records to the
same file. First, I don't know if this is possible.
The second thing is that we know our schema will evolve. For example, we might
add new fields to the schema and I am wondering whether it's possible to
add new records with the new schema to the same file, which was originally
written with the old schema. What we basically want is to keep "the file" as a
database.
Can somebody please tell me if this is doable? If so, could you also give
me some code samples? I couldn't find any example code that appends new
records to an existing Parquet file using Avro, or any examples of how to
change the schema and write new records based on the new schema to that
file.
Thanks
Lloyd
--
Ryan Blue
Software Engineer
Cloudera, Inc.