Antonios,

Sounds like there are a few situations you're trying to address. For
aliasing columns without changing the underlying data, you should be able
to use the current Avro support with a different Avro read schema. That
will read the written column name and use the new one when constructing
records. This also works for adding new columns with a default value for
files that are already written (parquet-avro will fill in the default if it
is missing).
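As a sketch (the record name and the "discount" field are hypothetical; "columnX" and "price" come from the thread below), an Avro read schema that renames a written column and supplies a default for a missing one might look like:

```
{
  "type": "record",
  "name": "Product",
  "fields": [
    {"name": "price", "type": "double", "aliases": ["columnX"]},
    {"name": "discount", "type": "double", "default": 0.0}
  ]
}
```

When this is supplied as the read schema, Avro's schema resolution maps the written "columnX" to "price", and parquet-avro fills in 0.0 for "discount" when reading files written before that column existed.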

For adding a new data column to a file, there are a couple of possible ways to do it
quickly. First, the ParquetFileWriter has a way to add already encoded row
groups, which you could modify to allow you to add a new column at the same
time. The other option is to write a file with just one column and point
the metadata for the other columns at the original file. That's supported
in the parquet-format spec, but probably not implemented. Both of these
options are risky because you have to align a new column of data and
mistakes will cause records to be mixed together.
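For reference, the second option relies on the ColumnChunk metadata in the footer, which is allowed to point at a different file. Paraphrased from parquet-format's parquet.thrift (comments abbreviated; check the spec for the exact definition):

```
struct ColumnChunk {
  /** File where the column data is stored; if unset, the data is in
      the same file as the metadata (path is relative to this file). */
  1: optional string file_path
  /** Byte offset of the column chunk in that file */
  2: required i64 file_offset
  ...
}
```

So in principle a new file could contain only the new column's data, with the footer's column chunks for the old columns pointing back at the original file. But as noted above, while this is in the spec, readers generally don't implement it.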

The last case it sounds like you're interested in is appending to a Parquet
file, which I don't recommend. You can't add to existing row groups, so
you'd only be able to add row groups at the end (not more columns) and
that's not worth it when you can just write a new file.
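The reason appending is awkward follows from the file layout in the parquet-format spec: the footer sits at the end of the file, located by the 4-byte footer length that precedes the trailing magic:

```
PAR1  <row group 1> ... <row group N>  <footer (FileMetaData)>  <footer length: 4 bytes>  PAR1
```

To add row group N+1 you would have to seek back over the existing footer, write the new row group in its place, and then write a new footer describing all N+1 row groups, so the cost ends up comparable to just rewriting the tail of the file.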

I hope that helps,

rb

On Wed, Mar 30, 2016 at 12:32 PM, Antonios <[email protected]> wrote:

> There are some challenging tasks, e.g. reading 20 GB of Parquet files
> from HDFS, generating some data, and storing them back.
>
> So in effect, imagine 20 GB --processing--> 20,01 GB
>
> This takes time (with Spark/Parquet) - and we would all be happy if we
> could achieve that in a matter of a few seconds, i.e. 3-10 seconds. People
> from the RDBMS world are capable of adding new columns in such timeframes...
>
>
> This is kind of impossible - unless we think out of the box
>
> The original 20 GB exist in 150 blocks.
>
> Given the parquet-format#file-format
> <https://github.com/Parquet/parquet-format#file-format> spec, we could
> modify only the last block of the 150 existing ones, and either:
>
>
>    - Recreate it by adding the additional columns + metadata/footer with
>    the new column info / thrift | avro format
>
>
>    - Append to the existing file - new columns of data and new
>    metadata/footer and have parquet readers reading always the latest
> metadata
>    / footer info ( while column pointers just skip the old metadata )
>
> I'd like to discuss around the idea of adding new columns of data in
> parquet files.
>
>
> A capability like that could be used to our advantage in an RDBMS
> environment by directly manipulating Hive metadata and/or using Avro
> aliases:
>
>   _price_ being an alias to  columnX  then
>   _price_ being an alias to  columnY
>
>
> What does the community think?
> Antonios
>



-- 
Ryan Blue
Software Engineer
Netflix
