I feel like this problem might be better solved with software on top of Parquet rather than in the format itself. I assume that when you add columns to an existing dataset, all of the rows that have already been written will just have the new column filled with nulls. Is that the behavior you are trying to achieve? One part of your message makes me think this might not be what you want, specifically where you said "Append to the existing file - new columns of data and new metadata". Are you instead trying to join a new column of data onto an existing table?
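For illustration, here is a minimal sketch of the null-fill behavior described above, using pandas as a stand-in for a Parquet reader/writer (the column names are hypothetical, and this is not Parquet-specific):

```python
import pandas as pd

# Existing dataset, written before the new column existed.
old = pd.DataFrame({"id": [1, 2], "price": [9.99, 4.50]})

# Newly appended rows carry an extra "discount" column.
new = pd.DataFrame({"id": [3], "price": [2.25], "discount": [0.1]})

# Concatenating takes the union of the two schemas; the rows that
# were already written get the new column filled with nulls (NaN).
combined = pd.concat([old, new], ignore_index=True)
print(combined)
```

Rows 0 and 1 end up with a null `discount`, which is the "add a column, backfill with nulls" semantics I am asking about.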
Unfortunately, in either case I believe you will need to modify all of the existing blocks to add a column, because the row groups are designed to fully encapsulate a list of records across all columns.

Question for the Parquet devs: is it the case that a column chunk can be completely excluded if all of its values are null? Or must there be at least one page with RLE definition and repetition levels? If we could just add metadata without needing to add a column chunk, this would be feasible.

Overall, it seems to me that this should be solved by something which could provide a union of the two schemas and prompt a reader of the file to materialize nulls into columns that are not found in a particular file. If you are trying to join a new column of non-null data onto an existing dataset, I don't think this is likely to be possible without a full rewrite.

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Wed, Mar 30, 2016 at 12:32 PM, Antonios <[email protected]> wrote:

> There are some challenging tasks, e.g. reading 20 GBytes of Parquet files
> from HDFS, generating some data, and storing them back.
>
> So in effect, imagine 20 GB --processing--> 20,01 GB.
>
> This takes time (with Spark/Parquet) - and we would all be happy if we
> could achieve it in a matter of a few seconds, i.e. 3-10 seconds. People
> from the RDBMS world are capable of adding new columns in such timings...
>
> This is kind of impossible - unless we think out of the box.
>
> The original 20 GB exist in 150 blocks.
>
> Given the parquet-format#file-format
> <https://github.com/Parquet/parquet-format#file-format>, we could modify
> only the last block of the 150 existing ones, and either:
>
>    - Recreate it by adding the additional columns + metadata/footer with
>    the new column info (Thrift/Avro format)
>
>    - Append to the existing file - new columns of data and a new
>    metadata/footer - and have Parquet readers always read the latest
>    metadata/footer info (while column pointers just skip the old metadata)
>
> I'd like to discuss the idea of adding new columns of data to existing
> Parquet files.
>
> A capability like that could be used to our advantage in an RDBMS
> environment by directly manipulating Hive metadata and/or using Avro
> aliases:
>
> _price_ being an alias to columnX, then
> _price_ being an alias to columnY
>
> What does the community think?
> Antonios
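The reader-side schema union Jason describes (rather than rewriting the files) could be sketched roughly like this, again using pandas as a hypothetical stand-in for a reader that sees per-file schemas; the file contents and column names are invented for illustration:

```python
import pandas as pd

# Two "files" of the same logical dataset, written with different schemas.
file_a = pd.DataFrame({"id": [1, 2], "price": [9.99, 4.50]})
file_b = pd.DataFrame({"id": [3], "price": [2.25], "discount": [0.1]})

# Union of the two schemas, preserving first-appearance order.
union_cols = list(dict.fromkeys(list(file_a.columns) + list(file_b.columns)))

# The reader materializes nulls into columns not found in a particular
# file, then presents the files as one table under the union schema.
frames = [f.reindex(columns=union_cols) for f in (file_a, file_b)]
table = pd.concat(frames, ignore_index=True)
print(table)
```

The point of the sketch is that no existing file is rewritten: the union schema lives above the files, and null materialization happens at read time.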
