I feel like this problem might be better solved with software on top of Parquet rather than in the format itself. I assume that when you add columns to an existing dataset, all of the rows that have already been written will just have the new column filled with nulls. Is that the behavior you are trying to achieve? One part of your message makes me think this might not be what you want, specifically where you said "Append to the existing file - new columns of data and new metadata". Are you instead trying to join a new column of data onto an existing table?
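For illustration, here is a minimal sketch of the null-fill behavior described above, using pandas as a stand-in for a Parquet reader/writer (the column names are hypothetical, and this is not Parquet-specific):

```python
import pandas as pd

# Existing dataset, written before the new column existed.
old = pd.DataFrame({"id": [1, 2], "price": [9.99, 4.50]})

# Newly appended rows carry an extra "discount" column.
new = pd.DataFrame({"id": [3], "price": [2.25], "discount": [0.1]})

# Concatenating takes the union of the two schemas; the rows that
# were already written get the new column filled with nulls (NaN).
combined = pd.concat([old, new], ignore_index=True)
print(combined)
```

Rows 0 and 1 end up with a null `discount`, which is the "add a column, backfill with nulls" semantics I am asking about.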
Unfortunately, in either case I believe you will need to modify all of the existing blocks to add a column, because the row groups are designed to fully encapsulate a list of records across all columns.

Question for the Parquet devs: is it the case that a column chunk can be completely excluded if all of its values are null? Or must there be at least one page with RLE definition and repetition levels? If we could just add metadata without needing to add a column chunk, this would be feasible.

Overall, it seems to me that this should be solved by something which could provide a union of the two schemas and prompt a reader of the file to materialize nulls into columns that are not found in a particular file. If you are trying to join a new column of non-null data onto an existing dataset, I don't think this is likely to be possible without a full rewrite.

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Wed, Mar 30, 2016 at 12:32 PM, Antonios <[email protected]> wrote:

> There are some challenging tasks, e.g. reading 20 GBytes of Parquet files
> from HDFS, generating some data, and storing them back.
>
> So in effect, imagine 20 GB --processing--> 20,01 GB.
>
> This takes time (with Spark/Parquet) - and we would all be happy if we
> could achieve it in a matter of a few seconds, i.e. 3-10 seconds. People
> from the RDBMS world are capable of adding new columns in such timings...
>
> This is kind of impossible - unless we think out of the box.
>
> The original 20 GB exist in 150 blocks.
>
> Given the parquet-format#file-format
> <https://github.com/Parquet/parquet-format#file-format>, we could modify
> only the last block of the 150 existing ones, and either:
>
>    - Recreate it by adding the additional columns + metadata/footer with
>    the new column info (Thrift/Avro format)
>
>    - Append to the existing file - new columns of data and a new
>    metadata/footer - and have Parquet readers always read the latest
>    metadata/footer info (while column pointers just skip the old metadata)
>
> I'd like to discuss the idea of adding new columns of data to existing
> Parquet files.
>
> A capability like that could be used to our advantage in an RDBMS
> environment by directly manipulating Hive metadata and/or using Avro
> aliases:
>
> _price_ being an alias to columnX, then
> _price_ being an alias to columnY
>
> What does the community think?
> Antonios
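The reader-side schema union Jason describes (rather than rewriting the files) could be sketched roughly like this, again using pandas as a hypothetical stand-in for a reader that sees per-file schemas; the file contents and column names are invented for illustration:

```python
import pandas as pd

# Two "files" of the same logical dataset, written with different schemas.
file_a = pd.DataFrame({"id": [1, 2], "price": [9.99, 4.50]})
file_b = pd.DataFrame({"id": [3], "price": [2.25], "discount": [0.1]})

# Union of the two schemas, preserving first-appearance order.
union_cols = list(dict.fromkeys(list(file_a.columns) + list(file_b.columns)))

# The reader materializes nulls into columns not found in a particular
# file, then presents the files as one table under the union schema.
frames = [f.reindex(columns=union_cols) for f in (file_a, file_b)]
table = pd.concat(frames, ignore_index=True)
print(table)
```

The point of the sketch is that no existing file is rewritten: the union schema lives above the files, and null materialization happens at read time.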
