There are some challenging tasks, e.g. reading 20 GB of Parquet files from HDFS, generating some data, and storing them back.
So in effect, imagine 20 GB --processing--> 20.01 GB. This takes time with Spark/Parquet, and we would all be happy if we could achieve it in a matter of a few seconds, i.e. 3-10 seconds. People from the RDBMS world are capable of adding new columns in such timings... This is kind of impossible - unless we think outside the box.

The original 20 GB exist in 150 blocks. Given the parquet-format#file-format <https://github.com/Parquet/parquet-format#file-format>, we could modify only the last block of the 150 existing ones, and either:
- recreate it, adding the additional columns of data plus a metadata/footer with the new column info (Thrift | Avro format), or
- append to the existing file the new columns of data and a new metadata/footer, and have Parquet readers always read the latest metadata/footer (while the column pointers simply skip the old metadata).

I'd like to discuss the idea of adding new columns of data to Parquet files. A capability like that could be used to our advantage in an RDBMS environment by directly manipulating the Hive metadata and/or using Avro aliases: _price_ being an alias to columnX, then _price_ being an alias to columnY.

What does the community think?

Antonios
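To make the append idea concrete, here is a toy Python sketch of the trailer layout described in the parquet-format spec linked above: a file ends with the footer bytes, a 4-byte little-endian footer length, and the magic "PAR1". Because readers locate the metadata from the end of the file, appending new column data plus a fresh footer makes the new footer the one readers find; the old footer is left in place as dead bytes. This is pure illustration - the byte strings below stand in for real column chunks, and a real footer is Thrift-serialized FileMetaData, not a plain string.

```python
import struct

MAGIC = b"PAR1"

def read_latest_footer(buf: bytes) -> bytes:
    """Parse the Parquet-style trailer: <footer> <4-byte LE length> <"PAR1">.

    Readers resolve metadata from the end of the file, so whatever
    footer was written last is the one that is seen.
    """
    assert buf[-4:] == MAGIC, "not a Parquet-style trailer"
    (footer_len,) = struct.unpack("<I", buf[-8:-4])
    return buf[-8 - footer_len:-8]

def append_columns(buf: bytes, new_column_data: bytes, new_footer: bytes) -> bytes:
    """Hypothetical append: new column chunks + a fresh footer + trailer.

    The new footer would describe both the old and the new columns;
    the old footer remains mid-file but is never read again.
    """
    return (buf + new_column_data + new_footer
            + struct.pack("<I", len(new_footer)) + MAGIC)

# Toy "file": header magic, fake column chunk, footer, trailer.
original = MAGIC + b"col-X-data" + b"footer-v1" + struct.pack("<I", 9) + MAGIC
assert read_latest_footer(original) == b"footer-v1"

# Append a new column and a new footer; readers now see footer-v2.
updated = append_columns(original, b"col-Y-data", b"footer-v2")
assert read_latest_footer(updated) == b"footer-v2"
```

The design point the sketch shows is that nothing in the format forces a rewrite of the first 149 blocks: only the trailer has to move, which is what makes the "few seconds" target plausible.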
