Consider a challenging task: reading 20 GB of Parquet files
from HDFS, generating some new data, and storing it back.

So in effect imagine: 20 GB --processing--> 20.01 GB

This takes time (with Spark/Parquet), and we would all be happy if we
could achieve it in a matter of a few seconds, say 3-10. People
from the RDBMS world are capable of adding new columns in such timings...
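For comparison, a quick illustration of why the RDBMS side is so fast: in many engines an ADD COLUMN is a metadata-only change that never touches the stored rows. A minimal sketch with SQLite (table and column names are just illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

# ADD COLUMN rewrites only the table's schema record, not the 100k rows,
# which is why it completes near-instantly regardless of table size.
con.execute("ALTER TABLE t ADD COLUMN price REAL DEFAULT NULL")

print([row[1] for row in con.execute("PRAGMA table_info(t)")])
# → ['a', 'price']
```

The point of this thread is that Parquet's immutable files make the equivalent operation a full rewrite today.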


This is kind of impossible - unless we think outside the box.

The original 20 GB exist in 150 blocks.

Given the parquet-format#file-format
<https://github.com/Parquet/parquet-format#file-format> we could modify
only the last block of the 150 existing ones, and either:


   - Recreate it, adding the additional columns plus a metadata/footer
   carrying the new column info (Thrift / Avro format)


   - Append the new columns of data and a new metadata/footer to the
   existing file, and have Parquet readers always read the latest
   metadata/footer info (while the column pointers simply skip the old
   metadata)
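To ground the append idea: per the linked spec, a Parquet file ends with the serialized FileMetaData, a 4-byte little-endian footer length, and the magic bytes "PAR1", so a reader that seeks to the end of the file naturally picks up whichever footer was written last. A minimal sketch of parsing that trailer (the fake footer bytes below are just a placeholder, not real Thrift):

```python
import struct

PAR1 = b"PAR1"

def footer_info(tail: bytes):
    """Parse the end of a Parquet file:
    ... [footer bytes][4-byte footer length, little-endian][b"PAR1"]."""
    assert tail[-4:] == PAR1, "not a Parquet file"
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    footer = tail[-8 - footer_len : -8]
    return footer_len, footer

# Simulated file tail: placeholder "footer" bytes followed by length + magic.
fake_footer = b"\x15\x02\x19\x3c"  # hypothetical, not valid Thrift
tail = fake_footer + struct.pack("<I", len(fake_footer)) + PAR1
print(footer_info(tail))
# → (4, b'\x15\x02\x19<')
```

Under this scheme, appending new column chunks plus a fresh footer would leave the old footer as dead bytes in the middle of the file, invisible to any reader that only trusts the trailing one.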

I'd like to discuss the idea of adding new columns of data to existing
Parquet files.


A capability like that could be used to our advantage in an RDBMS-like
environment by directly manipulating the Hive metadata and/or using Avro
aliases:

  _price_ being an alias to  columnX  then
  _price_ being an alias to  columnY
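On the alias mechanics: in Avro schema resolution, the reader's field can carry aliases naming the writer's field, so repointing _price_ from columnX to columnY is just an edit to the reader schema, with no data rewrite. A hypothetical sketch (record and field names are illustrative):

```python
import json

# Reader schema v1: logical field "price" resolves to physical columnX.
reader_v1 = {
    "type": "record",
    "name": "Row",
    "fields": [{"name": "price", "type": "double", "aliases": ["columnX"]}],
}

# After appending the new column, only the alias list changes:
# "price" now resolves to columnY.
reader_v2 = json.loads(json.dumps(reader_v1))  # deep copy via round-trip
reader_v2["fields"][0]["aliases"] = ["columnY"]

print(reader_v2["fields"][0]["aliases"])
# → ['columnY']
```

Combined with the footer trick above, this could give the "new column in seconds" experience without rewriting the 20 GB.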


What does the community think ?
Antonios
