Hi,

Multiple independent initiatives for fast copy-on-write (FCOW) have
emerged (correct me if I am wrong):
1. https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
2. https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/


The idea is to rely on the record-level index (RLI) to target only the
affected row groups in a given Parquet file, and to deserialize and
reserialize only those row groups when rewriting the file.
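
To make that concrete, here is a minimal sketch (not Hudi's actual code
path) of how a row position obtained from some index could be mapped to
the single row group that contains it, using only parquet-mr footer
metadata; the RowGroupLocator class and findRowGroup helper are
hypothetical names:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupLocator {
  // Returns the index of the row group holding rowPosition; every other
  // row group could then be copied byte-for-byte without serde.
  public static int findRowGroup(Path file, long rowPosition) throws IOException {
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
      List<BlockMetaData> rowGroups = reader.getFooter().getBlocks();
      long firstRow = 0;
      for (int i = 0; i < rowGroups.size(); i++) {
        long rows = rowGroups.get(i).getRowCount();
        if (rowPosition < firstRow + rows) {
          return i;
        }
        firstRow += rows;
      }
      throw new IllegalArgumentException("position past end of file: " + rowPosition);
    }
  }
}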

Currently Hudi generates one row group per Parquet file (and large row
groups are what Parquet and others advocate).

The FCOW feature then needs several row groups per Parquet file to
provide any benefit, say 30MB each as mentioned in the RFC-68
discussion (a writer-config sketch follows below).
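
A minimal sketch of how one could produce such files today, assuming
the Spark datasource writer and an existing Dataset<Row> named df; the
key names hoodie.parquet.block.size and hoodie.parquet.max.file.size
are Hudi storage configs, while the table name and path are made up.
With ~120 MB files and 30 MB row groups, each file ends up with ~4 row
groups:

df.write()
  .format("hudi")
  .option("hoodie.table.name", "my_table")  // hypothetical table name
  .option("hoodie.parquet.max.file.size", String.valueOf(120L * 1024 * 1024))
  .option("hoodie.parquet.block.size", String.valueOf(30L * 1024 * 1024))
  .mode("append")
  .save("/tmp/hudi/my_table");              // hypothetical path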

I have concerns about using small row groups for read performance, such
as:
- more S3 throttling: if we have 5x more row groups in a Parquet file,
a scan issues roughly 5x more GET calls (rough arithmetic below)
- worse read performance: larger row groups generally yield better scan
performance overall
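
To put rough numbers on the throttling concern, assuming one range GET
per column chunk and no request coalescing in the reader:

  GETs per scan ~= (columns read) x (row groups per file)

  120 MB file, 1 x 120 MB row group, 10 columns  -> ~10 GETs
  120 MB file, 4 x  30 MB row groups, 10 columns -> ~40 GETs

Readers that coalesce adjacent ranges soften this, but the request
count still grows with the row group count.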


As a side question, I wonder how the writer can keep the statistics in
the Parquet footer correct. If updates occur in a row group, then the
following footer metadata must be updated accordingly:
- Parquet row group/page statistics (min/max, null counts)
- Parquet dictionaries
- Parquet bloom filters
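
As an illustration, here is a minimal parquet-mr sketch (again, not
Hudi code) that walks a file's footer and prints the per-column-chunk
metadata a row-group-level rewrite would invalidate:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterAudit {
  public static void main(String[] args) throws IOException {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      int i = 0;
      for (BlockMetaData rowGroup : reader.getFooter().getBlocks()) {
        System.out.println("row group " + i++);
        for (ColumnChunkMetaData col : rowGroup.getColumns()) {
          // min/max/null-count stats would need recomputing after an update
          System.out.println("  " + col.getPath() + " stats=" + col.getStatistics());
          // dictionary page location; a rewrite replaces the dictionary
          System.out.println("  dictionary page offset=" + col.getDictionaryPageOffset());
          // bloom filter location, if present; must be rebuilt for new keys
          System.out.println("  bloom filter offset=" + col.getBloomFilterOffset());
        }
      }
    }
  }
}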

Thanks for your feedback on these points.
