Splitting a parquet file into 5 row groups leads to the same benefit as creating 5 parquet files with 1 row group each.
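To make that concrete, here is a minimal sketch with pyarrow (not Hudi's writer path; file names, table contents and the 20k row-group size are made up for illustration) of the two layouts:

# Minimal pyarrow sketch; names and sizes are made up for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(100_000)),
                  "val": [float(i) for i in range(100_000)]})

# Layout A: one parquet file holding 5 row groups of 20k rows each.
pq.write_table(table, "one_file_5_rowgroups.parquet", row_group_size=20_000)

# Layout B: 5 parquet files, each a single 20k-row row group.
for i, batch in enumerate(table.to_batches(max_chunksize=20_000)):
    pq.write_table(pa.Table.from_batches([batch]), f"file_{i}.parquet")

# A reader that prunes on row-group statistics skips the same 20k-row units
# in both layouts; layout B just exposes them as separate objects that can
# be written independently.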
Also, the latter can involve more parallelism for writes. Am I missing something? (A quick pyarrow sketch of targeted row-group reads and per-row-group footer stats is at the end of this mail.)

On July 20, 2023 12:38:54 PM UTC, sagar sumit <cod...@apache.org> wrote:
>Good questions! The idea is to be able to skip row groups based on the index.
>But if we have to do a full snapshot load, then our wrapper should actually
>be doing a batch GET on S3. Why incur 5x more calls?
>As for the update, I think this is in the context of COW, so the footer
>will be recomputed anyway; handling updates should not be that tricky.
>
>Regards,
>Sagar
>
>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris <nicolas.pa...@riseup.net>
>wrote:
>
>> Hi,
>>
>> Multiple independent initiatives for fast copy on write have emerged
>> (correct me if I am wrong):
>> 1. https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
>> 2. https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>>
>> The idea is to rely on the RLI index to target only some row groups in a
>> given parquet file, and to serde only those when copying the file.
>>
>> Currently hudi generates one row group per parquet file (and large row
>> groups are what parquet and others advocate).
>>
>> The FCOW feature then needs several row groups per parquet file to
>> provide some benefit, let's say 30MB as mentioned in the rfc-68
>> discussion.
>>
>> I have concerns about small row groups hurting read performance, such as:
>> - more s3 throttling: 5x more row groups per parquet file leads to 5x
>> more GET calls
>> - worse read performance: larger row groups generally perform better
>> overall
>>
>> As a side question, I wonder how the writer can keep the statistics in
>> the parquet footer correct. If updates occur somewhere, then the
>> following footer contents must be updated accordingly:
>> - parquet row group/page stats
>> - parquet dictionary
>> - parquet bloom filters
>>
>> Thanks for your feedback on these
>>
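For reference, the pyarrow sketch mentioned above (pyarrow rather than Hudi's Java reader; the file name and row-group index are made up): it targets a single row group and lists the per-row-group footer stats that an update would have to recompute.

# Sketch with pyarrow, not Hudi's reader; file name and index are made up.
import pyarrow.parquet as pq

pf = pq.ParquetFile("one_file_5_rowgroups.parquet")

# Each row group is addressable on its own; over S3, every row group touched
# is roughly one ranged GET on top of the footer read, which is where the
# "5x more calls" concern comes from.
rg = pf.read_row_group(2, columns=["id", "val"])
print(rg.num_rows)

# The footer keeps stats per row group and column; a copy-on-write update
# must recompute them (plus dictionaries/bloom filters) for rewritten groups.
meta = pf.metadata
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(0).statistics
    print(i, stats.min, stats.max, stats.null_count)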