Splitting a parquet file into 5 row groups leads to the same benefit as
creating 5 parquet files with 1 row group each.

Also, the latter allows more parallelism for writes.
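
For illustration, here is a minimal pyarrow sketch of the two layouts I have in
mind (the toy table, row counts and file names are made up):

import pyarrow as pa
import pyarrow.parquet as pq

# A toy table of 1M rows standing in for a base file's data.
table = pa.table({"id": list(range(1_000_000)),
                  "val": [i * 2 for i in range(1_000_000)]})

# Layout A: one parquet file containing 5 row groups of ~200k rows each.
pq.write_table(table, "one_file.parquet", row_group_size=200_000)

# Layout B: 5 parquet files, each holding a single row group.
for i in range(5):
    chunk = table.slice(i * 200_000, 200_000)
    pq.write_table(chunk, f"part_{i}.parquet", row_group_size=200_000)

# Both layouts expose the same 5 row-group boundaries to a reader.
print(pq.ParquetFile("one_file.parquet").metadata.num_row_groups)  # 5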

Am I missing something?
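
And for the record, here is roughly how I read the row-group-targeted rewrite
from rfc-68 / the Uber post, as a pyarrow sketch (the column names record_key
and val, and the updates dict, are placeholders). Note that pyarrow re-encodes
every row group, so this only shows the "touch one row group, pass the others
through" logic, not the byte-level copy of untouched row groups that the
fast-COW writer is actually about:

import pyarrow as pa
import pyarrow.parquet as pq

def rewrite_one_row_group(src_path, dst_path, target_rg, updates):
    # updates: {record_key: new_val}. Only row group `target_rg` (the one the
    # record-level index points at) is modified; the others are passed through
    # (decoded and re-encoded here, copied as raw bytes in the real scheme).
    src = pq.ParquetFile(src_path)
    with pq.ParquetWriter(dst_path, src.schema_arrow) as writer:
        for rg in range(src.metadata.num_row_groups):
            tbl = src.read_row_group(rg)
            if rg == target_rg:
                keys = tbl.column("record_key").to_pylist()
                vals = tbl.column("val").to_pylist()
                new_vals = [updates.get(k, v) for k, v in zip(keys, vals)]
                tbl = tbl.set_column(tbl.schema.get_field_index("val"),
                                     "val", pa.array(new_vals))
            # Each write_table call emits its own row group(s), so the
            # original row-group layout is roughly preserved.
            writer.write_table(tbl)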

On July 20, 2023 12:38:54 PM UTC, sagar sumit <cod...@apache.org> wrote:
>Good questions! The idea is to be able to skip row groups based on the index.
>But, if we have to do a full snapshot load, then our wrapper should actually
>be doing batch GET on S3. Why incur 5x more calls?
>As for the updates, I think this is in the context of COW. Since the footer
>will be recomputed anyway, handling updates should not be that tricky.
>
>Regards,
>Sagar
>
>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris <nicolas.pa...@riseup.net>
>wrote:
>
>> Hi,
>>
>> Multiple independent initiatives for fast copy-on-write have emerged
>> (correct me if I am wrong):
>> 1.
>>
>> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
>> 2.
>> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>>
>>
>> The idea is to rely on the record-level index (RLI) to target only some
>> row groups in a given parquet file, and only serde those when copying the
>> file.
>>
>> Currently Hudi generates one row group per parquet file (and large row
>> groups are what Parquet and others advocate).
>>
>> The FCOW feature then needs to use several row groups per parquet file to
>> provide some benefit, let's say 30MB as mentioned in the rfc-68
>> discussion.
>>
>> I have concerns about using small row groups for read performance, such
>> as:
>> - more S3 throttling: if we have 5x more row groups in a parquet file,
>> then it leads to 5x more GET calls
>> - worse read performance: since larger row groups lead to better
>> performance overall
>>
>>
>> As a side question, I wonder how the writer can keep the statistics in the
>> parquet footer correct. If updates occur somewhere, then the following
>> items in the footer must be updated accordingly:
>> - parquet row group/page stats
>> - parquet dictionary
>> - parquet bloom filters
>>
>> Thanks for your feedback on these points.
>>
