Re: Discuss fast copy on write rfc-68

2023-07-20 Thread Nicolas Paris
Splitting a parquet file into 5 row groups leads to the same benefit as creating 5 
parquet files with 1 row group each.

Also, the latter can allow more parallelism for writes.

Am I missing something?
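
For illustration only, a minimal pyarrow sketch of the two layouts being compared (row counts and file names are made up); both expose 5 independently skippable row groups:

    import pyarrow as pa
    import pyarrow.parquet as pq

    n = 5_000_000
    table = pa.table({"key": pa.array(range(n)), "val": pa.array(range(n))})

    # Layout A: one parquet file holding 5 row groups.
    pq.write_table(table, "layout_a.parquet", row_group_size=n // 5)

    # Layout B: 5 parquet files, each holding a single row group.
    for i in range(5):
        chunk = table.slice(i * (n // 5), n // 5)
        pq.write_table(chunk, f"layout_b_{i}.parquet", row_group_size=n // 5)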

On July 20, 2023 12:38:54 PM UTC, sagar sumit  wrote:
>Good questions! The idea is to be able to skip row groups based on the index.
>But if we have to do a full snapshot load, then our wrapper should actually
>be doing a batch GET on S3. Why incur 5x more calls?
>As for the update, I think this is in the context of COW. So the footer
>will be recomputed anyway, so handling updates should not be that tricky.
>
>Regards,
>Sagar
>
>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris 
>wrote:
>
>> Hi,
>>
>> Multiple independent initiatives for fast copy on write have emerged
>> (correct me if I am wrong):
>> 1.
>>
>> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
>> 2.
>> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>>
>>
>> The idea is to rely on the RLI (record-level index) to target only some row
>> groups in a given parquet file, and to serde only those when copying the file.
>>
>> Currently Hudi generates one row group per parquet file (and large row
>> groups are what Parquet and others advocate).
>>
>> The FCOW feature then needs to use several row groups per parquet file to
>> provide any benefit, say 30MB each as mentioned in the RFC-68 discussion.
>>
>> I have concerns about using small row groups for read performance, such
>> as:
>> - more S3 throttling: if we have 5x more row groups in a parquet file,
>> then it leads to 5x more GET calls
>> - worse read performance: since larger row groups lead to better
>> performance overall
>>
>>
>> As a side question, I wonder how the writer can keep the statistics in the
>> parquet footer correct. If updates occur somewhere, then the following
>> footer metadata must be updated accordingly:
>> - parquet row group/page statistics
>> - parquet dictionaries
>> - parquet bloom filters
>>
>> Thanks for your feedback on those
>>


Re: Discuss fast copy on write rfc-68

2023-07-20 Thread sagar sumit
Good questions! The idea is to be able to skip row groups based on the index.
But if we have to do a full snapshot load, then our wrapper should actually
be doing a batch GET on S3. Why incur 5x more calls?
As for the update, I think this is in the context of COW. So the footer
will be recomputed anyway, so handling updates should not be that tricky.
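
To illustrate the skipping part, a rough pyarrow sketch (not Hudi's actual reader; the file name and row group number are made up):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("base_file.parquet")

    # Targeted lookup: an index such as the RLI tells us only row group 3
    # matters, so only that one is fetched and deserialized.
    hit = pf.read_row_groups([3])

    # Full snapshot load: read the whole file in one pass instead of issuing
    # one request per row group.
    full = pf.read()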

Regards,
Sagar

On Thu, Jul 20, 2023 at 3:26 PM nicolas paris 
wrote:

> Hi,
>
> Multiple independent initiatives for fast copy on write have emerged
> (correct me if I am wrong):
> 1.
>
> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
> 2.
> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>
>
> The idea is to rely on the RLI (record-level index) to target only some row
> groups in a given parquet file, and to serde only those when copying the file.
>
> Currently Hudi generates one row group per parquet file (and large row
> groups are what Parquet and others advocate).
>
> The FCOW feature then needs to use several row groups per parquet file to
> provide any benefit, say 30MB each as mentioned in the RFC-68 discussion.
>
> I have concerns about using small row groups for read performance, such
> as:
> - more S3 throttling: if we have 5x more row groups in a parquet file,
> then it leads to 5x more GET calls
> - worse read performance: since larger row groups lead to better
> performance overall
>
>
> As a side question, I wonder how the writer can keep the statistics in the
> parquet footer correct. If updates occur somewhere, then the following
> footer metadata must be updated accordingly:
> - parquet row group/page statistics
> - parquet dictionaries
> - parquet bloom filters
>
> Thanks for your feedback on those
>


Discuss fast copy on write rfc-68

2023-07-20 Thread nicolas paris
Hi,

Multiple independent initiatives for fast copy on write have emerged
(correct me if I am wrong):
1.
https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
2.
https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/


The idea is to rely on the RLI (record-level index) to target only some row
groups in a given parquet file, and to serde only those when copying the file.
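
To make the copying idea concrete, a rough table-level sketch in pyarrow (the proposal itself copies untouched row groups as raw bytes, which this high-level API cannot express, so every row group is still re-encoded here; function and variable names are made up):

    import pyarrow.parquet as pq

    def rewrite_with_update(src, dst, dirty_row_group, apply_update):
        pf = pq.ParquetFile(src)
        with pq.ParquetWriter(dst, pf.schema_arrow) as writer:
            for i in range(pf.metadata.num_row_groups):
                rg = pf.read_row_group(i)
                # Only the row group flagged by the index is patched; the
                # others are written back unchanged.
                writer.write_table(apply_update(rg) if i == dirty_row_group else rg)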

Currently Hudi generates one row group per parquet file (and large row
groups are what Parquet and others advocate).

The FCOW feature then needs to use several row groups per parquet file to
provide any benefit, say 30MB each as mentioned in the RFC-68 discussion.
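
Purely as an illustration (this is not how Hudi sizes row groups), a pyarrow sketch that aims for roughly 30MB per row group by estimating rows per group from the in-memory size; the estimate ignores encoding and compression:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_with_row_group_target(table: pa.Table, path: str,
                                    target_bytes: int = 30 * 1024 * 1024) -> None:
        # Crude bytes-per-row estimate; real on-disk sizes depend on
        # encoding and compression.
        bytes_per_row = max(1, table.nbytes // max(1, table.num_rows))
        rows_per_group = max(1, target_bytes // bytes_per_row)
        pq.write_table(table, path, row_group_size=rows_per_group)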

I have concerns about using small row groups for read performance, such
as:
- more S3 throttling: if we have 5x more row groups in a parquet file,
then it leads to 5x more GET calls (a quick way to check this per file is
sketched after this list)
- worse read performance: since larger row groups lead to better
performance overall
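
Checking the layout of a given base file with pyarrow (file name made up); with naive ranged reads, each of these row groups can turn into a separate GET:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("base_file.parquet").metadata
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(i, rg.num_rows, rg.total_byte_size)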


As a side question, I wonder how the writer can keep the statistics in the
parquet footer correct. If updates occur somewhere, then the following
footer metadata must be updated accordingly (see the inspection sketch
after this list):
- parquet row group/page statistics
- parquet dictionaries
- parquet bloom filters
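
A sketch of inspecting that footer metadata with pyarrow (file name made up); per row group and column chunk it exposes the statistics and encodings, while bloom filters live elsewhere in the file and are not shown by this object:

    import pyarrow.parquet as pq

    md = pq.ParquetFile("base_file.parquet").metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            cc = md.row_group(rg).column(col)
            print(rg, cc.path_in_schema, cc.statistics, cc.encodings)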

Thanks for your feedback on those