I definitely can't see a benefit to using 30MB row groups over just creating 
30MB parquet files.

I would add that the stats indexes are at the file level, which is another 
argument in favor of using row group size = file size.
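
To make that concrete, here is a minimal sketch of what "row group size = file 
size" means in terms of Hudi writer options (pyspark; the table name, path and 
120MB value are hypothetical):

    # df: an existing Spark DataFrame to write
    # One row group per parquet file: block size == target file size
    target_size = 120 * 1024 * 1024
    (df.write.format("hudi")
        .option("hoodie.table.name", "my_table")              # hypothetical table name
        .option("hoodie.parquet.max.file.size", target_size)  # target parquet file size
        .option("hoodie.parquet.block.size", target_size)     # row group (block) size = file size
        .mode("append")
        .save("s3://bucket/path"))                            # hypothetical path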

The only context where it would help is when clustering is set up and targets 
1GB files with 128MB row groups.
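
As a rough sketch, that clustering setup would look something like the options 
below (values are illustrative, and whether clustering picks up the parquet 
block size exactly this way is my assumption):

    # Hypothetical clustering options: ~1GB target files, 128MB row groups
    clustering_opts = {
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
        "hoodie.parquet.block.size": str(128 * 1024 * 1024),  # row group size for rewritten files
    }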

I would love to be contradicted on this. But so far a fast COW already exists: 
it consists of reducing the parquet file size for faster writes. It comes with 
a drawback on read performance, just as smaller row groups would, but it 
benefits more from the stats indexes (since they are file-level).
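
For reference, that existing "fast COW" is just a matter of lowering the 
target parquet file size, e.g. something like this (values are illustrative):

    # Hypothetical options for the small-files flavour of fast COW
    small_file_opts = {
        "hoodie.parquet.max.file.size": str(30 * 1024 * 1024),  # e.g. 30MB files, cheaper rewrites
        "hoodie.parquet.small.file.limit": "0",                 # assumption: disable small-file bin-packing so files stay small
    }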


On July 20, 2023 9:28:07 PM UTC, Nicolas Paris <nicolas.pa...@riseup.net> wrote:
>Splitting a parquet file into 5 row groups leads to the same benefit as 
>creating 5 parquet files with 1 row group each instead.
>
>Also, the latter can involve more parallelism for writes.
>
>Am I missing something?
>
>On July 20, 2023 12:38:54 PM UTC, sagar sumit <cod...@apache.org> wrote:
>>Good questions! The idea is to be able to skip row groups based on the index.
>>But, if we have to do a full snapshot load, then our wrapper should actually
>>be doing batch GETs on S3. Why incur 5x more calls?
>>As for the updates, I think this is in the context of COW. So the footer
>>will be recomputed anyway, and handling updates should not be that tricky.
>>
>>Regards,
>>Sagar
>>
>>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris <nicolas.pa...@riseup.net>
>>wrote:
>>
>>> Hi,
>>>
>>> Multiple independent initiatives for fast copy on write have emerged
>>> (correct me if I am wrong):
>>> 1.
>>>
>>> https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
>>> 2.
>>> https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
>>>
>>>
>>> The idea is to rely on the RLI index to target only some row groups in a
>>> given parquet file, and to only serde those when copying the file.
>>>
>>> Currently hudi generates one row group per parquet file (and having
>>> large row groups is what parquet and others advocate).
>>>
>>> The FCOW feature then needs to use several row groups per parquet file to
>>> provide some benefit, let's say 30MB as mentioned in the rfc-68
>>> discussion.
>>>
>>> I have concerns about using small row groups for read performance, such
>>> as:
>>> - more s3 throttling: if we have 5x more row groups in a parquet file,
>>> then it leads to 5x more GET calls
>>> - worse read performance: since larger row groups lead to better
>>> performance overall
>>>
>>>
>>> As a side question, I wonder how the writer can keep the statistics within
>>> the parquet footer correct. If updates occur somewhere, then the following
>>> items present in the footer must be updated accordingly:
>>> - parquet row group/pages stats
>>> - parquet dictionary
>>> - parquet bloom filters
>>>
>>> Thanks for your feedback on these points.
>>>
