asharma4-lucid commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-731908622


   Thanks @bvaradar for your response. I have a few more questions:
   
   1) The reason we have kept the partition key we are using is that we wanted 
O(1) read performance on it. My understanding is that this many partitions puts 
memory pressure on the executors, since each executor creates as many writers 
as there are partitions it touches. (I assume the HDFS namenode would also be 
impacted, but since we are using S3, I am discounting that; do let me know if I 
am mistaken.) This is where I wanted to confirm my understanding. Every day our 
process will update ~12K partitions and insert ~33K new partitions. So my 
question is: will the executors doing the Hudi table write create ~44K writers, 
contributing to the memory pressure? Or will the already existing partitions, 
i.e. ~300K, also be touched in some way by the Hudi write executors, leading to 
performance degradation as we continue to add more data to the Hudi table?
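   To make the arithmetic behind this concern concrete, here is a rough 
back-of-envelope sketch. The daily partition counts are from our workload; the 
executor count is a made-up illustrative number, not our actual cluster size:

```python
# Hypothetical back-of-envelope for the writer-count concern.
# Assumes (worst case) one open writer per partition touched in a run.
updated_partitions = 12_000
inserted_partitions = 33_000
partitions_touched = updated_partitions + inserted_partitions  # ~45K per daily run

executors = 100  # illustrative cluster size, not from our setup
# If incoming keys are evenly spread, each executor writes to roughly:
writers_per_executor = partitions_touched // executors
print(partitions_touched, writers_per_executor)  # 45000 450
```

   The question, essentially, is whether the relevant number is the 
per-run count above or something that scales with the full ~300K partitions.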
   
   2) Just to confirm my understanding: when you mentioned S3 listing as the 
bottleneck, you meant the S3 listing of all the partitions and files for the 
Hudi table, not just the partitions updated and/or inserted by that specific 
run. In my case, that would imply the Hudi table write process is doing an S3 
listing of the already existing ~300K partitions and their associated files, 
not just the ~44K partitions for the specific execution. This is probably in 
line with what we have observed: for the initial 15 daily runs, each Hudi table 
write completed in around 4 hrs, and from the 16th day onwards it gradually 
increased from 4 to 5 to 6 and now to almost 9 hrs per day. Can you please 
confirm?
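   A toy model of why this would produce exactly the slowdown we see: if each 
write lists every existing partition, listing cost grows linearly with table 
age. The daily partition growth is from our workload; the per-partition cost 
constant is made up purely for illustration:

```python
# Toy model: listing cost proportional to total partitions accumulated so far.
new_partitions_per_day = 33_000            # from our workload
seconds_per_partition_listed = 0.001       # hypothetical S3 LIST cost share

def listing_seconds(day: int) -> float:
    total_partitions = day * new_partitions_per_day
    return total_partitions * seconds_per_partition_listed

# Listing cost on day 30 is roughly double day 15, matching the
# gradually-increasing runtime we observe (4 hrs creeping toward 9).
print(listing_seconds(15), listing_seconds(30))
```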
   
   3) If the S3 listing requirement is made optional in Hudi 0.7.0, can we 
continue to use our current partition key, assuming that every day our process 
will add/update ~44K partitions in the Hudi table? I understand it is not the 
best partition key given its very high cardinality, but our read requirement is 
what is driving us towards it. This may be related to question 1 above, but my 
question is: apart from the S3 listing dependency, is there any other downside 
you can see to our use of this partition key?
   
   4) We are trying to see if Spark bucketing on the key would be a good middle 
ground between partitioning on the key and not partitioning at all. Does the 
Hudi table write support bucketed writes, and consequently, can Hudi table 
reads use the buckets for optimal read performance? Something like O(1) hash + 
O(log m) binary search, where m is the number of records in each bucketed file.
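   For clarity, here is a sketch of the read path we have in mind: an O(1) 
hash to pick a bucket, then an O(log m) binary search within that sorted 
bucket. This is purely illustrative pure-Python, not a Hudi or Spark API:

```python
import bisect

NUM_BUCKETS = 4  # illustrative; a real table would use many more
buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket kept sorted

def bucket_of(key: str) -> int:
    return hash(key) % NUM_BUCKETS           # O(1) bucket selection

def insert(key: str) -> None:
    bisect.insort(buckets[bucket_of(key)], key)  # maintain sorted order

def contains(key: str) -> bool:
    b = buckets[bucket_of(key)]
    i = bisect.bisect_left(b, key)           # O(log m) within the bucket
    return i < len(b) and b[i] == key

for k in ["user-1", "user-2", "user-3"]:
    insert(k)
print(contains("user-2"), contains("user-99"))  # True False
```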


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

