karan867 edited a comment on issue #3077:
URL: https://github.com/apache/hudi/issues/3077#issuecomment-866913205


   @n3nash  Thank you for the suggestions. I tried the parameters you 
suggested. 
   
   * Setting `hoodie.parquet.small.file.limit` to zero did not make much 
difference, and the subsequent commits became slower. 
   
![image](https://user-images.githubusercontent.com/85880633/123118196-0b82c300-d460-11eb-8a40-b279e704cc98.png)
   
   
   * Setting the index to SIMPLE decreased the time from 12 mins to 9.5 mins. I 
did not try it earlier because in our use case we have hourly batch jobs 
that mostly touch the latest partitions, and this blog post 
[https://hudi.apache.org/blog/hudi-indexing-mechanisms/](https://hudi.apache.org/blog/hudi-indexing-mechanisms/) 
says that the simple index works best if the updates are random. Also, is the 
time complexity of the simple index on the order of the number of rows present 
in the partition? I just want to make sure it does not increase with the number 
of commits or partitions.
   
![image](https://user-images.githubusercontent.com/85880633/123118222-1178a400-d460-11eb-8ffa-02b780843745.png)
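   For reference, the two experiments above can be written as overrides on a common set of writer options. This is only a minimal sketch: the table name, field names, and the `build_options` helper are hypothetical and are not taken from our job. Only the `hoodie.*` keys are actual Hudi configs.

```python
# Hypothetical base options; table and field names are placeholders.
BASE_OPTIONS = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}


def build_options(base, overrides):
    """Return a new dict with the experiment overrides applied on top of base."""
    merged = dict(base)  # copy so the base options are never mutated
    merged.update(overrides)
    return merged


# Experiment 1: set the small-file limit to zero.
no_small_files = build_options(
    BASE_OPTIONS, {"hoodie.parquet.small.file.limit": "0"}
)

# Experiment 2: switch the index type to SIMPLE.
simple_index = build_options(BASE_OPTIONS, {"hoodie.index.type": "SIMPLE"})

# With an existing DataFrame `df` and a table path `base_path` (both assumed,
# not created here), the write would look roughly like:
# df.write.format("hudi").options(**simple_index).mode("append").save(base_path)
```

   Keeping the overrides separate from the base options makes it easier to compare runs where only one setting changes.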
   
   
   I have a few more questions regarding Hudi writes:
   * Can you explain what is happening in the steps taking the most time? 
      * Load latest base files from partitions 
      * Building workload profile 
      * Getting small files from partitions step1 
      * Getting small files from partitions step2  
    
   * Are there any benchmarks of write latencies I can compare against? For 
example, the time taken to write 100k rows of 1 KB each. Some rough estimates 
from your experience would also do. 
   
   * Can we somehow insert data containing duplicates and still support updates 
and deletes? The primary reason we are using Hudi is to make our data 
lake GDPR compliant. 
   
   * Is the MOR metadata table created by setting `hoodie.metadata.enable` to 
true used when writing, or only when reading the data? 
   
   * In some commits, seemingly at random, the write takes much less time. Do 
you have an explanation for that? (It is not the first commit.)
   
![image](https://user-images.githubusercontent.com/85880633/123119753-6e288e80-d461-11eb-83f6-1348ce752eb1.png)
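   To make the GDPR question concrete, this is roughly the delete path we would need to keep working. It is a sketch under assumptions: the `delete_options` helper and all table and field names are hypothetical. The only Hudi-specific part is setting `hoodie.datasource.write.operation` to `delete`.

```python
def delete_options(table_name, record_key_field, partition_field, precombine_field):
    """Build writer options for a delete of the given record keys.

    All arguments are placeholders supplied by the caller; nothing here is
    specific to our table.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Rows in the written DataFrame are treated as keys to delete.
        "hoodie.datasource.write.operation": "delete",
    }


# Hypothetical names, for illustration only.
gdpr_delete = delete_options("events", "event_id", "event_date", "updated_at")

# With an existing DataFrame `to_delete` holding the records to erase and a
# table path `base_path` (both assumed, not created here):
# to_delete.write.format("hudi").options(**gdpr_delete).mode("append").save(base_path)
```

   A delete like this locates records by key, so the open question is how it would behave if the table were allowed to contain duplicate keys.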
   
     



