KarthickAN commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711645166


   @nsivabalan @vinothchandar Thank you so much for all the explanations. Thinking 
about it, having 10MB worth of index data may not be an issue as long as the file 
contains a considerable number of records. In my case, I had a scenario with only 
1,000 records but 10MB of index. So I have switched to the dynamic bloom filter 
now, which is really helpful in this case. 
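   For intuition on why a filter can stay large even with few records: a bloom filter is sized up front for the configured number of expected entries, so its on-disk footprint tracks the configured `num_entries`, not the actual record count. A minimal sketch of the standard sizing formula (generic Bloom filter math, not Hudi's exact implementation):

   ```python
   import math

   def bloom_filter_size_bytes(num_entries: int, fpp: float) -> int:
       """Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits,
       where n is the expected entry count and p the false-positive rate."""
       bits = -num_entries * math.log(fpp) / (math.log(2) ** 2)
       return math.ceil(bits / 8)

   # A filter sized for 60,000 entries at fpp 1e-9 is ~316 KB,
   # regardless of how few records are actually written to the file.
   print(bloom_filter_size_bytes(60000, 1e-9))
   ```

   This is why a simple (fixed-size) filter wastes space on small files, and why a dynamic filter, which starts small and grows with the record count, helps here.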
   
   We are dealing with two different types of data, one of which doesn't have 
much volume. That's where the index size became disproportionate, whereas for 
the other type, where we do have a good volume of data, this didn't come up as 
an issue since we'd already have around 110-120MB worth of data plus index. As 
of now I've configured it like below:
   
   IndexBloomNumEntries = 35000
   BloomIndexFilterType = DYNAMIC_V0
   BloomIndexFilterDynamicMaxEntries = 1400000
   
   starting off with 35K (1% of the max number of entries in a file) as a base 
and scaling out to 1.4M (40% of the max number of entries in a file) as the 
file grows. That should possibly solve the problem. In any case, we need to 
test this out against the volume we are seeing right now and tune it further 
if required.
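   For reference, the shorthand names above correspond to Hudi write configs along these lines (key names taken from Hudi's index configuration; worth double-checking against the Hudi version in use):

   ```properties
   hoodie.index.bloom.num_entries=35000
   hoodie.bloom.index.filter.type=DYNAMIC_V0
   hoodie.bloom.index.filter.dynamic.max.entries=1400000
   ```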
   
   @vinothchandar Yes, having a blog post around this would definitely be very 
helpful. I feel Hudi has a lot of features that could be used more effectively 
with more in-depth explanations than what we currently have in the 
documentation. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
