asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-658837587
@bvaradar Thanks for the quick response, Balaji. To make sure I understand correctly, let me walk through an example.
The data generated per write for a dataset will be in the range of 1 MB, across roughly 500 datasets. I had set the following properties:
"hoodie.parquet.small.file.limit": 2*1024*1024,
"hoodie.parquet.max.file.size": 2*1024*1024*1024,
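For concreteness, here is how I am passing those two properties in a hypothetical PySpark write (the table name and path are made up for illustration; only the two size properties come from my actual setup):

```python
# Options dict as it might be passed to a Hudi write via the Spark DataFrame
# writer. Spark options are strings, so the computed sizes are stringified.
hudi_options = {
    "hoodie.table.name": "example_table",                        # illustrative name
    "hoodie.parquet.small.file.limit": str(2 * 1024 * 1024),     # 2 MB = 2097152
    "hoodie.parquet.max.file.size": str(2 * 1024 * 1024 * 1024), # 2 GB = 2147483648
}

# df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/example_table")
print(hudi_options["hoodie.parquet.small.file.limit"])   # 2097152
print(hudi_options["hoodie.parquet.max.file.size"])      # 2147483648
```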
So, to confirm my understanding: on the first write, the data is 1 MB, which is below the 2 MB small-file limit, so the first parquet file written will be 1 MB. The second write of another 1 MB should merge into that existing parquet file. On the third write, the data will again be 1 MB, but since the first file has already reached the 2 MB limit, a second parquet file will be created?
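The walk-through above can be sketched as a toy simulation. To be clear, the routing logic and names below are my assumptions about how the sizing works, for illustration only; they are not Hudi's actual code:

```python
# Toy model of my understanding of Hudi small-file handling:
# new data is appended to an existing file while it is below the
# small-file limit, otherwise a new file is created.
MB = 1024 * 1024
SMALL_FILE_LIMIT = 2 * MB        # hoodie.parquet.small.file.limit
MAX_FILE_SIZE = 2 * 1024 * MB    # hoodie.parquet.max.file.size

def route_write(file_sizes, incoming_bytes):
    """Append to the first file below the small-file limit (capped by
    the max file size); otherwise start a new file."""
    for i, size in enumerate(file_sizes):
        if size < SMALL_FILE_LIMIT and size + incoming_bytes <= MAX_FILE_SIZE:
            file_sizes[i] += incoming_bytes
            return file_sizes
    file_sizes.append(incoming_bytes)
    return file_sizes

files = []
for _ in range(3):               # three successive writes of 1 MB each
    files = route_write(files, 1 * MB)

print([size // MB for size in files])  # [2, 1]: the third write opens a new file
```

If this model is right, the first two writes merge into one 2 MB file and the third write starts a second file, matching the scenario I described.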
Where does hoodie.parquet.max.file.size come into play in this process?
Also, does this happen automatically, or do I need to set additional properties for it to take effect beyond the two I have specified?
Finally, if I want to contribute to the development of the clustering feature, what is the process for that?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]