sam-wmt opened a new issue #1782:
URL: https://github.com/apache/hudi/issues/1782


   **Describe the problem you faced**
   Currently we are streaming data and upserting into a Merge-On-Read table. The 
table is bounded at roughly 350M entities, and we expect the total table size to 
be 10+ TB. For the first 10M records the batches completed quickly, but we then 
started to see the batch time slowly grow by about 1 minute per 5-10 batches. 
For testing we have enabled inline compaction (INLINE_COMPACT_PROP) with 
INLINE_COMPACT_NUM_DELTA_COMMITS_PROP set to 12, and are currently running 
15-minute batches in Spark.
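   For reference, the configuration described above maps onto Hudi's writer options roughly as follows. This is a hedged sketch, not the reporter's actual job: the table name and upsert operation are placeholder assumptions, while `hoodie.compact.inline` and `hoodie.compact.inline.max.delta.commits` are the config keys behind INLINE_COMPACT_PROP and INLINE_COMPACT_NUM_DELTA_COMMITS_PROP.

   ```python
   # Sketch of the writer options described in the issue.
   # Placeholders/assumptions: table name, operation, base path.
   hudi_options = {
       "hoodie.table.name": "my_mor_table",                  # placeholder
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.write.operation": "upsert",        # assumed from "upserting"
       "hoodie.compact.inline": "true",                      # INLINE_COMPACT_PROP
       "hoodie.compact.inline.max.delta.commits": "12",      # INLINE_COMPACT_NUM_DELTA_COMMITS_PROP
   }

   # In a Spark batch this would be applied roughly as:
   # df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
   ```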
   
   
   **Expected behavior**
   With Merge-On-Read semantics I would expect most writes (delta commits) to be 
quick and only the compaction commit to take longer. However, I see no difference 
in speed across the 12 batches; they all take about the same amount of time.
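   One way to check whether the slow batches actually coincide with compaction is to inspect the Hudi timeline under the table's `.hoodie` directory, where completed delta commits, completed compactions, and scheduled compactions leave differently named files. A minimal local-filesystem sketch (the base path is a placeholder; for ADLS/GCS you would list via the corresponding filesystem API instead):

   ```python
   import os
   from collections import Counter

   def summarize_timeline(base_path):
       """Count Hudi timeline actions by file suffix under <base_path>/.hoodie.

       Completed delta commits end in '.deltacommit', completed compactions
       in '.commit', and scheduled compactions in '.compaction.requested'.
       """
       counts = Counter()
       for name in os.listdir(os.path.join(base_path, ".hoodie")):
           if name.endswith(".deltacommit"):
               counts["deltacommit"] += 1
           elif name.endswith(".compaction.requested"):
               counts["compaction.requested"] += 1
           elif name.endswith(".commit"):
               counts["commit"] += 1
       return counts
   ```

   If `compaction.requested` entries pile up without matching `commit` entries, compactions are being scheduled but not executed, which would explain uniformly slow delta commits rather than an occasional slow compaction.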
   
   
   **Environment Description**
   Spark running in Azure; storage layer ADLS_V2, cross-tested with GCS.
   
   Hudi version : 0.5.3
   Spark version : 2.4-1.0.5
   Hive version :
   Hadoop version : 2.7
   Storage (HDFS/S3/GCS..) : ADLSv3 and GCS
   Running on Docker? (yes/no) : yes and no, both
   
   
   **Stacktrace**
   No exceptions are thrown; batch times are simply degrading.
   
   @christoph-wmt
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
