ankit0811 opened a new issue, #11742:
URL: https://github.com/apache/hudi/issues/11742

   We started ingesting data from Kafka using Spark (Java) and wanted to 
understand the storage cost associated with creating a table.
   Interestingly, we see that our S3 listing cost is growing exponentially, so we 
wanted to understand if we are missing something that could help us reduce this cost.
   
   We do have the metadata table enabled, so the assumption was that the S3 listing 
cost would be reduced, but we don't see that.
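   For context, the only metadata-related options we set explicitly are the ones in the config further below; we assumed the metadata table itself defaults to on. A minimal sketch of what we believe enables metadata-table-based file listing on both paths (`hoodie.metadata.enable` is the key we believe controls this; `df`, `spark`, and `basePath` are placeholders from our job, so please correct us if this understanding is wrong):
   
   ```java
   // Writer side: we believe hoodie.metadata.enable defaults to true in recent
   // releases, but setting it explicitly rules out it being off.
   df.writeStream()
       .format("hudi")
       .option("hoodie.metadata.enable", "true")  // maintain the metadata table
       // ... remaining writer options as in the config below ...
   
   // Reader side: our assumption is that queries must also opt in,
   // otherwise they fall back to direct S3 listing.
   Dataset<Row> snapshot = spark.read()
       .format("hudi")
       .option("hoodie.metadata.enable", "true")  // serve file listings from the metadata table
       .load(basePath);
   ```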
   
   When the Spark job starts, we see that the embedded timeline server starts and 
then errors out after processing the first micro-batch. (Not sure if this is 
causing the listing cost; I believe the embedded timeline server was similarly 
disabled as per this [issue](https://github.com/apache/hudi/issues/10432).)
   
   <img width="1029" alt="Screenshot 2024-08-08 at 08 59 57" 
src="https://github.com/user-attachments/assets/b0982aff-0153-44ed-b0b6-049d194e94f7">
   
   
   **Environment Description**
   
   * Hudi version : 0.15.0
   
   * Spark version : 3.4.3
   
   * Storage (HDFS/S3/GCS..) : S3
   
   **Stacktrace**
   
   ```
   24/08/08 16:07:56 INFO TimelineService: Starting Timeline server on port 
:45043
   24/08/08 16:07:56 INFO EmbeddedTimelineService: Started embedded timeline 
server at <......>
   24/08/08 16:07:56 INFO BaseHoodieClient: Timeline Server already running. 
Not restarting the service
   24/08/08 16:07:56 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20240808160601237__commit__COMPLETED__20240808160625000]}
   
   ```
   
   ```
   24/08/08 16:08:02 INFO BaseHoodieClient: Embedded Timeline Server is 
disabled. Not starting timeline service
   24/08/08 16:08:02 INFO BaseHoodieClient: Embedded Timeline Server is 
disabled. Not starting timeline service
   
   ```
   
   
   **Hudi config**
   
   ```java
   df.writeStream()
       .format("hudi")
       .option("hoodie.insert.shuffle.parallelism", "2")
       .option("hoodie.upsert.shuffle.parallelism", "2")
       .option("hoodie.delete.shuffle.parallelism", "2")
       .option(EMBEDDED_TIMELINE_SERVER_ENABLE.key(), "true")

       // Table config
       .option(HoodieWriteConfig.TBL_NAME.key(), newTName)
       .option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name())
       .option("hoodie.datasource.write.operation", WriteOperationType.INSERT.value())
       .option(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), "ts_date")
       .option("checkpointLocation", newCheckPoint)
       .option(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "col1,col2,col3,col4")
       .option(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key(), "time_stamp")
       .option(INDEX_TYPE.key(), GLOBAL_BLOOM.name())

       // Clustering + compaction config
       .option(ASYNC_CLUSTERING_ENABLE.key(), "true")
       .option("hoodie.clustering.plan.strategy.max.bytes.per.group", "524288000")

       // Metadata config
       .option(ENABLE_METADATA_INDEX_COLUMN_STATS.key(), "true")
       .option(ASYNC_INDEX_ENABLE.key(), "true")

       .option(ASYNC_CLEAN.key(), "true")

       .outputMode(OutputMode.Append());
   ```
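   
   To summarize, these are the listing-related settings we plan to verify next. The string keys below are what we believe the constants in our config resolve to, so please correct us if any of them are wrong or if other knobs matter more for S3 LIST traffic:
   
   ```java
   // Hypothetical checklist of writer options we suspect affect S3 listing:
   .option("hoodie.metadata.enable", "true")        // file listings served from the metadata table
   .option("hoodie.embed.timeline.server", "true")  // same flag as EMBEDDED_TIMELINE_SERVER_ENABLE above
   ```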

