ankit0811 opened a new issue, #11742: URL: https://github.com/apache/hudi/issues/11742
We started ingesting data from Kafka using Spark (Java) and wanted to understand the storage cost associated with creating a table. Interestingly, we see our S3 listing cost growing exponentially, so we wanted to understand if we are missing something that could help us reduce this cost. We do have the metadata table enabled, so the assumption was that the S3 listing cost would be reduced, but we don't see that. When the Spark job starts, we see that the embedded timeline server starts and then errors out after processing the first micro-batch. (Not sure if this is causing the listing cost; I believe the embedded timeline server was disabled as per this [issue](https://github.com/apache/hudi/issues/10432).)

<img width="1029" alt="Screenshot 2024-08-08 at 08 59 57" src="https://github.com/user-attachments/assets/b0982aff-0153-44ed-b0b6-049d194e94f7">

**Environment Description**

* Hudi version : 0.15.0
* Spark version : 3.4.3
* Storage (HDFS/S3/GCS..) : S3

**Stacktrace**

```
24/08/08 16:07:56 INFO TimelineService: Starting Timeline server on port :45043
24/08/08 16:07:56 INFO EmbeddedTimelineService: Started embedded timeline server at <......>
24/08/08 16:07:56 INFO BaseHoodieClient: Timeline Server already running. Not restarting the service
24/08/08 16:07:56 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20240808160601237__commit__COMPLETED__20240808160625000]}
```

```
24/08/08 16:08:02 INFO BaseHoodieClient: Embedded Timeline Server is disabled. Not starting timeline service
24/08/08 16:08:02 INFO BaseHoodieClient: Embedded Timeline Server is disabled. Not starting timeline service
```

**Hudi config**

```java
df.writeStream()
    .format("hudi")
    .option("hoodie.insert.shuffle.parallelism", "2")
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .option("hoodie.delete.shuffle.parallelism", "2")
    .option(EMBEDDED_TIMELINE_SERVER_ENABLE.key(), "true")
    .option(HoodieWriteConfig.TBL_NAME.key(), newTName)
    .option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name())
    .option("hoodie.datasource.write.operation", WriteOperationType.INSERT.value())
    .option(KeyGeneratorOptions.PARTITIONPATH_FIELD_NAME.key(), "ts_date")
    .option("checkpointLocation", newCheckPoint)
    .option(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key(), "col1,col2,col3,col4")
    .option(HoodieWriteConfig.PRECOMBINE_FIELD_NAME.key(), "time_stamp")
    .option(INDEX_TYPE.key(), GLOBAL_BLOOM.name())
    // Clustering + Compaction config
    .option(ASYNC_CLUSTERING_ENABLE.key(), "true")
    .option("hoodie.clustering.plan.strategy.max.bytes.per.group", "524288000")
    // Metadata Config
    .option(ENABLE_METADATA_INDEX_COLUMN_STATS.key(), "true")
    .option(ASYNC_INDEX_ENABLE.key(), "true")
    .option(ASYNC_CLEAN.key(), "true")
    .outputMode(OutputMode.Append());
```
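One thing worth checking: the config above never sets `hoodie.metadata.enable` or `hoodie.embed.timeline.server` explicitly, so it relies on defaults. As a minimal sketch (not a confirmed fix), the options below pin both settings so that file listing is served from the metadata table and the driver-side timeline server, rather than falling back to direct S3 `LIST` calls. The key names are the string forms of the Hudi 0.15.0 config constants; the helper class and method names here are hypothetical, introduced only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the listing-related writer options so they
// can be applied in one place (e.g. passed to df.writeStream().options(...)).
public class MetadataOptions {

    static Map<String, String> metadataOptions() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Default is true in 0.15.0, but setting it explicitly rules out
        // accidental disablement (e.g. by a cluster-level override).
        opts.put("hoodie.metadata.enable", "true");
        // Keep the embedded timeline server on so the driver caches the
        // file-system view instead of each executor listing S3 itself.
        opts.put("hoodie.embed.timeline.server", "true");
        return opts;
    }

    public static void main(String[] args) {
        metadataOptions().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If the log line "Embedded Timeline Server is disabled" still appears after the first micro-batch with these set, that would suggest the writer is re-initializing its client between batches, which is worth calling out in the report.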
