dataproblems commented on issue #12116:
URL: https://github.com/apache/hudi/issues/12116#issuecomment-2460919696
Followed up with @ad1happy2go in Hudi Office hours and got more things to
try:
### Follow Up on Random Data: Use sort mode as `None` with timeline server
enabled:
For this follow-up action item, I noticed that the data was written to S3
but the job got stuck. I used the 37 GB dataset generated with the random data
generation script I posted earlier.
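For reference, relative to the earlier runs this experiment only toggles the sort mode and the embedded timeline server; the other bulk-insert options stay the same (a sketch of just those overrides, not my full config):

```scala
// Sketch of the two overrides for this run; sort mode NONE skips the
// pre-write sort, and the embedded timeline server stays enabled.
val sortModeOverrides: Map[String, String] = Map(
  HoodieWriteConfig.BULK_INSERT_SORT_MODE.key() -> BulkInsertSortMode.NONE.name(),
  "hoodie.embed.timeline.server" -> "true"
)
```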
### Follow Up on Random Data: Use sort mode as `None` and increase `hoodie.metadata.record.index.min.filegroup.count` from 10 (default) to 10000
For this experiment, I used the following config:
```scala
val bulkWriteOptions: Map[String, String] = Map(
  DataSourceWriteOptions.OPERATION.key() -> DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL,
  HoodieStorageConfig.PARQUET_COMPRESSION_CODEC_NAME.key() -> "snappy",
  HoodieStorageConfig.PARQUET_MAX_FILE_SIZE.key() -> "2147483648",
  "hoodie.parquet.small.file.limit" -> "1073741824",
  HoodieTableConfig.POPULATE_META_FIELDS.key() -> "true",
  HoodieWriteConfig.BULK_INSERT_SORT_MODE.key() -> BulkInsertSortMode.NONE.name(),
  HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key() -> "true",
  HoodieIndexConfig.INDEX_TYPE.key() -> "RECORD_INDEX",
  DataSourceWriteOptions.META_SYNC_ENABLED.key() -> "false",
  "hoodie.metadata.record.index.enable" -> "true",
  "hoodie.metadata.enable" -> "true",
  "hoodie.datasource.write.hive_style_partitioning" -> "true",
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "2147483648",
  "hoodie.clustering.plan.strategy.small.file.limit" -> "1073741824",
  "hoodie.datasource.write.partitionpath.field" -> "partition",
  "hoodie.datasource.write.recordkey.field" -> "id",
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.table.name" -> tableName,
  DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key() -> classOf[SimpleKeyGenerator].getName,
  "hoodie.write.markers.type" -> "DIRECT",
  "hoodie.embed.timeline.server" -> "true",
  "hoodie.metadata.record.index.min.filegroup.count" -> "10000",
  "hoodie.metadata.record.index.max.filegroup.count" -> "100000"
)
```
I also tried with `"hoodie.metadata.record.index.min.filegroup.count" ->
"1000"` and got the same outcome.
Here are the screenshots from the Spark UI.
#### Stage View

#### Stage Detail View

#### Metrics for the completed tasks

#### Event Timeline

#### Executor Summary Tab

Here are the
[random_exp_executor_stderr.log](https://github.com/user-attachments/files/17653720/random_exp_executor_stderr.log)
and
[random_exp_executor_stdout.log](https://github.com/user-attachments/files/17653723/random_exp_executor_stdout.log)
from one of the executors with high GC time.
I see executor heartbeat timeouts here as well. Do you have any ideas
as to why we might be running into this?